Let’s Free Ecommerce From Spam Products


6. Modelling and Predictions

refer: Transfer Learning

refer: Repository of Feature vectors

refer: EfficientNet Features

  • As part of this case study, we used transfer learning to obtain embeddings for images and text, and then used those embeddings to find the most similar products. TensorFlow Hub was very helpful for getting the feature vectors.
  • We used state-of-the-art deep learning models: EfficientNet and its variants for images, and BERT variants for text.
  • We also tried techniques such as ResNets and InceptionV3, but they did not show significant results; EfficientNetB2 works well for images. For text, we tried TF-IDF, BERT (uncased), and Siamese BERT, but TF-IDF performed outstandingly here.
  • We used the RAPIDS library for faster training, KNN to retrieve similar products from image embeddings, and pairwise distances for text.
  • From the EDA, we learned that products with the same phash tend to be similar; leveraging this increased the number of correct identifications.
  • Finally, we combined the predictions from images, text, and phashes.
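The retrieval-and-union idea in the bullets above can be sketched in plain NumPy (dummy embeddings and hypothetical posting ids; the real pipeline uses RAPIDS KNN and TF-IDF features): take the k nearest neighbours in image-embedding space, the k smallest pairwise distances in text space, exact phash matches, and return the union.

```python
import numpy as np

def combine_predictions(img_emb, img_index, txt_vec, txt_index,
                        phash, phash_index, ids, k=3):
    """Union of image-KNN, text-distance, and phash-match posting ids."""
    # Image branch: k nearest neighbours by Euclidean distance.
    img_d = np.linalg.norm(img_index - img_emb, axis=1)
    img_ids = ids[np.argsort(img_d)[:k]]
    # Text branch: k smallest pairwise distances to the query vector.
    txt_d = np.linalg.norm(txt_index - txt_vec, axis=1)
    txt_ids = ids[np.argsort(txt_d)[:k]]
    # Phash branch: products sharing the exact perceptual hash.
    ph_ids = ids[phash_index == phash]
    # np.unique also sorts and deduplicates the combined ids.
    return np.unique(np.concatenate([img_ids, txt_ids, ph_ids]))
```

The union (rather than an intersection) is deliberate: each modality catches matches the others miss, which is why combining the three improved the correct identifications.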

Summary of Results


API Creation and Sample Predictions

import re

import cv2
import joblib
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import pairwise_distances
from tensorflow.keras.models import load_model

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def get_similar_images(img_path, title, phash, num_matches):
    # Step 1: read the image and resize it to the EfficientNetB2 input size
    sample_image = np.zeros((1, 224, 224, 3), dtype='float32')
    image = cv2.imread(img_path)
    sample_image[0] = cv2.resize(image, (224, 224))

    # Step 2: get the image embedding from the saved EfficientNetB2 model
    model_effB2 = load_model('/content/drive/MyDrive/AppliedAI/DocumentClassification/efficientB2.h5')
    embeds_image = model_effB2.predict(sample_image, use_multiprocessing=True, workers=4, verbose=3)

    # Step 3: preprocess the title -- lower-case, expand contractions,
    # strip punctuation, remove stop words, and lemmatize
    Input_Text = title.lower()
    Input_Text = re.sub(r'[\n\t\r\\-]+', ' ', Input_Text)
    Input_Text = re.sub(r"won't", "will not", Input_Text)
    Input_Text = re.sub(r"can't", "can not", Input_Text)
    Input_Text = re.sub(r"n't", " not", Input_Text)
    Input_Text = re.sub(r"'re", " are", Input_Text)
    Input_Text = re.sub(r"'s", " is", Input_Text)
    Input_Text = re.sub(r"'d", " would", Input_Text)
    Input_Text = re.sub(r"'ll", " will", Input_Text)
    Input_Text = re.sub(r"'t", " not", Input_Text)
    Input_Text = re.sub(r"'ve", " have", Input_Text)
    Input_Text = re.sub(r"'m", " am", Input_Text)
    Input_Text = re.sub(r"[^0-9a-zA-Z_]+", ' ', Input_Text)
    new_Text = ' '.join(
        lemmatizer.lemmatize(word)
        for word in nltk.word_tokenize(Input_Text)
        if word not in stop_words
    )

    # Step 4: load the TF-IDF vectorizer and the precomputed title features,
    # then vectorize the preprocessed title (not the raw one)
    tfidf_title_vectorizer = joblib.load('/content/drive/MyDrive/AppliedAI/DocumentClassification/tfidf_vectorizer.pickle')
    tfidf_title_features = joblib.load('/content/drive/MyDrive/AppliedAI/DocumentClassification/all_title_features.pickle')
    embeds_texts = tfidf_title_vectorizer.transform([new_Text])

    # Step 5: load the KNN model fitted on the EfficientNetB2 embeddings
    knn_model = joblib.load('/content/drive/MyDrive/AppliedAI/DocumentClassification/knn_effenet_model.pkl')

    # Step 6: posting ids of visually similar products (distance threshold of 5)
    img_distances, img_indices = knn_model.kneighbors(embeds_image)
    index = np.where(img_distances[0] < 5)[0]
    post_ids = img_indices[0][index]
    pred_images = train_data.iloc[post_ids]['posting_id'].values

    # Step 7: posting ids of textually similar products
    pair_distances = pairwise_distances(tfidf_title_features, embeds_texts).flatten()
    indices_labels = np.argsort(pair_distances)[:num_matches]
    pred_titles = train_data.iloc[indices_labels]['posting_id'].values

    # Step 8: posting ids sharing the same perceptual hash
    pred_image_phash = train_data[train_data['image_phash'] == phash]['posting_id'].values

    # Return the unique union of the three prediction sets
    return np.unique(np.concatenate([pred_titles, pred_image_phash, pred_images]))
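The title normalization in Step 3 can be isolated into a small, dependency-free helper (a regex-only sketch; the NLTK stop-word removal and lemmatization are omitted here because they need downloaded corpora):

```python
import re

def normalize_title(title: str) -> str:
    """Lower-case, expand common contractions, and strip punctuation."""
    text = title.lower()
    # Collapse newlines, tabs, backslashes, and hyphens to spaces.
    text = re.sub(r'[\n\t\r\\-]+', ' ', text)
    # Expand contractions; irregular forms must come before the "n't" rule.
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'s", " is", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'m", " am", text)
    # Replace every run of non-alphanumeric characters with a single space.
    text = re.sub(r"[^0-9a-zA-Z_]+", ' ', text)
    return text.strip()
```

For example, `normalize_title("Won't Ship! It's BRAND-NEW")` yields `"will not ship it is brand new"`, which is then tokenized and fed to the TF-IDF vectorizer.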
Prediction 2
Prediction 3

From the above, it is clear that our first-cut approach is doing reasonably well.


