How we use engagement-based embeddings to improve search and recommendation on Faire




Introduction

Faire’s wholesale marketplace connects over 600,000 independent retailers with over 85,000 brands around the world. Our Discovery Team is continuously looking to improve the platform experience for our community of global customers — including designing and implementing models and algorithms that help personalize product rankings across the search results page, category navigation, brand page, and recommendation carousels. To learn more, check out past articles from The Craft Blog about Building Faire’s new marketplace ranking infrastructure and Real-time ranking at Faire: the feature store.

While the previous articles focus on the foundations of our ranking infrastructure, in this post we’ll share more about our efforts in embeddings. Over the past decade, we’ve seen the broader industry invest heavily in this area, learning numerical vector representations of various marketplace entities through machine learning models. At Faire, we also use a broad range of embeddings to help inform Search and Recommendation, quickly surface the most relevant products customers are searching for, and inspire our retailers by suggesting popular brands and products they haven’t tried yet.

In particular, we’ll explain how we learn retailer, product, and brand embeddings, diving into the training data generation, model architecture, and loss function that deliver the best balance between performance and training time. You’ll also learn how we serve these embeddings in production to power various Search and Recommendation surfaces and drive meaningful business impact.

Factorization machine

In the past, we used LightFM to learn retailer, product, and brand embeddings. LightFM is a package that implements a factorization machine method, using both retailer-product/brand interactions and side features to generate retailer and product embeddings whose dot product (plus bias terms) approximates the likelihood of interaction between them.
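Concretely, LightFM’s score for a retailer r and product p roughly takes the form below (our paraphrase of the model, not a quote from the library’s documentation):

\hat{y}(r, p) = \mathbf{q}_r \cdot \mathbf{q}_p + b_r + b_p

where q_r and q_p are the retailer and product representations (each the sum of the embeddings of that entity’s features) and b_r, b_p are the corresponding bias terms.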

LightFM served us well in the early days. However, its black-box nature makes it hard to evolve the model architecture to accommodate the ever-growing size and complexity of Faire’s business needs. As a result, we decided to migrate from LightFM to a PyTorch stack, which gives us more transparency, flexibility, and scalability to accommodate a much larger product catalog and more model architectures to explore. Below, we’ll explain how Faire built its own PyTorch stack for embedding training, based on historical retailer<>product engagements.

Training data

To learn retailer-product embeddings, we mainly generate training data using our onsite add-to-cart, checkout, and product page visit signals as positive samples, in tabular form:
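For illustration only (the column names below are our assumption, not the production schema), each row represents one positive retailer<>product engagement:

retailer_id | product_id | signal       | label
----------- | ---------- | ------------ | -----
1024        | 98765      | add_to_cart  | 1
1024        | 55501      | checkout     | 1
2048        | 77210      | page_visit   | 1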

Here, it might seem like we only generate positive labels/signals in our training data. What about negative samples? For those, we do in-batch shuffling during the training process to generate negative samples on the fly. This significantly simplifies training data generation and improves training pipeline running time, since it can all be done in memory. (We will explore more sophisticated sampling techniques as next steps.)

In the loss function section below, we’ll explain in detail how we carry out the in-batch random negative sampling.

Model architecture

We implemented our model training in PyTorch. In particular, we use a typical two-tower model architecture (inspired by the work from YouTube), with a retailer tower and a product tower for embedding generation, and a final layer that calculates cosine similarity to approximate retailer<>product preference.

Below is the simplified model architecture. We’ll explain in detail the retailer/product embedding and sparse feature embedding logic in PyTorch.
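To make the structure concrete, here is a minimal two-tower sketch (class, method, and dimension names are illustrative assumptions, not Faire’s production code); side features are omitted here and covered below:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    def __init__(self, n_retailer: int, n_product: int, embedding_length: int = 64):
        super().__init__()
        # One embedding look-up table per tower
        self.retailer_embeddings = nn.Embedding(n_retailer, embedding_length)
        self.product_embeddings = nn.Embedding(n_product, embedding_length)

    def forward(self, retailer_ids: torch.Tensor, product_ids: torch.Tensor) -> torch.Tensor:
        retailer_vecs = self.retailer_embeddings(retailer_ids)  # [B, D]
        product_vecs = self.product_embeddings(product_ids)     # [B, D]
        # Cosine similarity in the final layer approximates retailer<>product preference
        return F.cosine_similarity(retailer_vecs, product_vecs, dim=-1)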

Retailer and product embedding initialization and lookup

We leverage the PyTorch embedding module for retailer and product embedding initialization and look-up. Below is a code snippet for the initialization. Essentially, the retailer embedding look-up table is a matrix of shape [n_retailer, embedding_length]; similarly, the product embedding look-up table has shape [n_product, embedding_length].

# Retailer look-up table: [n_retailer, embedding_length]
self.retailer_embeddings = nn.Embedding(self.n_retailer, embedding_length).to(
    self.device
)
# Product look-up table: [n_product, embedding_length]
self.product_embeddings = nn.Embedding(self.n_product, embedding_length).to(
    self.device
)

Sparse features

In addition to the retailer and product embeddings themselves, other features can also help the model learn the latent representations, such as the product’s brand, taxonomy type, and country.

To generalize the nn.Embedding setup used for retailers and products, we also apply embedding initialization to these so-called side features and pool the resulting embeddings together to form the final product embedding. With this design, we can take advantage of the highly optimized nn.Embedding module to accommodate high-cardinality side features, such as taxonomy types.

The code to set up side features with nn.Embedding is also pretty straightforward; we only need to know the product<>sparse_feature mapping. One detail here is that we use nn.ModuleDict to register this map of embeddings.

def _initialize_item_sparse_feature_bias_embedding(self):
    sparse_embeddings = dict()
    biases = dict()
    for f in self.item_sparse_features:
        sparse_feature_cardinality = len(self.sparse_feature_idx_mappers[f])
        # One embedding table (and one scalar bias table) per sparse feature
        sparse_embeddings[f] = nn.Embedding(
            sparse_feature_cardinality, self.embedding_length
        ).to(self.device)
        biases[f] = nn.Embedding(sparse_feature_cardinality, 1).to(self.device)
    # Use ModuleDict to register embeddings to model parameters
    return nn.ModuleDict(sparse_embeddings), nn.ModuleDict(biases)
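As a hedged illustration of the pooling described above (the method body, self.sparse_embeddings, and self.product_sparse_feature_ids are our assumptions; a function of this name also appears in the loss pseudo-code below), the side-feature embeddings can simply be summed with the base product embedding to form the final product representation:

def _gen_item_embeddings(self, product_ids):
    # Base product embedding from the look-up table: [B, D]
    item_vecs = self.product_embeddings(product_ids)
    # Pool in each sparse side-feature embedding (e.g. brand, taxonomy type)
    for f in self.item_sparse_features:
        feature_ids = self.product_sparse_feature_ids[f][product_ids]
        item_vecs = item_vecs + self.sparse_embeddings[f](feature_ids)
    return item_vecs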

Loss function

Weighted approximate ranking pairwise (WARP) loss

Now that we know how to calculate the retailer<>product affinity score, we’ll explain how we calculate the ranking loss for a positive retailer<>product engagement pair.

Intuitively, this loss function contrasts the embedding of an engaged product with that of a never-engaged one, so that in the latent embedding space an engaged retailer<>product pair ends up closer together while an unengaged pair ends up further apart.
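For reference, the standard WARP formulation (Weston et al., 2011; our sketch, not Faire’s exact implementation) for a positive pair (r, p^+) and a sampled violating negative p^- is roughly:

\mathcal{L}_{\mathrm{WARP}} \approx \Phi\big(\mathrm{rank}(p^+)\big) \cdot \big[\, 1 - s(r, p^+) + s(r, p^-) \,\big]_+ , \qquad \Phi(k) = \sum_{j=1}^{k} \frac{1}{j}

where s(·,·) is the retailer<>product affinity score and rank(p^+) is estimated from the number of sampling attempts needed to find a violating negative.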

Modified WARP loss

When implementing model training in PyTorch using the WARP loss definition above, we realized that the algorithm would:

  1. randomly sample another product that the retailer has never engaged with before, and
  2. keep iterating until it finds a negative sample whose similarity is higher.

This would make training unscalable and time-consuming:

  1. If we implemented negative sampling during the training data generation step, it would significantly increase training data generation time and size.
  2. If we implemented it during model training, we could not guarantee the “never engaged before” condition, since that would require keeping all historical retailer<>product engagements in memory, which does not fit.
  3. Lastly, the iterative sampling process itself would increase training time.

To address these problems, we implemented the following modification. Through offline iteration, we found that in-batch shuffling produces both a good approximation of the exact loss and a reasonable training time. Below is the pseudo-code illustrating the in-batch random permutation:

for i in range(0, K):
    # Shuffle product ids within the batch to generate (likely) negative pairs
    shuffled_product_ids = batch.product_ids[torch.randperm(len(batch.product_ids))]
    negative_product_embeddings = self._gen_item_embeddings(shuffled_product_ids)
    # Element-wise product; summed over the embedding dimension downstream to form the score
    negative_prediction = user_embeddings * negative_product_embeddings
    predictions.append(negative_prediction.unsqueeze(1))

As an example, consider a batch size of 10 and a shuffle count of K = 3.

After shuffling, the shuffled product is most likely one that the retailer has never engaged with before. In reality, the retailer may occasionally have already engaged with a shuffled product; through offline iteration and analysis, we verified that random permutation generates high-quality embeddings even with these potential false negatives. We shuffle three times, meaning we cap the iteration steps at 3. This gives us the modified weighted approximate ranking loss as follows:
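One plausible way to write such a modified loss with K in-batch negatives (a sketch under our own assumptions, not necessarily the exact production formula) is:

\hat{r} = \sum_{k=1}^{K} \mathbb{1}\big[\, 1 - s(r, p^+) + s(r, p_k^-) > 0 \,\big], \qquad \mathcal{L} \approx \Phi(\hat{r}) \cdot \frac{1}{K} \sum_{k=1}^{K} \big[\, 1 - s(r, p^+) + s(r, p_k^-) \,\big]_+

where p_1^-, …, p_K^- are the K in-batch shuffled negatives and \Phi is the same rank-weighting function as above.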

Model training and serving

Now that we’ve walked through training data generation, model architecture, and loss function, we have all the ingredients for final model training.

  1. We build the training script as a Docker image and run it in Airflow every other day.
  2. We store the trained retailer, product, and brand embeddings in our feature store, where they can be looked up online.
  3. For all active brands, we store the top similar brands, calculated from brand<>brand embedding similarity. The top-K functionality comes from an offline embedding indexer such as Annoy or Faiss (see the sketch below).
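As a minimal sketch of that offline top-K step using Annoy (the brand embedding source, index parameters, and K are illustrative assumptions):

from annoy import AnnoyIndex

EMBEDDING_LENGTH = 64
TOP_K = 20

brand_embeddings = load_brand_embeddings()  # assumed helper: {brand_id: [float, ...]}
index = AnnoyIndex(EMBEDDING_LENGTH, "angular")  # angular distance ~ cosine similarity
brand_ids = list(brand_embeddings.keys())
for i, brand_id in enumerate(brand_ids):
    index.add_item(i, brand_embeddings[brand_id])
index.build(50)  # 50 trees

# Precompute the top similar brands for every active brand
similar_brands = {
    brand_id: [brand_ids[j] for j in index.get_nns_by_item(i, TOP_K + 1) if j != i]
    for i, brand_id in enumerate(brand_ids)
}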

Embedding applications

There are two major types of embedding applications at Faire:

  1. Candidate generation with nearest neighbor (NN) search
  2. Embedding-similarity-score-based retrieval and ranking

We’ll give four examples that cover both application types across Faire’s main discovery surfaces: category navigation, search, brand pages, and the homepage.

Category navigation and search

At Faire, the category page is driven by personalization. Adding retailer<>product embedding similarities as a personalization signal in the retrieval step would help retailers find the most desired products and therefore drive a lift in GMV.

While engagement-based embeddings can be a powerful lever in retrieval, computing embedding cosine similarities is extremely resource-intensive, especially with millions of products sold on Faire. Ideally, Approximate Nearest Neighbor (ANN) retrieval techniques would be the best option to support a full product scan; in our case, the current retrieval infrastructure does not yet support full ANN retrieval. We found a workaround by utilizing the built-in Elasticsearch rescore functionality, which takes a rescore script and executes it on the top candidates ranked by the first-pass retrieval score.

Another key parameter in this experiment is the rescore window size, which we chose carefully to balance the trade-off between retrieval result quality and latency. After latency benchmarking, we landed on rescoring the top 100k candidates for each request, which adds only 20ms of latency overhead to the retrieval step while still being large enough for the embeddings to show their power.
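As a hedged sketch of what such a rescore clause could look like (the index fields, first-pass query, weights, and the retailer embedding variable are all assumptions, not our production query), using Elasticsearch’s script_score over a dense_vector field:

# Illustrative Elasticsearch request body built in Python (field names are assumed)
# retailer_embedding: the retailer's embedding vector (list of floats) looked up
# from the feature store at request time
search_body = {
    "query": {"match": {"category": "home-decor"}},  # first-pass retrieval
    "rescore": {
        "window_size": 100000,  # rescore the top 100k candidates
        "query": {
            "rescore_query": {
                "script_score": {
                    "query": {"match_all": {}},
                    "script": {
                        # Cosine similarity between the retailer embedding and the
                        # product's dense_vector field
                        "source": "cosineSimilarity(params.retailer_embedding, 'product_embedding') + 1.0",
                        "params": {"retailer_embedding": retailer_embedding},
                    },
                }
            },
            "query_weight": 1.0,
            "rescore_query_weight": 2.0,
        },
    },
}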

We launched the experiment with a control group using Product Quality Score only and a treatment group using retailer<>product similarity for second-pass rescoring. The flowchart below illustrates the high-level retrieval process:

We are also experimenting with adding retailer<>product embedding similarity directly into our Search ranker. In particular, in our offline analysis we observed that this embedding similarity is, by a wide margin, the most important of the new features we introduced!

Brand page

Another way we leverage the embeddings is through our recommendation use case. When a retailer visits a particular brand page, we feature and recommend similar brands to help them discover additional products.

For this use case, we find the brands most similar to the current brand using the brand<>brand similarity score calculated from brand embeddings, and store the top candidates in DynamoDB keyed by brand id. Then, at online serving time, we retrieve the similar brands for the current brand from DynamoDB and rank them in real time for the final display.
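A minimal sketch of that online lookup with boto3, assuming a DynamoDB table keyed by brand_id that stores the precomputed similar-brand list (table and attribute names are assumptions):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("similar_brands")  # assumed table name

def get_similar_brands(brand_id: str) -> list:
    # Fetch the precomputed top similar brands for the brand being viewed;
    # a real-time ranker then orders them for the final display.
    response = table.get_item(Key={"brand_id": brand_id})
    return response.get("Item", {}).get("similar_brand_ids", [])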

Homepage

We’re also leveraging embeddings on our homepage, where for each individual carousel we rank products and brands by their retailer<>product/brand embedding similarity score. Below, we showcase a food and beverage retailer’s homepage, fully personalized for them.
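The per-carousel ranking itself can be as simple as sorting candidates by cosine similarity to the retailer embedding; here is a minimal sketch (function and variable names are illustrative):

import torch
import torch.nn.functional as F

def rank_carousel(retailer_embedding, candidate_embeddings, candidate_ids):
    # retailer_embedding: [D], candidate_embeddings: [N, D]
    scores = F.cosine_similarity(candidate_embeddings, retailer_embedding.unsqueeze(0), dim=-1)
    order = torch.argsort(scores, descending=True)
    return [candidate_ids[i] for i in order.tolist()]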

Conclusion

In this article, we walked through training data generation and our choice of loss function, and showed how we construct a two-tower model to learn retailer and product embeddings from historical onsite retailer<>product engagements. We also covered several embedding applications across our major discovery surfaces, which ultimately help our customers grow their businesses: helping retailers find more personalized products through Search ranking and Category Navigation retrieval, making it easier for them to discover new brands they might be interested in, and personalizing homepages based on retailer profiles. Overall, we showed that embeddings are a crucial component of modern Search and Recommendation systems, and with the current deep learning ecosystem, it’s never been easier to set one up for your own use case!

Shout outs

Thank you to this group of talented Data Scientists and Engineers for this amazing work — Sam Kenny, Qinyu Wang, Xiaomeng Hu, Wei Jiang, to name a few.

We’ve only just begun to scratch the surface of embedding learning and applications. The team has also shipped query<>product embeddings trained on query and product engagements, and multi-modal embeddings trained on product descriptions and product images. Stay tuned for future blog posts to learn more about how we’re using data and machine learning to power Faire’s platform!

If you want to join a world-class Data Team working on innovative challenges to help entrepreneurs around the world chase their dreams — join us! www.faire.com/careers
