Multilingual Text Similarity Matching using Embedding




Using sentence-transformers for symmetric semantic search


Today's topic is calculating the similarity score between two sentences in the same or different languages. We will be utilizing the sentence-transformers framework, which comes with its own pre-trained multilingual transformer models.

We can use these models to compute text embeddings for more than 50 languages. The output embeddings can then be used for symmetric semantic search.

Differences between symmetric and asymmetric semantic search

Symmetric semantic search focuses on finding similar questions from a corpus based on input queries. For example, given “How to learn artificial intelligence online?” as the input query, the expected output should be something like “How to learn AI on the web?” Most of the time, you could swap the queries and the corpus and still end up with the same pairings as output. Symmetric semantic search is mostly used for text mining or intent classification tasks.

On the other hand, asymmetric semantic search revolves around finding answers from a corpus based on input queries. For example, given “What is AI?” as the input query, you would expect the output to be something like “AI is a technology that mimics human intelligence to perform tasks. It can learn and improve its knowledge based on the information obtained.” The input queries are not limited to questions; they can also be keywords or short phrases. Asymmetric semantic search is suitable for search engine related tasks.

At the time of this writing, the sentence-transformer framework provides the following pre-trained models meant for multilingual symmetric semantic search:

  • distiluse-base-multilingual-cased-v1 — Multi-Lingual model of Universal Sentence Encoder for 15 languages.
  • distiluse-base-multilingual-cased-v2 — Multi-Lingual model of Universal Sentence Encoder for 50 languages.
  • paraphrase-multilingual-MiniLM-L12-v2 — Multi-lingual model of paraphrase-MiniLM-L12-v2, extended to 50+ languages.
  • paraphrase-multilingual-mpnet-base-v2 — Multi-lingual model of paraphrase-mpnet-base-v2, extended to 50+ languages.

Practically, we can utilize these models to calculate the similarity between an English sentence and a Spanish sentence. For example, given the following sentences in our corpus:

What are you doing?
I am a boy
Can you help me?
A woman is playing violin.
The quick brown fox jumps over the lazy dog

and the input query as follows:

Qué estás haciendo

The sentence with the highest similarity score should be:

What are you doing?

For simplicity, the workflow for our symmetric semantic search is as follows:

  1. Compute the embeddings for both the query and the corpus text
  2. Calculate the cosine similarity between both embeddings
  3. Find the top 5 indices with the highest similarity scores

Setup

Before that, let’s create a new virtual environment and install all the necessary packages.

Install with pip

You can easily install the sentence-transformers package:

pip install -U sentence-transformers

Install with conda

As for Anaconda users, you can install the package directly as follows:

conda install -c conda-forge sentence-transformers
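
To confirm that the installation succeeded, you can print the package version (an optional sanity check, not part of the original setup steps):

import sentence_transformers
print(sentence_transformers.__version__)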

Proceed to the next section for the implementation.

Implementation

In your working directory, create a new Python file called main.py.

Import

Add the following import statement at the top of the file:

from sentence_transformers import SentenceTransformer, util
import torch

Model initialization

Then, initialize the model by calling the SentenceTransformer class and pass in the name of your desired model:

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

During the initial run, the module will download and cache the pre-trained model files in the following directory:

# linux
~/.cache/huggingface/transformers
# windows (replace username with your username)
C:\Users\<username>\.cache\huggingface\transformers

You can change the cache folder, for example to the current working directory, as follows:

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2', cache_folder='./')

For production, you should move the model files into the working directory and load the model locally. For example, given that the model files are located in the models folder, you can initialize your model as follows:

model = SentenceTransformer('models/sentence-transformers_paraphrase-multilingual-MiniLM-L12-v2')

If you are testing on a CPU-only machine, simply set the device argument to cpu:

model = SentenceTransformer('models/sentence-transformers_paraphrase-multilingual-MiniLM-L12-v2', device='cpu')
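
If you are unsure which device will be available at runtime, a small sketch (using torch, which sentence-transformers depends on anyway) for picking it automatically would be:

import torch
from sentence_transformers import SentenceTransformer

# fall back to CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2', device=device)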

Corpus and queries

Next, initialize the data for your corpus and queries. In this case, I have a list of 7 strings as the corpus data, while queries contains a list of 3 strings in different languages (English, Chinese, and Spanish).

corpus = [
'I am a boy',
'What are you doing?',
'Can you help me?',
'A man is riding a horse.',
'A woman is playing violin.',
'A monkey is chasing after a goat',
'The quick brown fox jumps over the lazy dog'
]
# queries in English, Chinese ("I am a boy") and Spanish ("What are you doing")
queries = ['I am in need of assistance', '我是男孩子', 'Qué estás haciendo']

Encode data into embedding

Call the encode function to convert the corpus into embeddings. Set the convert_to_tensor argument to True to get a PyTorch tensor as output. Also, initialize a new variable called top_k and assign it the minimum of 5 and the total length of the corpus. We will use this variable later on to get the indices with the highest similarity scores.

corpus_embedding = model.encode(corpus, convert_to_tensor=True)
top_k = min(5, len(corpus))

The encode function accepts a list of strings or a single string as input.
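
For instance, encoding a single string returns one vector, while encoding a list returns one vector per item (the 384 dimension below applies to the MiniLM model used in this tutorial):

# a single string yields a 1-D tensor
single = model.encode('Can you help me?', convert_to_tensor=True)
print(single.shape)  # torch.Size([384])

# a list of strings yields a 2-D tensor, one row per sentence
batch = model.encode(['Can you help me?', 'I am a boy'], convert_to_tensor=True)
print(batch.shape)   # torch.Size([2, 384])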

Calculate cosine-similarity

The final step is to loop through all the items in queries and perform the following actions:

  • calculate the embedding for a single query. Each embedding has the following shape: torch.Size([384])
  • call the util.cos_sim function to get the similarity score between the input query and the corpus
  • call the torch.topk function to get the topk results
  • print out the output for reference

for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)

    cos_scores = util.cos_sim(query_embedding, corpus_embedding)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("Query:", query)
    print("---------------------------")
    for score, idx in zip(top_results[0], top_results[1]):
        print(f'{round(score.item(), 3)} | {corpus[idx]}')

The top_results variable is a named tuple containing:

  • a tensor of similarity scores between the input query and the corpus, sorted in descending order
tensor([ 0.3326,  0.2809,  0.2258, -0.0133, -0.0333])
  • a tensor of the corresponding corpus indices
tensor([2, 0, 1, 4, 3])
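
Since torch.topk returns a named tuple, an equivalent and slightly more readable way to unpack it is by attribute instead of by index:

for score, idx in zip(top_results.values, top_results.indices):
    print(f'{round(score.item(), 3)} | {corpus[idx]}')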

Putting the snippets above together, the complete main.py should look like this:
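
from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

corpus = [
'I am a boy',
'What are you doing?',
'Can you help me?',
'A man is riding a horse.',
'A woman is playing violin.',
'A monkey is chasing after a goat',
'The quick brown fox jumps over the lazy dog'
]
# queries in English, Chinese ("I am a boy") and Spanish ("What are you doing")
queries = ['I am in need of assistance', '我是男孩子', 'Qué estás haciendo']

# encode the corpus once and reuse the embeddings for every query
corpus_embedding = model.encode(corpus, convert_to_tensor=True)
top_k = min(5, len(corpus))

for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)

    cos_scores = util.cos_sim(query_embedding, corpus_embedding)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("Query:", query)
    print("---------------------------")
    for score, idx in zip(top_results[0], top_results[1]):
        print(f'{round(score.item(), 3)} | {corpus[idx]}')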

Output

You should get the following output on your terminal when you run the script:

Query: I am in need of assistance
---------------------------
0.333 | Can you help me?
0.281 | I am a boy
0.226 | What are you doing?
-0.013 | A woman is playing violin.
-0.033 | A man is riding a horse.
Query: 我是男孩子
---------------------------
0.919 | I am a boy
0.343 | What are you doing?
0.192 | Can you help me?
0.058 | A monkey is chasing after a goat
-0.001 | The quick brown fox jumps over the lazy dog
Query: Qué estás haciendo
---------------------------
0.952 | What are you doing?
0.396 | I am a boy
0.209 | Can you help me?
0.037 | A woman is playing violin.
0.032 | The quick brown fox jumps over the lazy dog

Optimization

The implementation above works great for a small corpus (below 1 million items). For a larger corpus, the execution will be relatively slow. Hence, we need to optimize the implementation so that it still works seamlessly. Some of the most popular optimization techniques include:

  • normalize the embeddings and use the dot product as the score function
  • use approximate nearest neighbor search to partition the corpus into smaller fractions of similar embeddings

To keep it simple and short, this tutorial will only cover the first technique. When you normalize the embeddings, each output vector will have a length of 1. As a result, we can calculate the similarity score using the dot product instead of cosine similarity. The dot product is a lot faster, and you will end up with the same similarity scores.
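
As a quick illustration (a small sketch, not from the original article), cosine similarity and the dot product produce the same score once the embeddings have been normalized:

emb = model.encode(['Can you help me?', 'I am a boy'], convert_to_tensor=True)
emb = util.normalize_embeddings(emb)

# both calls return the same value because the vectors now have length 1
print(util.cos_sim(emb[0], emb[1]))
print(util.dot_score(emb[0], emb[1]))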

Normalize the embedding

There are two ways to normalize the embeddings. The first method is to set the normalize_embeddings argument to True when calling the encode function:

corpus_embedding = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

Alternatively, you can utilize the util.normalize_embeddings function to normalize an existing embedding:

corpus_embedding = model.encode(corpus, convert_to_tensor=True)
corpus_embedding = util.normalize_embeddings(corpus_embedding)

Calculate dot product

Call the util.semantic_search function and pass in util.dot_score as the score_function argument. It returns one list of results per input query; each result is a dictionary with the keys corpus_id and score, and each list is sorted by decreasing score.

hits = util.semantic_search(query_embedding, corpus_embedding, score_function=util.dot_score)

Upon modification, the new implementation should look roughly as follows (the query embeddings are normalized here as well, so that the dot-product scores match the cosine-similarity scores from the first version):
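
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

corpus = [
'I am a boy',
'What are you doing?',
'Can you help me?',
'A man is riding a horse.',
'A woman is playing violin.',
'A monkey is chasing after a goat',
'The quick brown fox jumps over the lazy dog'
]
queries = ['I am in need of assistance', '我是男孩子', 'Qué estás haciendo']

# normalize the corpus embeddings so that the dot product equals cosine similarity
corpus_embedding = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
top_k = min(5, len(corpus))

for query in queries:
    # the query embedding must be normalized as well for the scores to match
    query_embedding = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

    hits = util.semantic_search(query_embedding, corpus_embedding, top_k=top_k, score_function=util.dot_score)

    print("Query:", query)
    print("---------------------------")
    for hit in hits[0]:
        print(f"{round(hit['score'], 3)} | {corpus[hit['corpus_id']]}")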

You should get the same output as the first implementation when you run the script:

Query: I am in need of assistance
---------------------------
0.333 | Can you help me?
0.281 | I am a boy
0.226 | What are you doing?
-0.013 | A woman is playing violin.
-0.033 | A man is riding a horse.
Query: 我是男孩子
---------------------------
0.919 | I am a boy
0.343 | What are you doing?
0.192 | Can you help me?
0.058 | A monkey is chasing after a goat
-0.001 | The quick brown fox jumps over the lazy dog
Query: Qué estás haciendo
---------------------------
0.952 | What are you doing?
0.396 | I am a boy
0.209 | Can you help me?
0.037 | A woman is playing violin.
0.032 | The quick brown fox jumps over the lazy dog

Conclusion

Let’s recap what you have learned today.

This article started off with a brief introduction to the sentence-transformers module. Then, it compared the differences between symmetric and asymmetric semantic search.

Subsequently, it covered the setup and installation. sentence-transformers can be installed with pip or conda.

In the implementation section, this article highlighted the steps to encode the corpus into embeddings, as well as the similarity score calculation using cosine similarity.

The final section discussed optimization techniques. One of them is to normalize the embeddings to length 1 and then calculate the similarity scores using the dot product.

Thanks for reading this piece. Have a great day ahead!
