Multilingual Text Similarity Matching using Embedding
Using sentence-transformer for symmetric semantic search
Today's topic is calculating the similarity score between two sentences of the same or different languages. We will be utilizing the sentence-transformers framework, which comes with its own pre-trained multilingual transformer models.
We can make use of these models to compute text embeddings for more than 50 languages. The output embeddings can then be used for symmetric semantic search.
Differences between symmetric and asymmetric semantic search
Symmetric semantic search focuses on finding similar questions from a corpus based on input queries. For example, given “How to learn artificial intelligence online?” as the input query, the expected output should be something like “How to learn AI on the web?” Most of the time, you could potentially flip over the data in the queries and corpus and still end up with the same pairings as output. Symmetric semantic search is mostly used for text mining or intent classification tasks.
On the other hand, asymmetric semantic search revolves around finding answers from a corpus based on input queries. For example, given “What is AI?” as the input query, you would expect the output to be something like “AI is a technology that mimics human intelligence to perform tasks. They can learn and improve their knowledge based on the information obtained.” The input queries are not limited to just questions. They can be keywords or short phrases. Asymmetric semantic search is suitable for search engine related tasks.
At the time of this writing, the sentence-transformers framework provides the following pre-trained models meant for multilingual symmetric semantic search:
- distiluse-base-multilingual-cased-v1 — Multilingual model of the Universal Sentence Encoder for 15 languages.
- distiluse-base-multilingual-cased-v2 — Multilingual model of the Universal Sentence Encoder for 50 languages.
- paraphrase-multilingual-MiniLM-L12-v2 — Multilingual model of paraphrase-MiniLM-L12-v2, extended to 50+ languages.
- paraphrase-multilingual-mpnet-base-v2 — Multilingual model of paraphrase-mpnet-base-v2, extended to 50+ languages.
Practically, we can utilize these models to calculate the similarity between an English sentence and a Spanish sentence. For example, given the following sentences in our corpus:
What are you doing?
I am a boy
Can you help me?
A woman is playing violin.
The quick brown fox jumps over the lazy dog
and the input query as follows:
Qué estás haciendo
The sentence with the highest similarity score should be:
What are you doing?
For simplicity, the workflow for our symmetric semantic search is as follows:
- Compute the embeddings for both the query and the corpus text
- Calculate the cosine similarity between both embeddings
- Find the top 5 indexes with the highest similarity scores
Setup
Before that, let’s create a new virtual environment and install all the necessary packages.
Install with pip
You can easily install the sentence-transformers package:
pip install -U sentence-transformers
Install with conda
As for Anaconda users, you can install the package directly as follows:
conda install -c conda-forge sentence-transformers
Proceed to the next section for the implementation.
Implementation
In your working directory, create a new Python file called main.py.
Import
Add the following import statement at the top of the file:
from sentence_transformers import SentenceTransformer, util
import torch
Model initialization
Then, initialize the model by calling the SentenceTransformer class and pass in the name of your desired model:
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
During the initial run, the module will download the pre-trained model files as cache in the following directory:
# linux
~/.cache/huggingface/transformers

# windows (replace <username> with your username)
C:\Users\<username>\.cache\huggingface\transformers
You can modify the cache folder to the current working directory as follows:
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2', cache_folder='./')
For production, you should move the model to the working directory and load it locally. For example, given that the model files are located in the models folder, you can initialize your model as follows:
model = SentenceTransformer('models/sentence-transformers_paraphrase-multilingual-MiniLM-L12-v2')
If you are testing on a CPU-only machine, simply set the device argument to cpu:
model = SentenceTransformer('models/sentence-transformers_paraphrase-multilingual-MiniLM-L12-v2', device='cpu')
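If you are unsure whether a GPU is available at runtime, a small sketch like the following picks the device automatically (the torch.cuda.is_available() check is an addition, not part of the original snippets):

import torch
from sentence_transformers import SentenceTransformer

# choose the GPU when available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2', device=device)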
Corpus and queries
Next, initialize the data for your corpus and queries. In this case, corpus is a list of 7 strings, while queries contains a list of 3 strings in different languages.
corpus = [
'I am a boy',
'What are you doing?',
'Can you help me?',
'A man is riding a horse.',
'A woman is playing violin.',
'A monkey is chasing after a goat',
'The quick brown fox jumps over the lazy dog'
]

queries = ['I am in need of assistance', '我是男孩子', 'Qué estás haciendo']
Encode data into embedding
Call the encode function to convert the corpus into embeddings. Set the convert_to_tensor argument to True to get a PyTorch tensor as output. Also, initialize a new variable called top_k and assign it the minimum of 5 and the total length of the corpus. We will use this variable later on to get the indexes with the highest similarity scores.
corpus_embedding = model.encode(corpus, convert_to_tensor=True)

top_k = min(5, len(corpus))
The encode function accepts a list of strings or a single string as input.
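As a quick illustration (the sentences here are just placeholders), a single string yields a 1-D embedding while a list yields a 2-D batch of embeddings:

# single string -> 1-D embedding
single = model.encode('How to learn AI on the web?', convert_to_tensor=True)
print(single.shape)   # torch.Size([384]) for this model

# list of strings -> 2-D batch of embeddings
batch = model.encode(['How to learn AI on the web?', 'What is AI?'], convert_to_tensor=True)
print(batch.shape)    # torch.Size([2, 384])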
Calculate cosine-similarity
The final step is to loop through all the items in queries and perform the following actions:
- calculate the embedding for a single query. Each embedding has the following shape: torch.Size([384])
- call the util.cos_sim function to get the similarity scores between the input query and the corpus
- call the torch.topk function to get the top k results
- print out the output as a reference
for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)
    cos_scores = util.cos_sim(query_embedding, corpus_embedding)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("Query:", query)
    print("---------------------------")
    for score, idx in zip(top_results[0], top_results[1]):
        print(f'{round(score.item(), 3)} | {corpus[idx]}')
The top_results variable is a tuple containing:
- a tensor representing the similarity scores between the input query and the corpus:
tensor([ 0.3326, 0.2809, 0.2258, -0.0133, -0.0333])
- a tensor representing the corresponding corpus indexes:
tensor([2, 0, 1, 4, 3])
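If you prefer named fields over positional indexing, torch.topk exposes the same two tensors as attributes:

# equivalent to top_results[0] and top_results[1]
print(top_results.values)    # similarity scores, highest first
print(top_results.indices)   # indexes into the corpus list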
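For reference, here is the complete script assembled from the snippets above:

from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

corpus = [
    'I am a boy',
    'What are you doing?',
    'Can you help me?',
    'A man is riding a horse.',
    'A woman is playing violin.',
    'A monkey is chasing after a goat',
    'The quick brown fox jumps over the lazy dog'
]

queries = ['I am in need of assistance', '我是男孩子', 'Qué estás haciendo']

# encode the corpus once and reuse it for every query
corpus_embedding = model.encode(corpus, convert_to_tensor=True)
top_k = min(5, len(corpus))

for query in queries:
    # encode the query and score it against the whole corpus
    query_embedding = model.encode(query, convert_to_tensor=True)
    cos_scores = util.cos_sim(query_embedding, corpus_embedding)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("Query:", query)
    print("---------------------------")
    for score, idx in zip(top_results[0], top_results[1]):
        print(f'{round(score.item(), 3)} | {corpus[idx]}')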
Output
You should get the following output on your terminal when you run the script:
Query: I am in need of assistance
---------------------------
0.333 | Can you help me?
0.281 | I am a boy
0.226 | What are you doing?
-0.013 | A woman is playing violin.
-0.033 | A man is riding a horse.

Query: 我是男孩子
---------------------------
0.919 | I am a boy
0.343 | What are you doing?
0.192 | Can you help me?
0.058 | A monkey is chasing after a goat
-0.001 | The quick brown fox jumps over the lazy dog

Query: Qué estás haciendo
---------------------------
0.952 | What are you doing?
0.396 | I am a boy
0.209 | Can you help me?
0.037 | A woman is playing violin.
0.032 | The quick brown fox jumps over the lazy dog
Optimization
The implementation above works great for a small corpus (below 1 million items). For a larger corpus, execution will be relatively slow. Hence, we need to optimize the implementation so that it scales. Some of the most popular optimization techniques include:
- normalize the embeddings and use the dot product as the score function
- use approximate nearest neighbor search to partition the corpus into smaller fractions of similar embeddings
To keep it simple and short, this tutorial will only cover the first technique. When you normalize the embeddings, each output vector has a length of 1. As a result, we can calculate the similarity score using the dot product instead of cosine similarity. The dot product is a lot faster, and you will end up with the same similarity scores.
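As a quick sanity check (the two sentences are just examples), the dot product of two normalized embeddings matches their cosine similarity:

a = model.encode('What are you doing?', convert_to_tensor=True)
b = model.encode('Qué estás haciendo', convert_to_tensor=True)

cos = util.cos_sim(a, b)

# normalize to unit length, then take the dot product
a_norm = a / a.norm()
b_norm = b / b.norm()
dot = util.dot_score(a_norm, b_norm)

print(round(cos.item(), 4), round(dot.item(), 4))   # the two values should match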
Normalize the embedding
There are two ways to normalize the embeddings. The first method is to set the normalize_embeddings argument to True when calling the encode function.
corpus_embedding = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
Alternatively, you can utilize the util.normalize_embeddings function to normalize an existing embedding:
corpus_embedding = model.encode(corpus, convert_to_tensor=True)
corpus_embedding = util.normalize_embeddings(corpus_embedding)
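Either way, you can verify the result by checking that each embedding now has an L2 norm of (approximately) 1:

# every row should have a norm of ~1.0 after normalization
print(corpus_embedding.norm(dim=1))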
Calculate dot product
Call the util.semantic_search function and pass in util.dot_score as the input argument for score_function. It returns a list with one entry per query; each entry is a list of dictionaries with the keys corpus_id and score, sorted by decreasing similarity scores. Note that the query embedding must be normalized as well for the dot product to match the cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embedding, score_function=util.dot_score)
Upon modification, the new implementation code should be as follows:
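Here is a minimal sketch of the modified loop, assuming the same model, corpus, queries, and top_k as before; normalize_embeddings=True is applied to both the corpus and the query so that the dot product equals the cosine similarity:

corpus_embedding = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
top_k = min(5, len(corpus))

for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

    # semantic_search returns one list of hits per query embedding
    hits = util.semantic_search(query_embedding, corpus_embedding, score_function=util.dot_score, top_k=top_k)[0]

    print("Query:", query)
    print("---------------------------")
    for hit in hits:
        print(f"{round(hit['score'], 3)} | {corpus[hit['corpus_id']]}")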
You should get the same output as the first implementation when you run the script:
Query: I am in need of assistance
---------------------------
0.333 | Can you help me?
0.281 | I am a boy
0.226 | What are you doing?
-0.013 | A woman is playing violin.
-0.033 | A man is riding a horse.

Query: 我是男孩子
---------------------------
0.919 | I am a boy
0.343 | What are you doing?
0.192 | Can you help me?
0.058 | A monkey is chasing after a goat
-0.001 | The quick brown fox jumps over the lazy dog

Query: Qué estás haciendo
---------------------------
0.952 | What are you doing?
0.396 | I am a boy
0.209 | Can you help me?
0.037 | A woman is playing violin.
0.032 | The quick brown fox jumps over the lazy dog
Conclusion
Let’s recap what you have learned today.
This article started off with a brief introduction to the sentence-transformers module. Then, it compared symmetric and asymmetric semantic search.
Subsequently, it covered the setup and installation. sentence-transformers can be installed with pip or conda.
In the implementation section, this article highlighted the steps to encode the corpus into embeddings and to calculate similarity scores using cosine similarity.
The final section discussed optimization techniques. One such technique is to normalize the embeddings to length 1 and then calculate the similarity score using the dot product.
Thanks for reading this piece. Have a great day ahead!