Generate distractors for MCQs using Word Vectors, Sentence Transformers and MMR algorithm




Use NLP to generate wrong answers for MCQs in Edtech


If you are working at the crossroads of NLP and Edtech, you will sooner or later encounter the problem of generating distractors (wrong answer choices) for a given question and answer, automatically using NLP.

What are distractors?

Distractors are the wrong answers to a multiple-choice question.

For example, if a given multiple choice question has Barack Obama as the correct answer then we need to generate wrong choices (distractors) like George Bush, Bill Clinton, Donald Trump, etc.


Given a correct answer, how do you find distractors?

There are several methods to find distractors based on knowledge graphs, as well as unsupervised algorithms like word2vec, fastText, etc. A few of them can be found here in my previous blog post.

Let’s focus on sense2vec, which works along the same lines as word2vec: you query with a given word or phrase and get back similar words/phrases in the vector space.

Assume that we have a question and a correct answer.

Q: Who is the 44th president of the United States?
A: Barack Obama

Now the goal is to find wrong answer choices (distractors) for the word “Barack Obama”. That could be other presidents or presidential candidates.

But there is a problem! If you query sense2vec with “Barack Obama” for similar words, you will encounter many near-duplicates and misspelled words like Barrack Obama, President Obama, Obama, etc., that you need to filter out. You will also encounter near-duplicates like George Bush and George W Bush, of which you need to keep only one as a distractor.

Similar words to ‘Barack Obama’ from the sense2vec algorithm:

Barrack Obama
President Obama
Obama
George Bush
George W Bush
Bill Clinton
John McCain

How do we tackle this problem of filtering near duplicates? We can effectively use sentence transformer embeddings and Maximum Marginal Relevance to achieve this. And we will see how in a moment with code!

The algorithm

The colab notebook with full code for this tutorial can be found here.

Let’s begin by installing the sense2vec library and downloading and initializing the sense2vec model.
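A minimal sketch of the setup, assuming the pre-trained Reddit 2015 sense2vec vectors; the download URL and the extracted folder name ("s2v_old") are assumptions and may differ for other releases:

```python
# Install the libraries and download the pre-trained sense2vec vectors.
# !pip install sense2vec sentence-transformers
# !wget https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz
# !tar -xvf s2v_reddit_2015_md.tar.gz

from sense2vec import Sense2Vec

# Load the extracted vectors from disk.
s2v = Sense2Vec().from_disk("s2v_old")
```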

Let’s use our search word “Barack Obama” and see what sense2vec outputs –
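Something along these lines, assuming the model loaded above; get_best_sense resolves “Barack Obama” to its most frequent sense key (e.g. Barack_Obama|PERSON) in the loaded vectors:

```python
# Query sense2vec for phrases similar to the correct answer.
answer = "Barack Obama"
sense = s2v.get_best_sense(answer)
most_similar = s2v.most_similar(sense, n=20)

# Strip the sense tags and underscores to get plain phrases.
distractor_candidates = [
    key.split("|")[0].replace("_", " ") for key, _score in most_similar
]
print(distractor_candidates)
```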

Output:

[‘Barrack Obama’, ‘George W. Bush’, ‘George W Bush’, ‘Ronald Reagan’, ‘George Bush’, ‘John McCain’, ‘Jimmy Carter’, ‘President Obama’, ‘Bill Clinton’, ‘Obama’, ‘Hilary Clinton’, ‘Hillary Clinton’, ‘president’, ‘Sarah Palin’, ‘President’, ‘Nancy Pelosi’]

Now, in order to effectively filter near-duplicates, let us convert the original answer as well as each of the similar words returned by sense2vec into a vector using Sentence Transformers.
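A sketch of the embedding step; the specific SentenceTransformer model name here is an assumption, and any general-purpose SBERT model should work:

```python
from sentence_transformers import SentenceTransformer

# Encode the answer plus every sense2vec candidate into the same vector space.
model = SentenceTransformer("all-MiniLM-L6-v2")

candidates = [answer] + distractor_candidates   # index 0 is the correct answer
embeddings = model.encode(candidates)           # shape: (num_candidates, dim)
answer_embedding = model.encode([answer])       # shape: (1, dim)
```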

Now that we have a vector embedding for each word/phrase, our goal is to pick the top N most diverse words (those farthest from each other) in the embedding space.

N in our multiple-choice question case could be four. So, iteratively, we want to pick the three most diverse words/phrases for our search keyword (“Barack Obama”). But how do we achieve this?

Maximum Marginal Relevance (MMR) comes to our rescue! There is a great implementation of the MMR algorithm in the KeyBERT library by Maarten Grootendorst.

To quote from the MMR algorithm implementation page above –

MMR considers the similarity of keywords/keyphrases with the document, along with the similarity of already selected keywords and keyphrases. This results in a selection of keywords that maximize their diversity with respect to the document.

Here, the document is replaced by our original query keyword (Barack Obama). We also slightly modify the MMR algorithm so that the first word selected is the original keyword itself (Barack Obama), and each subsequent word chosen by MMR has to be diverse both from the original keyword and from the words already selected up to that point. The diversity criterion makes sure that near-duplicates never get chosen. The algorithm also has a parameter to control how much diversity is enforced.
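Here is a sketch of that modified MMR, adapted from the KeyBERT implementation; the function name, parameters, and the diversity value used below are illustrative, not the library’s exact API:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr(doc_embedding, word_embeddings, words, top_n=5, diversity=0.9):
    """MMR adapted so the original keyword (index 0) is always picked first."""
    # Similarity of every candidate to the query (the correct answer) ...
    word_doc_similarity = cosine_similarity(word_embeddings, doc_embedding)
    # ... and similarity between the candidates themselves.
    word_similarity = cosine_similarity(word_embeddings)

    keywords_idx = [0]                        # start from the original keyword
    candidates_idx = list(range(1, len(words)))

    for _ in range(top_n - 1):
        candidate_similarities = word_doc_similarity[candidates_idx, :]
        # Max similarity of each remaining candidate to anything already selected.
        target_similarities = np.max(
            word_similarity[candidates_idx][:, keywords_idx], axis=1
        )

        # A high `diversity` penalises candidates close to already-chosen ones.
        mmr_scores = (
            (1 - diversity) * candidate_similarities
            - diversity * target_similarities.reshape(-1, 1)
        )
        best = candidates_idx[int(np.argmax(mmr_scores))]

        keywords_idx.append(best)
        candidates_idx.remove(best)

    return [words[idx] for idx in keywords_idx]

# Pick the original answer plus four diverse distractors.
selected = mmr(answer_embedding, embeddings, candidates, top_n=5, diversity=0.9)
print(selected[0])    # the original answer
print(selected[1:])   # the distractors
```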

And you can see that the output is –

Barack Obama
-------->
John McCain
George W Bush
Sarah Palin
Bill Clinton

You can clearly see that near-duplicates like ‘Barrack Obama’, ‘George W. Bush’, ‘George Bush’, ‘President Obama’, ‘Obama’, etc. are filtered out and are not present in the final output.

Conclusion

Word vector algorithms can be used to generate distractors (wrong choices) for multiple-choice questions, given a correct answer.

But because of the unsupervised nature of the training involved in word vector algorithms, many similar words and near-duplicates are returned.

Since these words are sometimes only semantically similar, filtering with word edit distance or string matching on phrases doesn’t work well. Hence we use a transformer-based algorithm, SBERT, to convert any word or phrase into a vector and filter neighbors in that space. These models are trained to place misspelled variants as well as semantically similar words close together in the embedding space.

From a given set of possible distractors, we can effectively choose the top four diverse keywords/phrases to serve as distractors for an MCQ using the Maximum Marginal Relevance (MMR) algorithm.
