How to Train Bert For Q&A in Any Language



Q&A Head

We have a fully trained core Bert model, and we can take that core and add several heads that allow us to perform different tasks with the model. However, these heads will initially be untrained, so we must train them!

If you prefer video, we cover everything here too:

For Q&A, the most popular dataset is the Stanford Question Answering Dataset (SQuAD). Alongside the original English version, several other languages are now available.

Language options (on HF datasets at the time of writing):

Spanish: squad_es
Portuguese: squad_v1_pt
Italian: squad_it
Korean: [squad_kor_v1, squad_kor_v2]
Thai: [iapp_wiki_qa_squad, thaiqa_squad]
English: [squad, squad_v2, squadshifts, squad_adversarial]

To download the English SQuAD data, we use:
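A minimal sketch of this step, assuming the Hugging Face datasets library (the contexts, questions, and answers variable names are our own choices):

from datasets import load_dataset

# load the English SQuAD dataset; swap 'squad' for e.g. 'squad_es' or 'squad_it'
squad = load_dataset('squad')

# pull out the three features we need from the training split
contexts = squad['train']['context']
questions = squad['train']['question']
answers = squad['train']['answers']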

For each sample, our data can be broken into three components:

  • Question — a string containing the question that we will ask Bert.
  • Context — a larger sequence (paragraphs) that contains the answer to our question.
  • Answer — a slice of the context that answers our question.

Given a question and context, our Q&A model must read both and return the token positions of the predicted answer within the context.

Example of our question, context, and answer input and the (hoped for) correct prediction from the model. Note that these span values are not exact and are assuming one word == one token.

Formatting Answers

Before we begin tokenization, training, etc., we need to reformat our answers feature into the format required for training. Presently, it looks like this:

{'text': ['the answer is here'], 'answer_start': [71]}

The 71 represents the character position in our context string where the answer begins. We can get the answer range by simply adding the length of text to answer_start — and we do this first:
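A sketch of how this could look, assuming the contexts and answers lists extracted earlier (the add_end_idx name and the small shift check are our own additions; SQuAD start positions are occasionally misaligned by a character or two):

def add_end_idx(answers, contexts):
    # add an 'answer_end' character index to every answer dict
    for answer, context in zip(answers, contexts):
        gold_text = answer['text'][0]
        start_idx = answer['answer_start'][0]
        end_idx = start_idx + len(gold_text)

        # shift until the slice of context matches the answer text exactly
        for shift in (0, 1, 2):
            if context[start_idx - shift:end_idx - shift] == gold_text:
                start_idx -= shift
                end_idx -= shift
                break

        # flatten to plain integers for the steps that follow
        answer['answer_start'] = start_idx
        answer['answer_end'] = end_idx

add_end_idx(answers, contexts)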

The data format is now ready for tokenization.

Tokenization

We need to tokenize the SQuAD data so that it is readable by our Bert model. For the context and question features, we can do this using the standard tokenizer() function:
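For example, with a fast Bert tokenizer (the multilingual checkpoint name here is just a placeholder; use the tokenizer that matches your core Bert model):

from transformers import BertTokenizerFast

# placeholder checkpoint; swap in whichever core Bert model you are using
tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-cased')

# encode every (context, question) pair into one set of token arrays
encodings = tokenizer(contexts, questions, truncation=True, padding=True)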

This encodes each context and question pair into a single array of tokens. These arrays will act as the input to our Q&A training, but we have no targets yet.

Our targets are the start and end positions of the answer, which we previously built using the character start and end positions within the context strings. However, we will be feeding tokens into Bert, so we need to provide the token start and end positions.

To do this, we need to convert the character start and end positions into token start and end positions — easily done with our add_token_positions function:
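A sketch of what that function could look like, assuming the encodings, answers, and tokenizer objects from the previous steps; char_to_token maps a character index in the context to its token index:

def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i, answer in enumerate(answers):
        # convert character positions into token positions within the context
        start_positions.append(encodings.char_to_token(i, answer['answer_start']))
        end_positions.append(encodings.char_to_token(i, answer['answer_end'] - 1))

        # if the answer was truncated out of the context, point to the max length
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length

    # add the target tensors to our Encoding object
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(encodings, answers)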

This function adds two more tensors to our Encoding object (which we feed into Bert) — the start_positions and end_positions.

Our tensors are now ready for training the Bert Q&A head.

Training

We will be training using PyTorch, which means we will need to convert the tensors we’ve built into a PyTorch Dataset object.
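A minimal Dataset wrapper around our encodings could look like this (the SquadDataset class name is our own):

import torch

class SquadDataset(torch.utils.data.Dataset):
    # wraps our Encoding object so PyTorch can index into it sample by sample
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # return one sample as a dict of tensors (input IDs, attention mask, targets)
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings['input_ids'])

train_dataset = SquadDataset(encodings)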

We will feed our Dataset to our Q&A training loop using a DataLoader object, which we initialize with:
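For example (the batch size of 16 is an arbitrary choice; adjust it to your hardware):

from torch.utils.data import DataLoader

# shuffle so each epoch sees the samples in a different order
loader = DataLoader(train_dataset, batch_size=16, shuffle=True)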

And finally, we set up our model parameters and begin the training loop.
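A sketch of that loop, assuming a BertForQuestionAnswering head on top of the same multilingual checkpoint, the AdamW optimizer, a learning rate of 5e-5, and three epochs (all of these hyperparameters are assumptions, not fixed requirements):

import torch
from torch.optim import AdamW
from transformers import BertForQuestionAnswering

# load the pretrained core with a fresh (untrained) Q&A head on top
model = BertForQuestionAnswering.from_pretrained('bert-base-multilingual-cased')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
model.train()

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    for batch in loader:
        optim.zero_grad()
        # move the batch tensors onto the same device as the model
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        # passing the target positions makes the model return the loss directly
        outputs = model(input_ids, attention_mask=attention_mask,
                        start_positions=start_positions,
                        end_positions=end_positions)
        loss = outputs.loss
        loss.backward()
        optim.step()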

After training our model, all we need to do is save it!
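For example (the directory name is arbitrary):

# save the model weights and the tokenizer together so they can be reloaded as a pair
model.save_pretrained('./bert-qa-model')
tokenizer.save_pretrained('./bert-qa-model')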

And we’re done: we’ve trained a Q&A model from scratch.
