Named Entity Recognition with BERT in PyTorch

How to leverage a pre-trained BERT model for custom data to predict the entity of each word in a text

When it comes to NLP problems, BERT often comes up as a machine learning model that we can count on for strong performance. It was pre-trained on more than 3,000M words (roughly 2,500M from English Wikipedia and 800M from BooksCorpus), and its bidirectional way of learning from a sequence of words makes it a powerful model to use.

I wrote about how we can leverage BERT for text classification before, and in this article, we’re going to focus more on how to use BERT for named entity recognition (NER) tasks.

What is NER?

NER is an NLP task that identifies and extracts meaningful pieces of information (which we call entities) from a sentence or text. An entity can be a single word or a group of words that refer to the same category.

As an example, let’s say we have the following sentence and we want to extract information about a person’s name from this sentence.

The first step of an NER task is to detect an entity, whether it is a single word or a group of words. As an example:

  • ‘Bond’ ➡️ an entity that consists of a single word
  • ‘James Bond’ ➡️ an entity that consists of two words that refer to the same category

To make sure that our BERT model knows that an entity can be a single word or a group of words, we need to mark the beginning and the continuation of each entity in our training data via the so-called Inside-Outside-Beginning (IOB) tagging. We will see more about this when we look at our dataset later in this article.

After detecting an entity, the next step in an NER task is to categorize it. The categories can be anything, depending on our use case. Below is an example of entity categories:

  • Person: Bond, James Bond, Sam, Anna, Frank, Leonardo DiCaprio
  • Location: New York, Vienna, Munich, London, Central Park, Brandenburger Tor, Times Square
  • Organization: Google, Apple, Stanford University, Deutsche Bank

These entity categories serve as the labels of our data when we train our BERT model, which we will look at in detail in the following sections.


BERT for NER

As previously mentioned, BERT is a Transformer-based machine learning model that comes in handy when we want to solve NLP-related tasks.

If you’re not yet familiar with BERT, I recommend reading my previous article about text classification with BERT before this one. There you’ll find information about what BERT actually is, what kind of input data the model expects, and the output that you’ll get from the model.

What differentiates BERT for text classification from BERT for NER is how we use the output of the model. For a text classification problem, we only use the embedding vector output from the special [CLS] token.

Meanwhile, if we want to use BERT for NER tasks, we need to use the embedding vector output from all of the tokens, so that each token gets its own prediction.

By using the embedding vector output from all of the tokens, we can classify text at the token level. This is exactly what we want, since our BERT model should predict the entity of each token. Now without further ado, let’s go to the implementation.

About the Dataset

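The dataset that we’re going to use in this article is labeled as the CoNLL-2003 dataset, a dataset used specifically for NER tasks. Note, however, that its tag set (geo, gpe, tim, and so on) matches the Kaggle NER corpus derived from the Groningen Meaning Bank (GMB) rather than the four entity types of the original CoNLL-2003 shared task (PER, LOC, ORG, MISC). You can download the data from Kaggle.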

This dataset is distributed under the Open Database License (ODbL) v1.0, so we are free to share and use it for our own purposes. Now let’s take a look at what the dataset looks like.

The data comes as a dataframe that consists of the text and the label. The label column holds the entity category of each word in the text.

In total, there are 9 entity categories, which are:

  • geo for geographical entity
  • org for organization entity
  • per for person entity
  • gpe for geopolitical entity
  • tim for time indicator entity
  • art for artifact entity
  • eve for event entity
  • nat for natural phenomenon entity
  • O is assigned if a word doesn’t belong to any entity.

Let’s take a look at the unique labels available on our dataset:

As you might notice, each entity category is preceded by the letter I or B. This corresponds to the IOB tagging mentioned previously: I means Inside and B means Beginning. Let’s take a look at the following sentence to understand the concept of IOB tagging a little bit more.

  • ‘Kevin’ has the B-per label since it’s the beginning of a person entity
  • ‘Durant’ has the I-per label because it’s the continuation of a person entity
  • ‘Brooklyn’ has B-org since it’s the beginning of an organization entity
  • ‘Nets’ has I-org label since it’s the continuation of an organization entity
  • Other words are assigned O label as they don’t belong to any entity
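
To make the tagging scheme concrete, here is a minimal sketch that groups IOB tags back into entity spans. The sentence and its tags are an illustrative assumption built around the words in the list above, and `extract_entities` is a hypothetical helper, not part of any library.

```python
def extract_entities(words, tags):
    """Group IOB-tagged words into (entity_text, category) pairs."""
    entities = []
    current_words, current_type = [], None

    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity; close any open one first.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same category continues the open entity.
            current_words.append(word)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None

    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Illustrative sentence (an assumption, not taken from the dataset).
words = ["Kevin", "Durant", "plays", "for", "the", "Brooklyn", "Nets"]
tags = ["B-per", "I-per", "O", "O", "O", "B-org", "I-org"]
entities = extract_entities(words, tags)
print(entities)
```

The two-word entities come out as single spans precisely because the I- tags tell us that a word continues the entity opened by the preceding B- tag.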

Data Preprocessing

Before we can use a BERT model to classify the entity of a token, we need to preprocess the data in two steps: tokenizing the text and adjusting the labels to match the tokenization. Let’s start with tokenization.


Tokenization

Tokenization is easy to implement for BERT, as we can use the BertTokenizerFast class from HuggingFace with a pretrained BERT base model.

To give you an example of how the BERT tokenizer works, let’s take a look at one of the texts from our dataset:

Tokenizing the text above with BertTokenizerFast is very straightforward:

We provide several arguments when calling the tokenizer from the BertTokenizerFast class above:

  • padding : to pad the sequence with a special [PAD] token to the maximum length that we specify. The maximum length of a sequence for a BERT model is 512.
  • max_length : maximum length of a sequence.
  • truncation : a Boolean value. If we set it to True, tokens that exceed the maximum length are discarded.
  • return_tensors : the tensor type that is returned, depending on the machine learning framework that we use. Since we’re using PyTorch, we use pt .
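
As a minimal sketch, the four arguments above can be bundled into one small helper. The name `tokenize_text` is hypothetical, and the checkpoint name in the usage comment is an assumption, since any pretrained BERT base checkpoint would work the same way.

```python
def tokenize_text(tokenizer, text, max_length=512):
    """Tokenize one text with the four arguments described above."""
    return tokenizer(
        text,
        padding="max_length",   # pad with [PAD] up to max_length
        max_length=max_length,  # 512 is the limit for BERT
        truncation=True,        # drop tokens beyond max_length
        return_tensors="pt",    # return PyTorch tensors
    )


# Usage sketch (requires the `transformers` and `torch` packages; the
# checkpoint name "bert-base-cased" is an assumption):
#
#   from transformers import BertTokenizerFast
#   tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
#   encoded = tokenize_text(tokenizer, "some text from the dataset")
#   print(encoded["input_ids"].shape)
```

Passing the tokenizer in as an argument keeps the helper independent of how and where the pretrained tokenizer is loaded.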

And below is the output of the tokenization process:

As you can see, the output that we get from the tokenization process is a dictionary with three keys:

  • input_ids: The id representation of the tokens in a sequence. In BERT, the id 101 is reserved for the special [CLS] token, the id 102 is reserved for the special [SEP] token, and the id 0 is reserved for [PAD] token.
  • token_type_ids: To identify the sequence to which a token belongs. Since we only have one sequence per text, all the values of token_type_ids will be 0.
  • attention_mask : To identify whether a token is a real token or padding. The value would be 1 if it’s a real token, and 0 if it’s a [PAD] token.
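
To see how these three fields relate to each other, here is a plain-Python sketch that mimics the structure of the tokenizer output. It is not the HuggingFace implementation: `build_model_inputs` is a hypothetical helper, and the word-piece ids are made-up placeholders. Only the special-token ids (101, 102, 0) come from the description above.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids mentioned above


def build_model_inputs(token_ids, max_length):
    """Add [CLS]/[SEP], truncate, pad, and build the matching masks."""
    # Reserve two positions for [CLS] and [SEP], then truncate.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]

    n_real = len(input_ids)
    n_pad = max_length - n_real

    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "token_type_ids": [0] * max_length,            # single sequence
        "attention_mask": [1] * n_real + [0] * n_pad,  # 1 = real, 0 = pad
    }


# The three word-piece ids below are made-up placeholders.
encoded = build_model_inputs([7001, 7002, 7003], max_length=8)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```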

From the input_ids above, we can decode the ids back into the original sequence with the decode method as follows:

After applying the decode method, we get our original sequence back with the addition of special tokens from BERT: the [CLS] token at the beginning of the sequence, the [SEP] token at the end of the sequence, and a bunch of [PAD] tokens to fill the sequence up to the required maximum length of 512.

After this tokenization process, we need to proceed to the next step, which is adjusting the label of each token.

Adjusting Label After Tokenization

This is a very important step after the tokenization process, because the length of the tokenized sequence no longer matches the length of the original label.

The BERT tokenizer uses the so-called WordPiece tokenizer under the hood, which is a sub-word tokenizer. This means that the BERT tokenizer will likely split a word into one or more meaningful sub-words.

As an example, let’s say we have the following sequence:

The sequence above has in total 13 tokens and thus, it also has 13 labels. However, after BERT tokenization, we get the following result:

There are two problems that we need to address after the tokenization process:

  • The addition of special tokens from BERT such as [CLS], [SEP], and [PAD]
  • The fact that some tokens are split into sub-words.

As a sub-word tokenizer, WordPiece splits uncommon words into sub-words, such as ‘Geir’ and ‘Haarde’ in the example above. This sub-word tokenization helps the BERT model to learn the semantic meaning of related words.
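
The splitting behaviour can be illustrated with a toy version of the greedy longest-match-first strategy that WordPiece tokenizers apply at inference time. This is a simplified sketch, not the HuggingFace implementation: the vocabulary is invented for illustration, so the exact pieces may differ from what the real BERT vocabulary produces.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-words by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no sub-word matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize_words(words, vocab):
    """Tokenize pre-split words and record each token's source word index."""
    tokens, word_ids = [], []
    for idx, word in enumerate(words):
        for piece in wordpiece(word, vocab):
            tokens.append(piece)
            word_ids.append(idx)
    return tokens, word_ids


# Toy vocabulary, invented for illustration only.
vocab = {"Prime", "Minister", "G", "##ei", "##r", "Ha", "##ard", "##e"}
tokens, word_ids = tokenize_words(["Prime", "Minister", "Geir", "Haarde"], vocab)
print(tokens)
print(word_ids)
```

Common words stay whole, while the two uncommon names are broken into three pieces each. The second list records which original word every piece came from, which is the information we need to realign the labels.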

The consequence of this WordPiece tokenization and the addition of special tokens from BERT is that the sequence length after tokenization no longer matches the length of the initial label.

From the example above, now there are in total 512 tokens in the sequence after tokenization, while the length of the label is still the same as before. Also, the first token in a sequence is no longer the word ‘Prime’, but the newly added [CLS] token, so we need to shift our label as well.

To solve this problem, we need to adjust the label such that it has the same length as the sequence after tokenization. To do this, we can utilize the word_ids method from the tokenization result as follows:

In the result, all sub-words of a split token share the same word_ids value, while special tokens from BERT such as [CLS], [SEP], and [PAD] have no word_ids value (None).

These word_ids will be very useful to adjust the length of the label by applying either of these two methods:

  1. We only provide a label to the first sub-word of each split token. The remaining sub-words simply get ‘-100’ as a label. All tokens that don’t have word_ids are also labeled with ‘-100’.
  2. We provide the same label to all of the sub-words that belong to the same token. All tokens that don’t have word_ids are labeled with ‘-100’.

The function in the code snippet below implements the steps defined above.

If you want to apply the first method, set label_all_tokens to False. If you want to apply the second method, set label_all_tokens to True, as you can see in the following code snippet:
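
Here is a minimal sketch of such an alignment function, written against a plain list of word ids so that it runs without the tokenizer. With HuggingFace, that list would come from calling `word_ids()` on the tokenization result. The function name, the label mapping, and the example labels are assumptions for illustration.

```python
def align_labels(word_ids, word_labels, labels_to_ids, label_all_tokens=False):
    """Return one label id per token, using -100 for positions to ignore."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens ([CLS], [SEP], [PAD]) have no word id.
            aligned.append(-100)
        elif word_id != previous_word_id:
            # First sub-word of a word always gets the word's label.
            aligned.append(labels_to_ids[word_labels[word_id]])
        else:
            # Later sub-words: labeled only when label_all_tokens is True.
            label = labels_to_ids[word_labels[word_id]]
            aligned.append(label if label_all_tokens else -100)
        previous_word_id = word_id
    return aligned


labels_to_ids = {"O": 0, "B-per": 1, "I-per": 2}  # hypothetical mapping
word_labels = ["O", "O", "B-per", "I-per"]        # one label per word
# [CLS], word 0, word 1, 3 pieces of word 2, 3 pieces of word 3, [SEP], [PAD]
word_ids = [None, 0, 1, 2, 2, 2, 3, 3, 3, None, None]

first_only = align_labels(word_ids, word_labels, labels_to_ids)
all_tokens = align_labels(word_ids, word_labels, labels_to_ids, label_all_tokens=True)
print(first_only)
print(all_tokens)
```

Both outputs have the same length as the tokenized sequence. They differ only in whether the continuation sub-words carry a real label or ‘-100’.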

In the rest of this article, we’re going to implement the first method, in which we will only provide a label to the first sub-word in each token and set label_all_tokens to False.

Dataset Class

Before we train our BERT model for the NER task, we need to create a dataset class to generate and fetch data in batches.

In the code snippet above, we use the BertTokenizerFast class through the tokenizer variable in the __init__ function to tokenize our input texts, and the align_label function to adjust our labels after the tokenization process.

Next, let’s split our data randomly into training, validation, and test sets. Mind you that the dataset contains 47,959 rows in total. Hence, for demonstration purposes and to speed up the training process, I’m going to take only 1,000 of them. You can, of course, take all of the data for model training.
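
A random subsample and split can be sketched in plain Python as follows. The 80/10/10 proportions, the seed, and the helper name `split_rows` are assumptions chosen for illustration. Only the total of 47,959 rows and the sample of 1,000 come from the text above.

```python
import random


def split_rows(rows, n_samples=1000, train_frac=0.8, val_frac=0.1, seed=42):
    """Sample n_samples rows in random order and split into train/val/test."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    # rng.sample returns the chosen rows in random order, so no extra shuffle.
    subset = rng.sample(rows, min(n_samples, len(rows)))

    n_train = int(train_frac * len(subset))
    n_val = int(val_frac * len(subset))

    train = subset[:n_train]
    val = subset[n_train:n_train + n_val]
    test = subset[n_train + n_val:]
    return train, val, test


# Stand-in for the 47,959 (text, label) rows of the dataset.
rows = list(range(47959))
train, val, test = split_rows(rows)
print(len(train), len(val), len(test))
```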

Model Building

In this article, we’re going to use a pretrained BERT base model from HuggingFace. Since we’re going to classify text at the token level, we need to use the BertForTokenClassification class.

The BertForTokenClassification class wraps the BERT model and adds a linear layer on top of it that acts as a token-level classifier.

In the code snippet above, we instantiate the model and set the output size of the token classifier equal to the number of unique labels in our dataset, which in our case is 17 (a B- and an I- tag for each of the 8 entity categories, plus O).
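
The count of 17 can be checked with a short sketch that builds the label-to-id mappings the model needs. The helper name `build_label_maps` is hypothetical, and the checkpoint name in the usage comment is an assumption.

```python
def build_label_maps(label_sequences):
    """Map every unique tag to an integer id and back."""
    unique_labels = sorted({tag for seq in label_sequences for tag in seq})
    labels_to_ids = {label: idx for idx, label in enumerate(unique_labels)}
    ids_to_labels = {idx: label for label, idx in labels_to_ids.items()}
    return labels_to_ids, ids_to_labels


# Build the full tag set from the eight categories listed earlier.
categories = ["geo", "org", "per", "gpe", "tim", "art", "eve", "nat"]
all_tags = ["O"] + [f"{prefix}-{cat}" for cat in categories for prefix in ("B", "I")]
labels_to_ids, ids_to_labels = build_label_maps([all_tags])
print(len(labels_to_ids))

# Usage sketch (requires `transformers`; the checkpoint name is an assumption):
#
#   from transformers import BertForTokenClassification
#   model = BertForTokenClassification.from_pretrained(
#       "bert-base-cased", num_labels=len(labels_to_ids)
#   )
```

In practice the label sequences would come from the label column of the dataframe rather than from a hand-built list.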

Next, we will define a function for the training loop.

Training Loop

The training loop for our BERT model is the standard PyTorch training loop with a few additions, as you can see below:

In the training loop above, I only train the model for 5 epochs and use SGD as the optimizer. The loss computation in each batch is already taken care of by the BertForTokenClassification class.
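
To show the mechanics without downloading BERT, here is a minimal sketch of the same kind of loop on a tiny stand-in model. `TinyTokenClassifier`, the random batch, and the learning rate are all assumptions for illustration. The loss is computed explicitly with `ignore_index=-100`, which is also the default of PyTorch's cross-entropy loss, so positions labeled ‘-100’ do not contribute.

```python
import torch
from torch import nn

torch.manual_seed(0)

NUM_LABELS, VOCAB_SIZE, HIDDEN = 17, 100, 16


class TinyTokenClassifier(nn.Module):
    """Stand-in for BERT: an embedding plus a per-token linear classifier."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.classifier = nn.Linear(HIDDEN, NUM_LABELS)

    def forward(self, input_ids):
        return self.classifier(self.embed(input_ids))  # (batch, seq, labels)


# Random toy batch: 4 sequences of 8 tokens; the last 3 positions are ignored.
input_ids = torch.randint(0, VOCAB_SIZE, (4, 8))
labels = torch.randint(0, NUM_LABELS, (4, 8))
labels[:, 5:] = -100

model = TinyTokenClassifier()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # skips -100 positions

losses = []
for epoch in range(5):
    optimizer.zero_grad()
    logits = model(input_ids)
    # Flatten to (batch * seq, labels) and (batch * seq,) for the loss.
    loss = loss_fn(logits.view(-1, NUM_LABELS), labels.view(-1))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    print(f"epoch {epoch + 1}: loss {loss.item():.4f}")
```

A real run would iterate over batches from a DataLoader inside each epoch and add a validation pass, but the zero-grad, forward, loss, backward, step sequence stays the same.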

In each epoch of the training loop, there is also an important step that we need to do. After model prediction, we need to ignore all of the tokens that have ‘-100’ as the label when we evaluate the predictions.
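
The masking step can be sketched in plain Python as follows. The helper name `masked_accuracy` and the example values are made up for illustration. In the training loop the same filtering would typically be applied to tensors.

```python
def masked_accuracy(predictions, labels, ignore_index=-100):
    """Token accuracy over positions whose label is not ignore_index."""
    kept = [(p, y) for p, y in zip(predictions, labels) if y != ignore_index]
    if not kept:
        return 0.0
    correct = sum(1 for p, y in kept if p == y)
    return correct / len(kept)


# Predicted and true label ids for one tokenized sequence (made-up values).
predictions = [5, 0, 0, 1, 7, 7, 4, 3, 3, 0, 0]
labels = [-100, 0, 0, 1, -100, -100, 2, -100, -100, -100, -100]
accuracy = masked_accuracy(predictions, labels)
print(accuracy)
```

Only four positions carry a real label here, so the accuracy is computed over those four tokens rather than over all eleven.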

Below is the example of the training output after we train our BERT model for 5 epochs:

Of course, the output that you’ll see may vary when you train your own BERT model as there is stochasticity in the training process.

There are a lot of things that you can do to improve the performance of our model. If you notice, we have a data imbalance problem as there are a lot of tokens with ‘O’ label. We can improve our model, for example, by applying class weights during the training process.
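
One common heuristic for class weights is inverse frequency: the rarer a label, the larger its weight. The sketch below is an illustration of that idea rather than the method used in this article, and the helper name and label ids are made up.

```python
from collections import Counter


def inverse_frequency_weights(label_ids, num_labels, ignore_index=-100):
    """Weight each class by total / (num_labels * count); unseen classes get 0."""
    counts = Counter(y for y in label_ids if y != ignore_index)
    total = sum(counts.values())
    return [
        total / (num_labels * counts[c]) if counts[c] else 0.0
        for c in range(num_labels)
    ]


# Made-up label ids where class 0 (standing in for 'O') dominates.
label_ids = [0] * 8 + [1] * 1 + [2] * 1 + [-100] * 5
weights = inverse_frequency_weights(label_ids, num_labels=3)
print(weights)
```

A list like this could be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`. In that case the loss would be computed from the model's logits instead of relying on the loss returned by the model.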

Also, you can try different optimizers such as the Adam optimizer with weight decay regularization.

Evaluate Model on Test Data

Now that we have trained our model, we can evaluate its performance on unseen test data with the following code snippet.

In my case, the trained model achieved an average accuracy of 92.22% on the test set. You can, of course, change the metric to F1 score, precision, or recall.
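
As a sketch of what such alternative metrics look like at the token level, the function below computes precision, recall, and F1 for a single label id while skipping ‘-100’ positions. The helper name and the example values are assumptions for illustration.

```python
def precision_recall_f1(predictions, labels, target, ignore_index=-100):
    """Precision, recall, and F1 for one label id, skipping ignored positions."""
    tp = fp = fn = 0
    for p, y in zip(predictions, labels):
        if y == ignore_index:
            continue
        if p == target and y == target:
            tp += 1
        elif p == target:
            fp += 1
        elif y == target:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up predictions and labels; the last position is ignored.
predictions = [1, 1, 0, 1, 0, 1]
labels = [1, 0, 0, 1, 1, -100]
scores = precision_recall_f1(predictions, labels, target=1)
print(scores)
```

Because most tokens carry the ‘O’ label, per-label scores like these tend to say more about entity detection quality than overall accuracy does.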

Alternatively, we can use the trained model to predict the entity of each word of a text or a sentence with the following code:

If everything works perfectly, then our model will be able to perform reasonably well to predict the entity of each word of an unseen sentence as you can see above.
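
At inference time the model predicts one label per token, so the predictions have to be mapped back to words. Since we only label the first sub-word of each word during training, a natural choice is to keep only the prediction of each word's first sub-word. The sketch below shows this step on made-up values, and the helper name is hypothetical.

```python
def predictions_to_word_labels(word_ids, predicted_ids, ids_to_labels):
    """Keep the prediction of the first sub-word of every word."""
    word_labels = []
    previous_word_id = None
    for word_id, pred in zip(word_ids, predicted_ids):
        # Skip special tokens (None) and continuation sub-words.
        if word_id is not None and word_id != previous_word_id:
            word_labels.append(ids_to_labels[pred])
        previous_word_id = word_id
    return word_labels


ids_to_labels = {0: "O", 1: "B-per", 2: "I-per"}  # hypothetical mapping
word_ids = [None, 0, 1, 1, 2, None]               # word 1 has two sub-words
predicted_ids = [0, 1, 2, 0, 0, 0]                # one prediction per token
word_labels = predictions_to_word_labels(word_ids, predicted_ids, ids_to_labels)
print(word_labels)
```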


Conclusion

In this article, we have implemented BERT for a Named Entity Recognition (NER) task. This means that we have trained a BERT model to predict the IOB tag of each token in a custom text or sentence.

I hope that this article helps you get started with BERT for NER tasks. You can find all of the code implemented in this article in this notebook.

