Training your own transformer-based NER tagger

In this post, we will focus on the Named Entity Recognition (NER) task. This is one of the most widely studied problems in natural language processing. It enables many fascinating applications by finding entities of interest in documents, which can later be used to detect and extract interesting relations. We will develop an NER tagger for processing financial documents. Such documents are characterized by very formal language, a specific, domain-focused vocabulary, and multiple entity types of interest to financial analysts who make decisions in auditing, investing, or financial reporting, extract suspicious events, or perform network analysis to link entities across multiple documents.

In this tutorial, we will apply state-of-the-art deep learning techniques: a bidirectional long short-term memory (Bi-LSTM) neural network combined with a conditional random field (CRF) layer. The CRF explicitly models dependencies between labels as a transition matrix, which generally tends to improve performance at the final step. Similar techniques are employed in multiple papers to solve related tasks such as POS tagging, general NER, biomedical NER, or electronic health record processing. Moreover, it is very easy to plug in state-of-the-art pre-trained language models, such as BERT, RoBERTa, and XLNet (implemented in the well-known transformers package) or ELMo, and thereby significantly improve the performance of the final tagger. While a custom implementation might be a rather tedious task, it is vastly simplified by Zalando’s Flair framework, which provides such an implementation right out of the box. It also includes several additional features, such as integration with the transformers package, the ability to combine several embedding models as input (their outputs are essentially concatenated), and annealing-based learning rate schedulers. Hence, one can easily test multiple models without the complexity that comes with suitable tokenization and encoding for particular language models, data loading and batching, output postprocessing, or evaluation.
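To make the role of the CRF transition matrix more concrete, here is a minimal sketch of Viterbi decoding in plain Python. It is not Flair's implementation; the label set, the emission scores (standing in for Bi-LSTM outputs) and the transition scores are all made-up values chosen for illustration.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][label] is the per-token score (e.g. produced by a Bi-LSTM);
    transitions[prev][cur] is the score of moving from label prev to label cur.
    """
    if not emissions:
        return []
    # best[t][label] = best score of any path that ends in `label` at step t
    best = [{lab: emissions[0][lab] for lab in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers starting from the best final label
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Illustrative (made-up) scores for the tokens "by", "Acme", "Corp"
labels = ["O", "B-ORG", "I-ORG"]
emissions = [
    {"O": 2.0, "B-ORG": 0.1, "I-ORG": 0.0},
    {"O": 1.0, "B-ORG": 0.9, "I-ORG": 0.2},
    {"O": 0.5, "B-ORG": 0.3, "I-ORG": 1.5},
]
transitions = {
    "O": {"O": 0.0, "B-ORG": 0.0, "I-ORG": -10.0},  # O -> I-ORG is heavily penalized
    "B-ORG": {"O": 0.0, "B-ORG": -1.0, "I-ORG": 1.0},
    "I-ORG": {"O": 0.0, "B-ORG": -1.0, "I-ORG": 0.5},
}

# Picking the best label per token independently ignores label dependencies
greedy = [max(labels, key=lambda lab: scores[lab]) for scores in emissions]
decoded = viterbi_decode(emissions, transitions, labels)
print("greedy :", greedy)
print("viterbi:", decoded)
```

With these scores, choosing labels token by token yields an I-ORG that directly follows an O, whereas the decoder that takes transitions into account prefers a well-formed B-ORG, I-ORG span. In an actual tagger the transition scores are learned jointly with the rest of the network.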

The setup

The OntoNotes corpus, which is widely used to train general NER taggers, was selected for benchmarking. It contains a considerable amount of formal or business-related tagged text, such as news, conversational telephone speech, weblogs, Usenet newsgroups, broadcasts, and talk shows. We excluded texts which are not relevant, such as religious texts. Additionally, we used a text corpus of tagged SEC documents which can be found in this repository. Unfortunately, it is not possible to merge these datasets or to use one of them to fine-tune a tagger trained on the other, as their label sets are not compatible and contain a different number of entity types. The OntoNotes corpus contains 20 different entity types, while the SEC corpus contains only the most basic types: Person, Location, Organization, and Miscellaneous.

The following models were considered for the experiment:

  • Google’s BERT (Bidirectional Encoder Representations from Transformers), as the original transformer model which sparked the whole transformers revolution
  • Facebook’s RoBERTa, as a more robust and optimized version of the original BERT model
  • ELMo model with deep contextualized word representations, created by the Allen Institute
  • Flair contextual string embeddings created by Zalando. For better performance, we used both forward and backward embeddings
  • XLNet model, which applies autoregressive pretraining and outperforms BERT on a multitude of tasks
  • XLM-RoBERTa, a large multilingual model based on Facebook’s RoBERTa

Moreover, we looked for models pretrained on a financial document corpus and used FinBERT, trained by the Hong Kong University of Science and Technology specifically for financial communication analysis. FinBERT is trained on roughly 4.9 billion tokens, comprising corporate reports 10-K & 10-Q (2.5B tokens), earnings call transcripts (1.3B tokens), and analyst reports (1.1B tokens). While it was originally trained for financial sentiment analysis tasks, it can easily be used as a general language model. It fits our needs and goals well; hence, it is used in addition to the previously described models.

For our experiments, we used an HP Z Data Science Workstation with two Nvidia RTX 8000 GPUs (48 GB of VRAM each) and 370 GB of RAM. Given the availability of such resources, we considered using the largest versions of the language models discussed above. The following parameters were used:

  • the learning rate was set to 0.1;
  • 100 epochs for training;
  • LSTM hidden layer size was set to 256;
  • early stopping (patience) parameter equal to 3. Initial runs were performed with the patience parameter set to 10; however, this did not prove to be beneficial.
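The effect of the patience parameter can be illustrated with a small, framework-independent sketch. The function name and the validation losses below are hypothetical, and Flair's own trainer may apply patience differently (for example, to learning rate annealing rather than to an immediate stop), so treat this as a generic illustration of the idea only.

```python
from typing import Iterable, Tuple


def epochs_until_stop(
    val_losses: Iterable[float], patience: int = 3, max_epochs: int = 100
) -> Tuple[int, float]:
    """Return (number of epochs actually run, best validation loss seen)."""
    best = float("inf")
    bad_epochs = 0  # consecutive epochs without improvement
    epochs_run = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if epoch > max_epochs:
            break
        epochs_run = epoch
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement for `patience` epochs in a row
    return epochs_run, best


# Made-up validation losses, one value per epoch
losses = [0.90, 0.60, 0.45, 0.46, 0.47, 0.44, 0.45, 0.46, 0.48, 0.50]
short_run = epochs_until_stop(losses, patience=3)
long_run = epochs_until_stop(losses, patience=10)
print(short_run, long_run)
```

On this toy curve, the larger patience value runs more epochs but ends with the same best loss, which is the kind of outcome that makes a longer wait not worthwhile.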

We used two different batch sizes, particularly for performance testing:

  • Batch size = 32, which is the more conservative setting;
  • Batch size = 512, to test performance in case larger datasets are used for training.

Results for the Ontonotes text corpus

Initial runs were performed with a batch size of 32.

The training performance on the validation set is shown in the figure below.

Validation loss on different transformer models (batch size = 32)

Performance in terms of precision and F1-score is summarized in the table below.

OntoNotes tagging performance
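For readers unfamiliar with how precision and F1-score are typically computed for NER, the sketch below shows the common exact-match, span-level convention on BIO tags. The post does not state exactly how its scores were computed, so this is an assumption, and the tag sequences are made-up examples.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a BIO tag sequence into a set of entity spans."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside:
            if etype is not None:
                spans.add((start, i, etype))
            start, etype = None, None
            # A stray I- tag without a preceding B- is leniently treated as a span start
            if tag.startswith("B-") or tag.startswith("I-"):
                start, etype = i, tag[2:]
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match scores: a predicted span counts only if boundaries and type match."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["B-ORG", "I-ORG", "O", "B-PERSON", "I-PERSON", "O", "B-GPE"]
pred = ["B-ORG", "I-ORG", "O", "B-PERSON", "O", "O", "B-GPE"]
# The PERSON span is predicted with the wrong right boundary, so it does not count
precision, recall, f1 = precision_recall_f1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this convention, a partially correct span earns no credit, which is why span-level scores are usually lower than token-level accuracy.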

The results obtained using a batch size of 512 are presented below. Unfortunately, ELMo could not be tested with this setting, as running it with large batch sizes resulted in a runtime error and the data could not be processed; the largest batch size with which it could be run was only 64. Training a tagger with both Flair embedding models and a batch size of 512 also failed; therefore, the batch size had to be reduced to 256 to train successfully.
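Reducing the batch size after a failed run can be automated. The sketch below is one way to do it, assuming the failure surfaces as a Python RuntimeError; the helper name, the halving strategy and the stand-in training function are hypothetical and not taken from the experiment's code.

```python
from typing import Callable, Optional


def train_with_fallback(
    train_fn: Callable[[int], None],
    start_batch_size: int = 512,
    min_batch_size: int = 32,
) -> Optional[int]:
    """Try training with progressively halved batch sizes.

    Returns the batch size that succeeded, or None if every attempt failed.
    """
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            train_fn(batch_size)
            return batch_size
        except RuntimeError as err:
            print(f"batch size {batch_size} failed ({err}); halving")
            batch_size //= 2
    return None


def fake_train(batch_size: int) -> None:
    # Hypothetical stand-in for a real training run: pretend anything above 256 fails
    if batch_size > 256:
        raise RuntimeError(f"simulated failure at batch size {batch_size}")


used = train_with_fallback(fake_train, start_batch_size=512, min_batch_size=32)
print("trained with batch size", used)
```

In a real setup, train_fn would wrap the actual training call, and it may be worth catching only the specific error type that signals a resource problem so that unrelated bugs are not silently retried.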

Validation loss on different transformer models (batch size = 512)

Results per entity type are not shown in this post, but interested readers can find them in the GitHub repository with the relevant code.

Tagging performance and training times in minutes (OntoNotes dataset)

The obtained results clearly indicate that using large batches did not deteriorate the final results, while the time required for training dropped by more than a factor of three.

Server load analysis

We also measured GPU load during the training process using the nvidia-smi utility. This tool provides a multitude of GPU usage and utilization metrics. The following measures were used to check the load during training:

  • GPU utilization: defined as “percent of the time over the past sample period during which one or more kernels was executing on the GPU”. It is used to measure the level of GPU utilization during the training process
  • Memory utilization: the documentation defines it as “percent of the time over the past sample period during which global (device) memory was being read or written”. In this experiment, it helps to evaluate the efficiency of memory use
  • Used memory percentage: the percentage of GPU memory in use at a particular time.
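The three measures above can be collected programmatically. The following sketch queries nvidia-smi in CSV mode and derives the used memory percentage from memory.used and memory.total. The function names and the sample output string are assumptions made for illustration; this is not the script used in the experiment.

```python
import subprocess
from typing import Dict, List

QUERY_FIELDS = [
    "index",
    "utilization.gpu",
    "utilization.memory",
    "memory.used",
    "memory.total",
]


def parse_gpu_csv(text: str) -> List[Dict[str, float]]:
    """Parse `--format=csv,noheader,nounits` output, one row per GPU."""
    rows = []
    for line in text.strip().splitlines():
        idx, gpu_util, mem_util, mem_used, mem_total = [v.strip() for v in line.split(",")]
        rows.append(
            {
                "index": int(idx),
                "gpu_util_pct": float(gpu_util),
                "mem_util_pct": float(mem_util),
                "mem_used_pct": 100.0 * float(mem_used) / float(mem_total),
            }
        )
    return rows


def sample_gpus() -> List[Dict[str, float]]:
    """Take one sample from nvidia-smi (requires an NVIDIA driver on the machine)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(QUERY_FIELDS)}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_csv(out)


# Made-up sample output for two GPUs, so the parser can be tried without a GPU
sample_output = "0, 45, 22, 12000, 48000\n1, 50, 25, 24000, 48000\n"
rows = parse_gpu_csv(sample_output)
print(rows)
```

Calling sample_gpus() in a loop with a fixed sleep interval, and appending each result to a list or file, would give a time series for the whole training run.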

We measured server load for both small and large batches. The figures below describe GPU utilization over the whole training period with a batch size of 32. GPUs were generally utilized at a level of 40–50%, which might indicate underutilization, and this level stayed similar for each model.

We also calculated mean resource utilization values over the whole training process. To our surprise, the ELMo-based model proved to be rather resource-demanding, as it managed to use almost all available GPU memory. XLNet and XLM-RoBERTa models also required a larger amount of memory, yet this is less surprising, as those models are really large and fully loading them onto the GPU requires a significant amount of resources.
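Averaging such samples over a training run is straightforward. Below is a minimal sketch that computes per-GPU means from a list of sample rows; the row layout and the numbers are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, List


def mean_utilization(samples: List[Dict[str, float]]) -> Dict[int, Dict[str, float]]:
    """Average every metric per GPU; each sample row must carry an 'index' key."""
    grouped: Dict[int, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for row in samples:
        for metric, value in row.items():
            if metric != "index":
                grouped[int(row["index"])][metric].append(value)
    return {
        gpu: {metric: fmean(values) for metric, values in metrics.items()}
        for gpu, metrics in grouped.items()
    }


# Made-up samples taken at two points in time for two GPUs
samples = [
    {"index": 0, "gpu_util_pct": 40.0, "mem_used_pct": 30.0},
    {"index": 1, "gpu_util_pct": 44.0, "mem_used_pct": 31.0},
    {"index": 0, "gpu_util_pct": 50.0, "mem_used_pct": 34.0},
    {"index": 1, "gpu_util_pct": 48.0, "mem_used_pct": 33.0},
]
means = mean_utilization(samples)
print(means)
```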

It is no surprise that GPU utilization is much higher with a large batch size. The figure below illustrates this as well. Unfortunately, these results do not include ELMo-based model training, which failed with any batch size larger than 64. Again, the mean resource utilization chart indicates that there were still resources available for even larger batch sizes if required; however, this might be difficult for XLNet and XLM-RoBERTa based models.

Training tagger for financial document tagging

Finally, we trained NER taggers on the tagged SEC data. Their performance results are presented below.

The results show that the FinBERT-based tagger was quite competitive with the other taggers. It is no surprise that it outperformed them in detecting organization names.

SEC dataset tagging performance

Due to the relatively small training dataset, it did not take long to train the taggers. The results indicate that training took less than 10 minutes for each model.

Tagging performance and training times in minutes (SEC dataset)

Overall, the results suggest that using a specialized language model is useful, as the FinBERT-based tagger achieved the second-best performance. However, a larger corpus would be needed to properly validate this finding. Nevertheless, such results indicate that further research in this direction is very promising.

Final points

In this post, we explored how to train models for named entity recognition using recent state-of-the-art architectures and pre-trained language models. While selecting an optimal configuration can be quite challenging, performance did not differ very significantly between the taggers in this experiment (although RoBERTa achieved the best results). Moreover, we tested a BERT model pretrained specifically for financial communication, and it turned out to be useful when NER is targeted at processing financial documents. As next steps, one can make multiple improvements: optimize the architecture (e.g., use Bi-GRU instead of Bi-LSTM), perform fine-tuning, or apply other size and performance optimizations. Further, it would be interesting to apply this model to other tasks which could benefit from NER, such as matching or linking entities across various financial documents, or extracting relations between different entities or subjects.

The code of the whole experiment, together with preprocessed datasets, notebooks and results, is available on GitHub at

