Data Analysis For Entity Extraction


Data analysis is one of the primary steps before a machine learning model can be trained, because analysing the data reveals hidden patterns that can then be used to train the model efficiently. We as humans need to understand and be comfortable with the data before the machine begins to learn from it; otherwise we risk GIGO (Garbage In, Garbage Out).

Below are some considerations to make before training an entity extraction model to extract business entities.

Domain knowledge

Understanding the business context in which different entities appear in a document is important. Take a person’s name: in a medical document it may refer to a patient, while in a mortgage document it may refer to a lender or a borrower. The meaning of an entity therefore varies across document types. Understanding these details clarifies the context in which entities need to be extracted from a document, so that the results make sense to the business.

Sequence size

Pre-trained models like BERT cannot be trained with sequences longer than 512 tokens. This is a problem because an entity might appear at any position in a sentence: at the start, in the middle, or at the end. If a sentence is a huge paragraph of, say, 1,000 words and the entity appears at the end, it becomes difficult to learn that entity at a 512-token sequence length, because everything after the 512th token is lost when the input is truncated.

The WordPiece tokenizer makes matters worse. For example, a word such as “Johnson” might be split into “John ##son” before training begins, which increases the sequence length further through the extra sub-word tokens.
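The greedy longest-match splitting that WordPiece performs can be sketched as follows. The toy vocabulary and the resulting splits are illustrative assumptions; a real BERT tokenizer uses a learned vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match WordPiece-style split of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a match.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: unknown token
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"john", "##son", "bank", "##s"}
print(wordpiece_tokenize("johnson", toy_vocab))  # ['john', '##son']
print(wordpiece_tokenize("banks", toy_vocab))    # ['bank', '##s']
```

One input word can thus become two or more tokens, which is why a 512-word paragraph usually exceeds the 512-token limit.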

Also, large sequences take more time to train than smaller ones, so it is important to understand the sequence-length requirements of an entity during data analysis. This helps train the models more efficiently and in less time. For entities that appear beyond the 512-token limit, a hybrid approach can be used: a pre-defined rule first shortens the sequence so that the entity falls within the 512-token window.
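One possible pre-defined shortening rule is to keep only a fixed-size word window around an anchor phrase that is known to occur near the entity. This is a minimal sketch; the anchor phrase ("Borrower") and the window size are illustrative assumptions, not a prescribed rule.

```python
def window_around_anchor(text, anchor, window=200):
    """Keep up to `window` words on each side of the anchor phrase."""
    words = text.split()
    lowered = [w.lower().strip('.,:;"') for w in words]
    try:
        i = lowered.index(anchor.lower())
    except ValueError:
        # Anchor not found: fall back to plain head truncation.
        return " ".join(words[:2 * window])
    start = max(0, i - window)
    return " ".join(words[start:i + window])

# An entity sitting ~900 words in would be lost to plain truncation,
# but survives the windowed shortening.
long_text = "filler " * 900 + "the Borrower is John Smith"
short = window_around_anchor(long_text, "Borrower", window=50)
print(len(short.split()))  # well under 512, entity retained
```

The shortened text can then be tokenized and fed to the model as usual.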

Number of trainable pages

Sometimes, in a document of say 100 pages, certain entities appear near the start of the document, for example on page 1 or 2, consistently across the training corpus. In that scenario the model does not need to see all 100 pages to learn an entity; it can be trained on only the first n pages in which the entity appears. This reduces both training and inference time, because the model now needs to process only those n pages.
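Choosing n can itself be part of the data analysis: scan the corpus for the latest page on which the entity ever appears, then cut every document there. This is a sketch under the assumption that each document is represented as a list of per-page strings; the corpus below is hypothetical.

```python
def max_entity_page(corpus, entity):
    """Highest 1-based page number on which `entity` appears in the corpus."""
    return max(
        i + 1
        for pages in corpus
        for i, page in enumerate(pages)
        if entity in page
    )

def first_n_pages(pages, n):
    """Keep only the first n pages of one document."""
    return "\n".join(pages[:n])

corpus = [
    ["Borrower: John Smith", "terms...", "signatures..."],
    ["Intro", "Borrower: Jane Doe", "terms..."],
]
n = max_entity_page(corpus, "Borrower")
print(n)                            # 2
print(first_n_pages(corpus[0], n))  # first two pages of document 0
```

In practice you might add a small safety margin to n in case unseen documents push the entity slightly later.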

Retaining or removing certain punctuations

Overall it is a good idea to remove punctuation before training an entity extraction model: punctuation adds randomness to the data and lowers the F1 score. However, you might want to retain certain punctuation marks that distinguish between different sentence formats.

For example, take an entity that appears in two different sentence formats.

Sentence format 1 → The client is John Smith (“Borrower”), State Bank is (“Lender”).

Sentence format 2 → Borrower : John Smith

If all punctuation is removed from the above sentences, the result is:

Sentence format 1 → The client is John Smith Borrower State Bank is Lender

Sentence format 2 → Borrower John Smith

The result is that the model may now incorrectly identify State Bank as the borrower: from sentence format 2 it has learned the pattern “Borrower” followed by a name, and in the stripped sentence format 1 the word “Borrower” now immediately precedes “State Bank”. Retaining the colon and quotation marks keeps the two formats distinguishable.

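Selective punctuation removal can be sketched with a simple keep-set. The particular characters retained here (colon, quotes, parentheses) are an illustrative choice driven by the Borrower/Lender formats above, not a general recommendation.

```python
import re
import string

KEEP = ':"()'  # punctuation retained because it distinguishes formats

def strip_punctuation(text, keep=KEEP):
    """Remove all ASCII punctuation except the characters in `keep`."""
    drop = "".join(c for c in string.punctuation if c not in keep)
    return re.sub("[" + re.escape(drop) + "]", "", text)

print(strip_punctuation('Borrower : John Smith'))
# Borrower : John Smith   (colon kept)
print(strip_punctuation('The client is John Smith ("Borrower").'))
# The client is John Smith ("Borrower")   (period dropped, quotes/parens kept)
```

Note that curly quotation marks (“ ”) are not in `string.punctuation`, so documents using typographic quotes would need the keep/drop sets extended accordingly.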


As in any other machine learning use case, data analysis is one of the most important steps before training a model for entity extraction. Taking the above factors into consideration will help you train models efficiently. It also keeps you aware of the different patterns in your data and of what to expect from model inference.

