Eventually, the architectures are evaluated on a set of supervised (SemEval 2010 Task 8, KBP-37, and TACRED) and few-shot (FewRel) relation extraction tasks. For the supervised tasks, a classification layer is added that uses a softmax loss to predict the relation class of the input. For the few-shot tasks, the authors use the dot product between the relation representation of the query statement and that of each candidate statement as a similarity score. Task-specific finetuning is performed with the following hyperparameters:
- Transformer Architecture: 24 layers, 1024 hidden size, 16 heads
- Weight Initialization: BERT (large)
- Post Transformer Layer: Dense with linear activation (KBP-37 and TACRED), or Layer Normalization layer (SemEval 2010 and FewRel)
- Training Epochs: 1 to 10
- Learning Rate: 3e-5 with Adam (supervised) or 1e-4 with SGD (few-shot)
- Batch Size 64 (supervised) or 256 (few-shot)
For all four tasks, the model using the Entity Markers input representation and the Entity Start output representation achieves the best scores. Intuitively, this is because Transformer-based models like BERT learn the influence of one token on other tokens through the attention mechanism. This mechanism allows BERT to learn the impact of a preceding token like E1ₛ on E2ₑ and vice versa, which in this case captures the relation. That might explain why this approach works best for this specific task.
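As a sketch of the Entity Markers scheme described above, the helper below inserts marker tokens around two entity spans and returns the positions whose hidden states the Entity Start output representation would read. The function name and the list-of-strings tokenization are illustrative assumptions, not the paper's code:

```python
def add_entity_markers(tokens, span1, span2):
    """Insert [E1]/[/E1] and [E2]/[/E2] markers around two entity spans.

    tokens: list of word pieces; span1/span2: (start, end) token indices,
    end exclusive. Simplified sketch: assumes span1 precedes span2.
    """
    (s1, e1), (s2, e2) = span1, span2
    assert e1 <= s2, "sketch assumes span1 precedes span2 without overlap"
    out = (tokens[:s1] + ["[E1]"] + tokens[s1:e1] + ["[/E1]"]
           + tokens[e1:s2] + ["[E2]"] + tokens[s2:e2] + ["[/E2]"]
           + tokens[e2:])
    # The Entity Start representation concatenates the Transformer's
    # hidden states at the [E1] and [E2] positions; we return the indices.
    return out, out.index("[E1]"), out.index("[E2]")
```

In a real setup the marker tokens would be added to the tokenizer's vocabulary so the model learns embeddings for them.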
Fair enough… But why Blanks? Let’s look at the status quo: We’re able to train a relation extractor on pre-labeled data. However, finding a labeled dataset that has the necessary size is hard, so we need to find a way to train the model on an unlabeled dataset. To get to that point one has to make a few assumptions:
- For any pair of relation statements r and t, the inner product fθ(r)ᵀfθ(t) should be high if r and t express semantically similar relations and low otherwise.
- Because web text is highly redundant, each relation between a given pair of entities is likely to be stated multiple times. Consequently, r = (x, s₁, s₂) is more likely to encode the same semantic relation as t = (y, u₁, u₂) if s₁ refers to the same entity as u₁ and s₂ refers to the same entity as u₂.
Given these assumptions, the authors aim to learn a statement encoder fθ that can be used to determine whether two relation statements encode the same relation. To do that, they define a binary classifier that assigns a probability to the event that r and r̂ encode the same relation (l = 1) or not:

p(l = 1 | r, r̂) = σ(fθ(r)ᵀ fθ(r̂)) = 1 / (1 + exp(−fθ(r)ᵀ fθ(r̂)))
A neural network could minimize this loss simply by learning an entity linking model instead of the actual relation between the entities. To prevent that, the authors introduce a special [BLANK] symbol: each of the entity mentions s₁, u₁ and s₂, u₂ is kept by name with probability α and replaced by [BLANK] otherwise. As a result, only α² of the relation statements in the training set contain both entities by name, which forces the model to do more than simply identify the named entities in r. During training, α is set to 0.7.
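The blanking step can be sketched as follows. `blank_entities` and the span format are illustrative assumptions; the key point is that each entity is kept with probability α, so both entities survive only α² of the time:

```python
import random

BLANK = "[BLANK]"

def blank_entities(tokens, span1, span2, alpha=0.7, rng=random):
    """Independently keep each entity span with probability alpha and
    replace it with [BLANK] otherwise (sketch; spans are (start, end)
    token indices, end exclusive)."""
    out = list(tokens)
    # Process the later span first so earlier indices stay valid.
    for start, end in sorted([span1, span2], reverse=True):
        if rng.random() > alpha:  # blank with probability 1 - alpha
            out[start:end] = [BLANK]
    return out
```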
The training setup is similar to training BERT. However, a second loss is introduced in addition to the masked language model loss: the matching the blanks loss. This is simply the binary cross-entropy used to optimize fθ so that the probability derived from the inner product of fθ(r) and fθ(t) is close to 1 or 0 depending on the input data.
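A minimal sketch of the matching the blanks loss for a single statement pair, assuming the encoder outputs are plain float vectors (`mtb_loss` is a hypothetical name; the paper computes these representations with BERT and batches the loss):

```python
import math

def mtb_loss(f_r, f_t, label):
    """Binary cross-entropy on p = sigmoid(<f_r, f_t>), where label = 1
    if r and t link to the same entity pair and 0 otherwise."""
    dot = sum(a * b for a, b in zip(f_r, f_t))
    p = 1.0 / (1.0 + math.exp(-dot))  # probability that l = 1
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
```

Minimizing this loss pushes the inner product up for positive pairs and down for negative ones, which is exactly the similarity structure assumed above.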
For generating the training corpus, the authors of the paper use English Wikipedia and extract text passages from the HTML paragraph blocks, ignoring lists and tables. They use Google's off-the-shelf entity linking system to annotate text spans. The span annotations include not only named entities but also other referential entities such as common nouns and pronouns. To prevent a large bias towards relation statements that involve popular entities, they limit the number of relation statements containing the same entity by randomly sampling a constant number of relation statements per entity. Since using their entity linking system on a large amount of data is too expensive, we use our custom entity annotation system. Alternatively, one can use the one from spaCy. To extract the nouns and pronouns, one can simply apply dependency tree parsing using spaCy.
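A simplified stand-in for that spaCy step: instead of running a full pipeline here, the helper below filters pre-tagged tokens by the Universal POS tags that spaCy assigns via `token.pos_`. The function name and the (text, pos) input format are assumptions for illustration:

```python
def candidate_mentions(tagged_tokens):
    """Collect noun and pronoun tokens as candidate entity mentions.

    tagged_tokens: list of (text, pos) pairs using Universal POS tags,
    as produced e.g. by:
        nlp = spacy.load("en_core_web_sm")
        [(t.text, t.pos_) for t in nlp(text)]
    """
    keep = {"NOUN", "PROPN", "PRON"}
    return [text for text, pos in tagged_tokens if pos in keep]
```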
In practice, it is not possible to compare every pair of relation statements and calculate the binary cross-entropy for every combination, so we have to use noise-contrastive estimation to approximate the loss. Instead of summing over all pairs of relation statements that do not contain the same pair of entities, we sample a set of negatives that are either drawn uniformly from the set of all relation statement pairs or drawn from the set of relation statements that share just a single entity. The second set of hard negatives is included to account for the fact that most randomly sampled relation statement pairs are very unlikely to be even remotely related in terms of their topics. This approach enables the network to learn from pairs of relation statements that refer to similar, but different, relations.
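The negative sampling scheme can be sketched like this, assuming relation statements are reduced to their entity-pair ids. `sample_negatives` and `hard_ratio` are illustrative names, not from the paper, and the paper samples from a far larger Wikipedia-derived corpus:

```python
import random

def sample_negatives(statements, anchor, k, hard_ratio=0.5, rng=random):
    """Sample k negatives for an anchor (e1, e2) pair: roughly
    hard_ratio of them are hard negatives sharing exactly one entity
    with the anchor, the rest share no entity with it."""
    a1, a2 = anchor
    hard = [s for s in statements if len({s[0], s[1]} & {a1, a2}) == 1]
    easy = [s for s in statements if not ({s[0], s[1]} & {a1, a2})]
    n_hard = min(int(k * hard_ratio), len(hard))
    return rng.sample(hard, n_hard) + rng.sample(easy, k - n_hard)
```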
For the supervised matching tasks, the BERT-based model pretrained on the MTB task, as well as the BERT-based model not using MTB but "only" the alternative way of passing the entities to the model via entity markers, outperform state-of-the-art methods by 5% to 12% according to the paper.
The state-of-the-art methods to which the authors refer are Wang et al. (2016), Zhang and Wang (2015), Bilan and Roth (2018), and Han et al. (2018), depending on the task.
Also, the authors state that when given access to all of the training data, the BERT-based model without MTB pretraining approaches the performance of the BERT model using MTB. However, when all relation types are kept during training and the number of examples per type is varied, the model pretrained on the MTB task needs only 6% of the training data to match the performance of the BERT model trained on all of the training data.