Fine-tuning BERT for semantic sentence pairs classification



Fine-tuning BERT with TensorFlow and tensorflow_hub on a sentence-pair classification task: identifying whether two sentences are semantically equivalent or not

Image is taken from ASHAWire

Bidirectional Encoder Representations from Transformers ( BERT )

BERT is the MVP of the latest state-of-the-art NLP models. BERT's real fame comes from its universal nature: it can be applied to most real-world NLP use cases, such as text summarization, classification/sentiment detection, language inference, question answering, and much more.

Let's first look at the theory of how BERT can help with today's problem statement and how it performs such a task efficiently under the hood. BERT is trained using two different techniques: Masked Language Modeling and Next Sentence Prediction. These two training phases allow BERT to generalize to a variety of NLP tasks.

Masked Language Modeling (MLM) randomly masks/hides some of the input tokens and then lets the model predict those words. The final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, as in a standard language model, to recover the original words. This method of training gives us a deep bidirectional model that is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model.
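As a minimal sketch of the masking idea (not BERT's exact recipe, which also sometimes keeps or randomizes the chosen token), the following toy function hides roughly 15% of tokens and records which positions the model would be asked to predict:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Illustrative MLM masking: hide ~15% of tokens and record the
    (position, original token) pairs the model must predict."""
    rng = rng or random.Random(0)
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append((i, tok))
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, rng=random.Random(1))
print(masked)
```

During pretraining, only the hidden vectors at the masked positions contribute to the MLM loss; the rest of the sequence provides bidirectional context.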

Next Sentence Prediction (NSP) is especially relevant to our problem statement. Many important downstream tasks, such as Question Answering (QA) and Natural Language Inference (NLI), are based on understanding the relationship between two sentences, which is not directly captured by language modeling. That's why BERT is trained to predict, given two sentences A and B at a time, whether B is the sentence that follows A. This lets BERT identify the relationship between two sequences based on their semantic structure, including how often the two sequences occur together and the information they share. More details on how BERT works can be found in this paper.
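To feed two sentences at once, BERT concatenates them as [CLS] A [SEP] B [SEP] and marks which sentence each token belongs to with segment (token type) ids. A format-only sketch (strings instead of vocabulary ids):

```python
def pack_pair(tokens_a, tokens_b):
    """Illustrative BERT-style pair encoding: [CLS] A [SEP] B [SEP],
    with segment id 0 for the first sentence (including [CLS] and its
    [SEP]) and 1 for the second sentence and its [SEP]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = pack_pair(["he", "is", "tall"], ["the", "man", "is", "tall"])
print(tokens)
print(segment_ids)
```

The hidden state of the leading [CLS] token is what BERT uses to classify the relationship between the two segments, both in NSP pretraining and in our fine-tuning task.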

Project Overview :

Today's task is to identify whether two sequences are semantically equivalent based on the labels provided for training. In the code below we'll see how to transform both sequences into a format BERT accepts by creating efficient custom input pipelines and BERT-specific preprocessing. Finally, we will take a BERT model from tf_hub and fine-tune it on our data, following the original paper's fine-tuning guidelines.

Table of contents:

[Step 1]: Setting up TensorFlow and Colab Runtime

[Step 2]: Installing TensorFlow Model Garden package and other required dependencies

[Step 3]: Importing libraries and defining BERT path URL

[Step 4]: Getting the dataset from TensorFlow Datasets

[Step 5]: Preprocessing Data using a separate TensorFlow model

[Step 6]: Wrapping a function to apply the preprocessing model on the entire dataset using the .map method

[Step 7]: Creating a TensorFlow Input Pipeline with

[Step 8]: Adding a Classification Head to the BERT hub.KerasLayer

[Step 9]: Fine-Tuning BERT for sentence pairs Classification

[Step 10]: Evaluating Model performance visually

[Step 11]: Exporting Model for Inference

[Step 12]: Testing our BERT Model

Setting up Tensorflow and Colab Runtime

In this project, we will use a Tesla T4 GPU for our BERT model. Type the commands below to verify this requirement. Having this exact GPU is not strictly necessary, but you can try restarting your Colab runtime a few times to get one like it.

import tensorflow as tf

print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

!nvidia-smi
The output of above code [ Image by Author ]

Installing TensorFlow Model garden package and other required dependencies

This process is simple: just install the required libraries for BERT preprocessing and model-related tasks. Execute the commands below one by one, each in its own Colab cell. After installing these libraries you have to restart your kernel to avoid some unexpected errors.

# Install all libraries separately
# 1.
!pip install -q -U tensorflow-text
# 2.
!pip install -q tf-models-official
# 3.
!pip install -U tfds-nightly

Importing libraries and defining BERT path URL

Next, we'll import the modules from the libraries we've just installed, and we'll also provide links for the BERT model to be used and its respective tokenizer. Every model has its own data requirements and format, so we need to use the tokenizer/preprocessor that matches our BERT model.

import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from official.modeling import tf_utils
from official import nlp
from official.nlp import bert
# Load the required submodules
import official.nlp.optimization
from official.nlp import optimization  # to create AdamW optimizer
import official.nlp.bert.bert_models
import official.nlp.bert.configs
import official.nlp.bert.run_classifier
import official.nlp.bert.tokenization
import official.nlp.modeling.losses
import official.nlp.modeling.models
import official.nlp.modeling.networks

tf.get_logger().setLevel('ERROR')

# Model URL
hub_bert_url = ""
# Preprocessor URL for BERT
hub_handle_preprocess = ""

bert_preprocess = hub.load(hub_handle_preprocess)
tok = bert_preprocess.tokenize(tf.constant(['Hello TensorFlow!']))
print(tok)  # Prints token ids of above input

Getting the dataset from Tensorflow Datasets

The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in each pair are semantically equivalent. We will download it using TensorFlow Datasets.

– Number of labels: 2.

– Size of training dataset: 3668.

– Size of evaluation dataset: 408.

– Maximum sequence length of training and evaluation dataset: 128.

# Below command will download data into the 'glue' variable as a dictionary
glue, info = tfds.load('glue/mrpc', with_info=True,
                       batch_size=-1)  # It's small, load the whole dataset
Data description [ Image by Author ]

We'll be using sentence1 and sentence2 as input features to predict label, which indicates whether a pair is semantically equivalent using binary labels [0, 1].

Preprocessing Data using a separate TensorFlow model

Each preprocessing model also provides a method, .bert_pack_inputs(tensors, seq_length), which takes a list of tokens (like tok above) and a sequence length argument. This packs the inputs to create a dictionary of tensors in the format expected by the BERT model. We will be applying a TensorFlow function that takes raw inputs and returns tensor objects that are ready to be fed to the BERT model.
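Conceptually, the packer produces three fixed-length tensors: input_word_ids, input_mask, and input_type_ids. A plain-Python sketch of that packing (the ids 101/102/0 are the [CLS]/[SEP]/[PAD] ids of the standard uncased English BERT vocabulary; the real op is model-dependent and loaded from the SavedModel):

```python
def pack_inputs(token_ids_a, token_ids_b, seq_length=128,
                cls_id=101, sep_id=102, pad_id=0):
    """Sketch of bert_pack_inputs for a sentence pair: build
    [CLS] A [SEP] B [SEP], truncate to seq_length, then pad."""
    ids = [cls_id] + token_ids_a + [sep_id] + token_ids_b + [sep_id]
    ids = ids[:seq_length]
    type_ids = ([0] * (len(token_ids_a) + 2) +
                [1] * (len(token_ids_b) + 1))[:seq_length]
    mask = [1] * len(ids)            # 1 for real tokens, 0 for padding
    pad = seq_length - len(ids)
    return {
        "input_word_ids": ids + [pad_id] * pad,
        "input_mask": mask + [0] * pad,
        "input_type_ids": type_ids + [0] * pad,
    }

packed = pack_inputs([7592, 23435], [999], seq_length=8)
print(packed)
```

Every example ends up with the same seq_length, so batches stack cleanly, and input_mask tells BERT's attention layers to ignore the padding positions.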

def make_bert_preprocess_model(sentence_features, seq_length=128):
  """Returns Model mapping string features to BERT inputs.

  Args:
    sentence_features: a list with the names of string-valued features.
    seq_length: an integer that defines the sequence length of BERT inputs.

  Returns:
    A Keras Model that can be called on a list or dict of string Tensors
    (with the order or names, resp., given by sentence_features) and
    returns a dict of tensors for input to BERT.
  """
  input_segments = [
      tf.keras.layers.Input(shape=(), dtype=tf.string, name=ft)
      for ft in sentence_features]

  # Tokenize the text to word pieces.
  bert_preprocess = hub.load(hub_handle_preprocess)
  tokenizer = hub.KerasLayer(bert_preprocess.tokenize, name='tokenizer')
  segments = [tokenizer(s) for s in input_segments]
  truncated_segments = segments

  # Pack inputs. The details (start/end token ids, dict of output tensors)
  # are model-dependent, so this gets loaded from the SavedModel.
  packer = hub.KerasLayer(bert_preprocess.bert_pack_inputs,
                          arguments=dict(seq_length=seq_length),
                          name='packer')
  model_inputs = packer(truncated_segments)
  return tf.keras.Model(input_segments, model_inputs)

# Let's plot the model
test_preprocess_model = make_bert_preprocess_model(['my_input1', 'my_input2'])
The plot of preprocessor model [ Image by Author ]

Wrapping a function to apply the preprocessing model on the entire dataset using .map method

To apply the preprocessing in all the inputs from the dataset, we will use the .map function from the dataset.

# Here we will pass each individual dataset split like ['train', 'test', 'validation']
# to the function below, which applies the above TF preprocessing model to the
# provided inputs and returns data ready for BERT input.
AUTOTUNE =

def load_dataset_from_tfds(in_memory_ds, info, split, batch_size,
                           bert_preprocess_model):
  is_training = split.startswith('train')
  dataset =[split])
  num_examples = info.splits[split].num_examples

  if is_training:
    dataset = dataset.shuffle(num_examples)
    dataset = dataset.repeat()
  dataset = dataset.batch(batch_size)
  dataset = ex: (bert_preprocess_model(ex), ex['label']),
                        num_parallel_calls=AUTOTUNE)
  dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
  return dataset, num_examples

Creating a TensorFlow Input Pipeline with

Next, we'll apply the wrapper function to the entire dataset, which in turn gives us train and validation input datasets ready for BERT. We need to initialize the input layers to accept only the text fields by giving each layer the name of its field; otherwise we'd get errors, since the raw glue data also includes idx and label fields that the input layers don't accept. See the code snippet below.

epochs = 4
batch_size = 32
init_lr = 2e-5

print(f'Creating Data for BERT')

# Initializing Preprocessing model to only accept text fields as inputs
sentence_features = ['sentence1', 'sentence2']
bert_preprocess_model = make_bert_preprocess_model(sentence_features)

# Running everything on CPU
with tf.device('/cpu:0'):
  # Train data
  train_data, train_data_size = load_dataset_from_tfds(
      glue, info, 'train', batch_size, bert_preprocess_model)
  steps_per_epoch = train_data_size // batch_size
  num_train_steps = steps_per_epoch * epochs
  num_warmup_steps = num_train_steps // 10

  # Validation data
  valid_data, no_of_valid_examples = load_dataset_from_tfds(
      glue, info, 'validation', batch_size, bert_preprocess_model)

print(train_data_size)
print(no_of_valid_examples)
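Plugging in the dataset sizes from Step 4, the training schedule works out as follows (assuming the 3,668 MRPC training examples, batch size 32, and 4 epochs used in this article):

```python
train_data_size = 3668   # MRPC training examples (from Step 4)
batch_size = 32
epochs = 4

steps_per_epoch = train_data_size // batch_size   # optimizer steps per epoch
num_train_steps = steps_per_epoch * epochs        # total fine-tuning steps
num_warmup_steps = num_train_steps // 10          # 10% of steps for LR warmup

print(steps_per_epoch, num_train_steps, num_warmup_steps)
```

So the learning rate ramps up over the first ~10% of the 456 total steps and then decays, matching the warmup-then-decay schedule the AdamW optimizer from official.nlp.optimization implements.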

The resulting datasets return (features, labels) pairs, as expected by Let's see the input tensors.

Input data final tensors [ Image by Author ]

Adding a Classification Head to the BERT hub.KerasLayer

Full model base architecture [ Image by Author ]

Next, we'll add a classification head to BERT. Since we are dealing with a classification problem, we only want the pooled_output from BERT, which contains the embedding for the complete sequence. The pooled output is a tensor of shape (batch_size, hidden_dim). After that, we'll pass the pooled output through a dropout layer (for regularization, to reduce overfitting) and then a final dense layer.
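As a minimal numpy sketch of what this head computes, with random weights standing in for the trained dropout/dense layers (768 is the hidden size of BERT-base; the 0.4 dropout rate matches the Keras code below):

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, hidden_dim, num_classes = 4, 768, 2
pooled_output = rng.standard_normal((batch_size, hidden_dim))

# Dropout (training mode): zero out ~40% of activations and rescale
# the survivors by 1/0.6 so the expected activation is unchanged.
keep = rng.random(pooled_output.shape) >= 0.4
dropped = np.where(keep, pooled_output / 0.6, 0.0)

# Dense layer: project hidden_dim down to one raw logit per class
# (no activation, since the loss will apply softmax itself).
W = rng.standard_normal((hidden_dim, num_classes)) * 0.02
b = np.zeros(num_classes)
logits = dropped @ W + b

print(logits.shape)  # one pair of logits per example in the batch
```

Emitting raw logits (activation=None) pairs with a from_logits=True cross-entropy loss, which is numerically more stable than applying softmax inside the model.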

max_seq_length = 128

# Building the model
def create_model():
  input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,),
                                         dtype=tf.int32,
                                         name="input_word_ids")
  input_mask = tf.keras.layers.Input(shape=(max_seq_length,),
                                     dtype=tf.int32,
                                     name="input_mask")
  input_type_ids = tf.keras.layers.Input(shape=(max_seq_length,),
                                         dtype=tf.int32,
                                         name="input_type_ids")

  encoder_inputs = dict(input_word_ids=input_word_ids,
                        input_mask=input_mask,
                        input_type_ids=input_type_ids)

  bert_layer = hub.KerasLayer(" 768_A-12/4",
                              trainable=True, name='encoder')
  outputs = bert_layer(encoder_inputs)
  pooled_output = outputs["pooled_output"]

  drop = tf.keras.layers.Dropout(0.4)(pooled_output)
  output = tf.keras.layers.Dense(2, activation=None, name="output")(drop)

  model = tf.keras.Model(
      inputs={
          'input_word_ids': input_word_ids,
          'input_mask': input_mask,
          'input_type_ids': input_type_ids,
      },
      outputs=output)
  return model

