Abstractive Text Summarization with Transformers

Let’s get started on the tutorial.

The first step is to import the various libraries that we will be using.

We import pandas for data preprocessing and NumPy for linear algebra.

In this tutorial, instead of using plain PyTorch, I use PyTorch Lightning. We still import Dataset and DataLoader from PyTorch so we can create the dataset.

We also import ModelCheckpoint so we can save our model for later use.

From transformers, we will import T5ForConditionalGeneration for the model, AdamW for the optimizer, and T5Tokenizer for the tokenizer.
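
Put together, the imports look roughly like this (exact library versions may differ; recent transformers releases have dropped AdamW in favor of torch.optim.AdamW):

```python
import pandas as pd
import numpy as np

import torch
from torch.utils.data import Dataset, DataLoader

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Older transformers releases exposed AdamW directly; in newer ones,
# import AdamW from torch.optim instead.
from transformers import T5ForConditionalGeneration, T5Tokenizer, AdamW
```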

The dataset I will be using is the “NEWS SUMMARY” dataset from Kaggle (https://www.kaggle.com/sunnysai12345/news-summary). Pandas is used to load and preprocess our data.

Once the dataframe is loaded, I create a new dataframe containing only the relevant columns, then transpose it so the data is arranged column-wise instead of row-wise.

I also rename the columns to make clear which one holds the summary and which one holds the source text. Next, drop the rows with missing values and split the data into a train set and a test set.

Lastly, print out the shapes of the train and test sets to make sure everything looks correct.
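
A simplified sketch of this preprocessing is shown below. It selects and renames the relevant columns directly rather than rebuilding and transposing a dataframe, and the use of scikit-learn’s train_test_split with a 10% test split is an assumption rather than a detail stated above:

```python
from sklearn.model_selection import train_test_split

# The Kaggle CSV typically needs latin-1 encoding.
df = pd.read_csv("news_summary.csv", encoding="latin-1")

# Keep only the relevant columns: "ctext" holds the full article, "text" the summary.
df = df[["text", "ctext"]]
df.columns = ["summary", "text"]

# Drop rows with missing values, then split into train and test sets.
df = df.dropna()
train_df, test_df = train_test_split(df, test_size=0.1)

print(train_df.shape, test_df.shape)
```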

Next, create the PyTorch dataset.

The code snippet for the dataset is shown below:
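
Here is a minimal sketch of what that class can look like; the “text”/“summary” column names, the padding strategy, and replacing padding tokens in the labels with -100 (so the loss ignores them) are assumptions, not necessarily the article’s exact code:

```python
class SummaryDataset(Dataset):
    """Wraps the dataframe and tokenizes source text and summary on the fly."""

    def __init__(self, tokenizer, data, source_max_token_len=512, summary_max_token_len=128):
        self.tokenizer = tokenizer
        self.data = data
        self.source_max_token_len = source_max_token_len
        self.summary_max_token_len = summary_max_token_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Grab the specified row from the dataframe.
        row = self.data.iloc[index]

        # Encode the source text, then the summary.
        source_encoding = self.tokenizer(
            row["text"],
            max_length=self.source_max_token_len,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors="pt",
        )
        summary_encoding = self.tokenizer(
            row["summary"],
            max_length=self.summary_max_token_len,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors="pt",
        )

        labels = summary_encoding["input_ids"]
        labels[labels == self.tokenizer.pad_token_id] = -100  # ignore padding in the loss

        # Return a dictionary with all the data.
        return dict(
            text=row["text"],
            summary=row["summary"],
            text_input_ids=source_encoding["input_ids"].flatten(),
            text_attention_mask=source_encoding["attention_mask"].flatten(),
            labels=labels.flatten(),
            labels_attention_mask=summary_encoding["attention_mask"].flatten(),
        )
```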

As shown in the code snippet above, the SummaryDataset class takes four arguments: the tokenizer, the dataframe, the source text’s maximum token length, and the summary’s maximum token length. Using these arguments, we create the text encodings with the tokenizer.

Now grab the specified row from the dataframe and create the source text’s encoding, followed by the summary’s encoding. The method then returns a dictionary containing all of the data.

Next, create a PyTorch Lightning data module that takes in the train and test dataframes, the source text’s maximum token length, the summary’s maximum token length, the tokenizer, and finally the batch size. In the setup method, we instantiate the dataset class we created earlier for both splits. Finally, in the dataloader methods, we return a DataLoader for each of those datasets.
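
A sketch of such a data module, reusing the SummaryDataset sketch above (the num_workers value is an arbitrary choice):

```python
class SummaryDataModule(pl.LightningDataModule):
    def __init__(self, train_df, test_df, tokenizer, batch_size=8,
                 source_max_token_len=512, summary_max_token_len=128):
        super().__init__()
        self.train_df = train_df
        self.test_df = test_df
        self.tokenizer = tokenizer
        self.batch_size = batch_size
        self.source_max_token_len = source_max_token_len
        self.summary_max_token_len = summary_max_token_len

    def setup(self, stage=None):
        # Build the dataset objects defined earlier.
        self.train_dataset = SummaryDataset(
            self.tokenizer, self.train_df,
            self.source_max_token_len, self.summary_max_token_len)
        self.test_dataset = SummaryDataset(
            self.tokenizer, self.test_df,
            self.source_max_token_len, self.summary_max_token_len)

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size,
                          shuffle=True, num_workers=2)

    def val_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size, num_workers=2)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size, num_workers=2)
```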

The next step is to define our model.
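
A sketch of that configuration follows; the epoch count and batch size here are placeholder values, not necessarily the ones used in the original run:

```python
MODEL_NAME = "t5-base"

# Fetch the pre-trained tokenizer for the chosen model.
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

# Placeholder training hyperparameters.
N_EPOCHS = 3
BATCH_SIZE = 8
```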

In the code above I define the model name. The model being used is the “t5-base” model from Hugging Face. Next, I fetch the pre-trained tokenizer using the specified model name, and then define the number of epochs and the batch size.

Then we build our data module using the batch size, the tokenizer, and the train and test data. Next, define the SummaryModel class: in its constructor I fetch the pre-trained T5ForConditionalGeneration model, and the forward method simply calls it.

After the model is defined, we need to define the training step. First, define a shared base step method: grab the input IDs, text attention mask, labels, and labels attention mask from the batch, pass them all into the model, and log the loss. From this base method we can define the training, validation, and test steps. For the optimizer, return the AdamW optimizer with a learning rate of 1e-4.
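
Combining the last two paragraphs, a sketch of the data module instantiation and the SummaryModel class could look like the following; the shared _step helper and the batch key names mirror the dataset sketch above and are assumptions rather than the original code:

```python
data_module = SummaryDataModule(train_df, test_df, tokenizer, batch_size=BATCH_SIZE)

class SummaryModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Fetch the pre-trained T5 model; forward() simply calls it.
        self.model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)

    def forward(self, input_ids, attention_mask, decoder_attention_mask, labels=None):
        output = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
            decoder_attention_mask=decoder_attention_mask,
        )
        return output.loss, output.logits

    def _step(self, batch, stage):
        # Shared base step: pull the tensors out of the batch, run the model,
        # and log the loss under the given stage name.
        loss, _ = self(
            input_ids=batch["text_input_ids"],
            attention_mask=batch["text_attention_mask"],
            decoder_attention_mask=batch["labels_attention_mask"],
            labels=batch["labels"],
        )
        self.log(f"{stage}_loss", loss, prog_bar=True, logger=True)
        return loss

    def training_step(self, batch, batch_idx):
        return self._step(batch, "train")

    def validation_step(self, batch, batch_idx):
        return self._step(batch, "val")

    def test_step(self, batch, batch_idx):
        return self._step(batch, "test")

    def configure_optimizers(self):
        return AdamW(self.parameters(), lr=1e-4)
```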

Now comes the training of the model.
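
A sketch of this training setup (described in the next paragraph) is shown below. The checkpoint directory, refresh rate, and single-GPU setting are placeholder choices, and the gpus / progress_bar_refresh_rate arguments match the older PyTorch Lightning API this tutorial was written against; newer releases use accelerator="gpu", devices=1 and a TQDMProgressBar callback instead:

```python
model = SummaryModel()

# Periodically save the best checkpoint (lowest validation loss) for later use.
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints",
    filename="best-checkpoint",
    save_top_k=1,
    verbose=True,
    monitor="val_loss",
    mode="min",
)

trainer = pl.Trainer(
    callbacks=[checkpoint_callback],
    max_epochs=N_EPOCHS,
    gpus=1,                        # newer Lightning: accelerator="gpu", devices=1
    progress_bar_refresh_rate=30,  # removed in newer Lightning; use TQDMProgressBar
)

trainer.fit(model, data_module)
```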

In the code above I define a ModelCheckpoint callback so Lightning periodically saves checkpoints of the model for future use. Next, define a PyTorch Lightning Trainer and pass in the ModelCheckpoint callback, the number of epochs, the GPUs, and the progress bar refresh rate. We then fit the trainer with the model and the data module. And there it is, the model is now training!
