Guide to fine-tuning Text Generation models: GPT-2, GPT-Neo and T5


The code used in this article can be found here — GPT and T5. To read more about text generation models, see this. For more such articles visit my website or have a look at my latest short book on Data science. You can also connect with me on LinkedIn.


Recent research in NLP has led to the release of several massive pre-trained text generation models such as GPT-{1,2,3}, GPT-{Neo, J} and T5. If the audience (including you and me) was not impressed by parameter counts running into the billions, we were enthralled by the ease with which these models can be applied to a completely new, unseen task without training for a single epoch! While this is fine for quick experiments, for any real production deployment it is still recommended to train the model further on the specific task. This is called fine-tuning, and in this article we will learn, hands-on, how to fine-tune some of the best (read: state-of-the-art) language models currently available. We will also compare their performance by fine-tuning them on a Twitter sentiment detection dataset. Let’s get started!

Text generation models

Text generation is an interesting task in NLP, where the goal is to generate text given some prompt as input. Usually, we apply some form of sequence-to-sequence model for this task. These are called language models, as they can be used to predict the next word based on the preceding text. The recent surge of interest in this field has two main causes: (1) the availability of several high-performance pre-trained models, and (2) the ease of recasting a large variety of NLP tasks as text-in, text-out problems. The T5 authors show this very intuitively: the same model can be used for language translation, text regression, summarization, etc.

T5 text-to-text framework examples. Source: Google AI Blog

In this article, we will focus on the following models:

  • GPT-2: It is the second iteration of the original series of language models released by OpenAI. In fact, this series of GPT models made language models famous! GPT stands for “Generative Pre-trained Transformer”, and currently we have 3 versions of the model (v1, v2 and v3). Of these, only GPT-1 and GPT-2 are open-sourced, so we will pick the latest open version, GPT-2, for our experiment. On the technical side, the architecture of GPT-2 consists of the decoder part of the Transformer architecture.
  • GPT-Neo: This model was released by EleutherAI to counter the GPT-3 model, which was not open-sourced. The architecture is quite similar to GPT-3, but training was done on The Pile, an 825 GB text dataset.
  • T5: stands for “Text-to-Text Transfer Transformer” and was Google’s answer to the world for open-source language models. The T5 paper shows that using the complete encoder-decoder architecture of the Transformer works better than using only the decoder (as the GPT series does), hence T5 stays true to the original Transformer architecture.

A brief comparison of the different models is shown below. Note that each model comes in several versions that differ in the number of tunable parameters. For this article, we will pick the 117M GPT-2, the 125M GPT-Neo and the 220M T5.

Comparing different text generation models. Source: [A lazy data science guide]

Sentiment detection task and Dataset

To test the performance of the different language models, we will compare their accuracy after fine-tuning on a simple task: sentiment detection. Here, we will use the Twitter Sentiment dataset, which can be downloaded from here. In total, it contains over 1.6M tweets, each labelled as either positive or negative. For computational efficiency, we will sample 10k tweets with a nearly equal distribution of the sentiment classes. Then, we will train the model on 95% of the data and keep the remaining 5% for testing. For a fair comparison, we will use the same train and test split for all three models. Finally, we will perform 3 trials of splitting and training for each model, which approximates a 3-fold validation test. We will report the individual and aggregated (mean) macro F1 scores, which can be used to compare model performance.
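
To make the reported metric concrete, here is a minimal pure-Python sketch of the macro F1 score: the F1 of each class is computed separately and the results are averaged with equal weight. The function name f1_macro and the lowercase label strings are assumptions made for illustration; a library routine such as sklearn’s f1_score with average="macro" would normally be used instead.

```python
def f1_macro(y_true, y_pred, labels=("negative", "positive")):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); treated as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["positive", "positive", "negative", "negative"]
y_pred = ["positive", "negative", "negative", "negative"]
# negative F1 = 4/5, positive F1 = 2/3, so the macro average is their mean
score = f1_macro(y_true, y_pred)
```

Because both classes count equally, a model that ignores one class is penalised even when the classes are slightly imbalanced.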

Twitter Sentiment dataset examples. By Author.

Now the next obvious question is: how can we recast the sentiment detection task as a text generation task? The answer is quite simple: all we have to do is create an intuitive prompt (a template filled with data) that reflects how a similar representation could occur on the web. Think of it this way: we want to provide the tweet as input and get the sentiment as output. So, in our prompt, we pass a single tweet after the Tweet: prefix and expect the model to predict the sentiment on the next line, after the Sentiment: prefix. This process of creating an effective prompt is called prompt engineering, and it has been shown that just changing the prompt can make language models perform better! For our use case, we can start with a very simple prompt format. We will have two different prompts, one for training and one for testing. Examples are shown below.

Training prompt (as we want the model to learn this “pattern” to solve the “task”)

Tweet: I am not feeling well.
Sentiment: Negative

Test prompt (as now we hope the model has learned the “task” and hence could complete the “pattern”)

Tweet: I am feeling well.
Sentiment:

So during the testing, we will extract the word predicted by the model after the prefix Sentiment: and consider that word as the predicted sentiment label. Now let’s dive into the implementation!
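
The two prompt formats can be sketched as a pair of small helpers. The function names are hypothetical, and the sketch assumes the test prompt ends with the bare Sentiment: prefix so that the model only has to supply the label.

```python
def make_train_prompt(tweet, sentiment):
    """Full pattern: the tweet followed by its sentiment label."""
    return f"Tweet: {tweet}\nSentiment: {sentiment}"


def make_test_prompt(tweet):
    """Same pattern with the label left out for the model to complete."""
    return f"Tweet: {tweet}\nSentiment:"


train_prompt = make_train_prompt("I am not feeling well.", "Negative")
test_prompt = make_test_prompt("I am feeling well.")
```

Keeping both helpers next to each other makes it harder for the training and test formats to drift apart, which matters because the model is only ever taught the exact training pattern.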

Fine-tuning GPT-2 and GPT-Neo

One point to note: GPT-2 and GPT-Neo share nearly the same architecture, so most of the fine-tuning code stays the same. Hence, for brevity’s sake, I will only share the code for GPT-2, but I will point out the changes required to make it work for GPT-Neo as well. OK, let’s get started with the dataset: we begin by creating a PyTorch Dataset class, which defines how we prepare the data for training.

This class has 3 methods:

  • __init__: where we tokenize and store the data.
  • __len__: where we return the length of the total dataset. This is required for step size calculation within each epoch.
  • __getitem__: where we fetch one item and return it.

Some additional points: (1) on line 8, we define the mapping used to transform the original numeric sentiment labels into textual labels, (2) on line 12, we transform the data into the training prompt we decided on, and (3) on line 14, we perform the tokenization (splitting the tweet into tokens and replacing them with their unique ids).
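
The shape of such a class can be sketched without any dependencies, since a map-style dataset only needs __len__ and __getitem__. This is an illustration, not the article’s code: the 0/4 label encoding is an assumption about the raw data, ToyTokenizer is a hypothetical stand-in for a real tokenizer, and a real implementation would subclass torch.utils.data.Dataset and return tensors.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: whitespace split, incremental ids."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, text):
        # Assign the next free id to every token not seen before
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]


class SentimentDataset:
    """Tokenizes every training prompt once and serves items by index."""

    # Assumed mapping from numeric labels to label words
    LABEL_MAP = {0: "negative", 4: "positive"}

    def __init__(self, tweets, labels, tokenize):
        self.items = []
        for tweet, label in zip(tweets, labels):
            prompt = f"Tweet: {tweet}\nSentiment: {self.LABEL_MAP[label]}"
            self.items.append(tokenize(prompt))

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]


dataset = SentimentDataset(
    ["I am not feeling well.", "I am feeling well."], [0, 4], ToyTokenizer()
)
```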

Next, let’s connect the data with the Dataset class. The code breakdown is as follows:

  • Line 4–8: We begin by loading the dataset. You can download it from here and modify the local path at line 4. Next, we subset the relevant columns and rename them. On line 8, we sample 10k tweets for this experiment.
  • Line 10–13: We split the data into train and test sets, with a 95% and 5% split, respectively. We use the stratify flag so that both splits keep the same sentiment class distribution.
  • Line 16: We pass the train data to the SentimentDataset class. Note that we could have done the same for the test data, but I just returned the test data in its original form.
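
What stratification does can be illustrated with a small dependency-free sketch: the rows are grouped by label and the same fraction of each group is held out. The function name, the row layout and the seed parameter are assumptions made for the example; in practice a library splitter with a stratify option does the same job.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", test_frac=0.05, seed=0):
    """Hold out the same fraction of every class so both splits stay balanced."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for group in by_label.values():
        group = group[:]                       # copy so the caller's data is untouched
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy data: 100 negative and 100 positive placeholder tweets
rows = [{"text": f"tweet {i}", "label": "negative" if i % 2 else "positive"}
        for i in range(200)]
train_rows, test_rows = stratified_split(rows, test_frac=0.05, seed=42)
```

Changing the seed between trials gives the different splits needed for the three repeated runs while keeping every split balanced.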

Now we will prepare for the training of the model. Code breakdown is as follows,

  • Line 10–13: We load the tokenizer, add some special tokens we will use to denote separate parts of tweets and finally load the model. Note, the model_name is defined on line 5. Also note, we add the special tokens so that the model learns the start and end of the prompt. This will be helpful later on during the testing phase, as we don’t want the model to keep on writing the next word, but it should know when to stop the process. This can be done by setting the eos_token and training the model to predict it after the label, as done here.
  • Line 16: Load and prepare the dataset using the functions we defined before.
  • Line 21–24: We set configurations for the training process. In short, we define where and when to save the model, how long to train and where to save the logs and also the training strategy with batch_size, warmup_steps and weight_decay.
  • Line 27–31: We start the training by connecting the model with the training dataset. We also define how to process the training data inside data_collator. The first two elements within the collator are input_ids, the tokenized prompt, and attention_mask, a simple 1/0 vector that denotes which part of the tokenized vector is the prompt and which part is padding. The last part is quite interesting: we pass the input data as the label instead of just the sentiment labels. This is because we are training a language model, so we want the model to learn the pattern of the whole prompt and not just the sentiment class. In a sense, the model learns to predict the words of the input tweet plus the sentiment, structured as in the prompt, and in the process learns the sentiment detection task.
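
The collator’s three outputs can be sketched with plain Python lists. This is a simplified illustration under stated assumptions: the function name and pad_id are hypothetical, padding is done inside the collator for compactness, and real training code would return stacked tensors instead of lists.

```python
def collate(batch, pad_id=0):
    """Pad a batch of token-id lists and reuse the inputs as the labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = prompt, 0 = padding
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        # Language-model objective: the target sequence is the input sequence itself
        "labels": [row[:] for row in input_ids],
    }


features = collate([[5, 6, 7], [8, 9]], pad_id=0)
```

The key point is the last entry: there is no separate sentiment target, because the label word is already part of the token sequence the model learns to reproduce.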

This will begin the training. It could take some time based on your computer’s specifications.

Finally, we define the test block, where we take the trained model and apply it to the held-out test data. The code breakdown is as follows,

  • Line 5: We turn on the evaluation mode on the model.
  • Line 8–15: For each test example, we first prepare the prompt, but with one big difference: we don’t include the sentiment label, as this is what we want the model to predict. Also, remember eos_token: we are hoping the model will predict the sentiment label and then stop by emitting eos_token. Finally, we tokenize the test prompt.
  • Line 17: We take the test prompt and predict the next set of words. This function has a lot of parameters that define how the next word is predicted. For details about what each of them does, refer to this, or to better understand the different strategies of next word prediction, refer to this.
  • Line 20–30: We start by decoding the predicted text, i.e. we transform the predicted token ids back into text. Then we extract the predicted sentiment label and store all relevant information in lists.
  • Line 33–37: We first combine all extracted info into a pandas dataframe for better readability and then use the f1_score function from the sklearn package to compute the performance of the complete model.
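
The label extraction step can be sketched as a small string-processing helper. The function name and the "<|endoftext|>" end-of-sequence string are assumptions for illustration; the actual token depends on how the tokenizer was configured.

```python
def extract_sentiment(generated_text, eos_token="<|endoftext|>"):
    """Return the first word after 'Sentiment:' in the decoded output, or None."""
    marker = "Sentiment:"
    if marker not in generated_text:
        return None
    tail = generated_text.split(marker, 1)[1]  # everything after the prefix
    tail = tail.split(eos_token, 1)[0]         # drop the end token and anything after it
    words = tail.split()
    return words[0].lower() if words else None


decoded = "Tweet: I am feeling well.\nSentiment: Positive<|endoftext|>"
predicted = extract_sentiment(decoded)
```

Returning None when no label is found keeps malformed generations visible, so they can be counted as errors instead of being silently dropped.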

On running the code for GPT-2 and repeating the experiment three times with a different random_state in the dataset split code, we observed that the model does behave as expected: it predicts the label and then stops by emitting eos_token. The average macro F1 score is 81.7%! This is comparable to what we would expect from a dedicated sentiment detection model, and it highlights how easy it is to do transfer learning with text generation models in NLP.

GPT-Neo compliant code

To make the GPT-2 code work for GPT-Neo, we have to make the following modifications:

  • import GPTNeoForCausalLM
  • set model_name to a GPT-Neo checkpoint such as "EleutherAI/gpt-neo-2.7B" (any of the available model sizes works; the comparison in this article uses the 125M one)
  • use GPTNeoForCausalLM in place of GPT2LMHeadModel when loading the model.

And that’s it! On running the modified code for GPT-Neo, and following the same training strategy, the average f1 macro performance score was 80.7%!
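
One way to keep a single script for both model families is to pick the model class from the checkpoint name. This is a hedged sketch, not the article’s code: both helper names are hypothetical, "gpt2" is assumed to be the id of the GPT-2 checkpoint, and transformers is imported lazily so the name-selection logic can be used without it.

```python
def causal_lm_class_name(model_name):
    """Name of the transformers class to use for a GPT-2 or GPT-Neo checkpoint."""
    if "gpt-neo" in model_name.lower():
        return "GPTNeoForCausalLM"
    return "GPT2LMHeadModel"


def load_causal_lm(model_name):
    """Load the pre-trained model with the class matching its family."""
    import transformers  # lazy import: only needed when a model is actually loaded

    model_cls = getattr(transformers, causal_lm_class_name(model_name))
    return model_cls.from_pretrained(model_name)


gpt2_class = causal_lm_class_name("gpt2")
neo_class = causal_lm_class_name("EleutherAI/gpt-neo-2.7B")
```

Calling load_causal_lm downloads the checkpoint on first use, so it is best run once at the start of training.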

Fine-tuning T5

The architecture of T5 is different from GPT models, as it stays true to the original transformer’s architecture, while the GPT models only keep the decoder part. For training T5 we will use an excellent wrapper package called SimpleT5, which removes most of the boilerplate from the training phase. Now please remember, while the syntax of training will change, the overall flow and intuition remains the same. Let’s start with the data part.

Here, the majority of the code remains the same as what we did before for the GPT models. One major change is that we don’t need the Dataset class, as SimpleT5 works directly on pandas dataframe. Hence we load the data, do some initial pre-processing, split the data and return the pandas dataframe. (no need to tokenize, create Dataset class, isn’t this great!?)

One more point to note is that we do not need to create prompt formats for this package. This way we can separate out the input tweet and sentiment label into different columns, here source_text and target_text, respectively.
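
The two-column layout can be sketched as a list of records with source_text and target_text keys. The function name and the 0/4 label encoding are assumptions for illustration; passing the resulting list to pandas.DataFrame would give a dataframe with those two columns.

```python
def to_t5_records(tweets, labels, label_map=None):
    """Pair each tweet (source_text) with its label word (target_text)."""
    label_map = label_map or {0: "negative", 4: "positive"}  # assumed raw encoding
    return [
        {"source_text": tweet, "target_text": label_map[label]}
        for tweet, label in zip(tweets, labels)
    ]


records = to_t5_records(["I am not feeling well.", "I am feeling well."], [0, 4])
```

Unlike the GPT prompt, the input and the expected output stay separate here, which matches the encoder-decoder design: the encoder reads source_text and the decoder is trained to produce target_text.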

Loading and training the model is also super easy and can be done with 3 lines of code (if you can ignore my pretty line separations).

Next, we test the fine-tuned T5 model on the test dataset. As you can see, the inference part is also super easy: on line 11, we use the predict function and just pass the source_text to get the predicted sentiment label. We later compare this with the original_label to generate the performance score on line 18.

On running the T5 code, and following the same training strategy as before, the average f1 macro performance score was 80.7%!


Consolidating all of the results into a single table, we get,

Comparing GPT-2, GPT-Neo and T5 on sentiment detection task.

One point I wanted to discuss is that I haven’t played with the hyperparameters at all. Add to that the prompt engineering methodology, and I think that just by playing around with these two, we can further improve the performance metric for all of the models. I will leave that as an exercise for the readers (do let me know if you get better scores!)


While GPT-2 may have won this round, the result table does show the prowess of text generation models on the whole. All of them performed very well on the sentiment detection task, and all it took was a few epochs of training. Even though this experiment covered only a single task, I hope it helps to show how easy it is to use text generation models for completely new tasks. In a way, if we can recast an NLP problem as text generation, rest assured the pre-trained model will not fail, well, at least not drastically 🙂 This makes them the perfect baseline, if not the state of the art, for many tasks.


  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, and Sharan Narang. Exploring the limits of transfer learning with a unified text-to-text transformer. 2020. arXiv:1910.10683
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, and others. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • GPT-Neo
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017. arXiv:1706.03762.

