Fine-Tuning HuBERT for Emotion Recognition in Custom Audio Data Using Huggingface

Why Audio Data?

NLP for audio data is not getting enough recognition, compared to NLP for text and computer vision tasks. Time to change that!


Emotion recognition — recognize whether spoken audio exhibits anger, happiness, sadness, disgust, surprise, or neutral emotions.

Note: Once we are through with the tutorial, you should be able to reuse the code for any audio classification task.


For this tutorial, we will use the publicly available Crema-D dataset on Kaggle. (A huge thanks to David Cooper Cheyney for putting together this awesome dataset.) So go ahead and click the Download button on this link. You should see the archive containing the Crema-D audio files start to download. It contains 7k+ audio files in the .wav format.

Note: Feel free to use any audio data that you collected instead of the CremaD dataset.

In case you’d like to follow along with this tutorial, here’s the GitHub repo.

Hugging Face Library & Trainer API

As mentioned in the title, we will be using the Hugging Face library for training the model. In particular, we will be making use of its Trainer class API.

Why Trainer? Why not write a standard training loop in PyTorch?

Here’s what the standard boilerplate code looks like in PyTorch:

Taken from my introductory tutorial to PyTorch

In contrast, the Trainer simplifies the intricacies involved in writing a training loop so much so that the training can happen in a single line:


In addition to supporting the basic training loop, it allows distributed training on multiple GPUs/TPUs, callbacks such as early stopping, evaluating results on a test set, etc. All this can be achieved by simply setting a few arguments when initializing the Trainer class.

If not for anything, I feel substituting Trainer for vanilla PyTorch has definitely led to a more organized and cleaner-looking codebase.

Let’s begin…


Although optional, I would strongly recommend starting the tutorial by creating and activating a new virtual environment, inside which we can do all our pip install ... .

python -m venv audio_env
source audio_env/bin/activate

Loading dataset

As with any data modeling task, we first need to load the dataset (that we will pass to the Trainer class) using the datasets library.

pip install datasets

Given that we are working with a custom dataset (as opposed to a pre-installed dataset that comes with this library), we first need to write a loading script that loads the dataset in a format acceptable to the Trainer.

I have already covered how to create this script (in excruciating detail) in a previous article. (I strongly recommend going through it to understand the use of config, cache_dir, data_dir, etc. in the snippet below). Each example in the dataset has two features: file and label.

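import os
from datasets import load_dataset
dataset_config = {
    "CONFIG_NAME": "clean",
    "DATA_DIR": os.path.join(PROJECT_ROOT, "data/"),
    "CACHE_DIR": os.path.join(PROJECT_ROOT, "cache_crema"),
}
LOADING_SCRIPT = "path/to/loading_script.py"  # placeholder: the script name is missing from this copy
ds = load_dataset(LOADING_SCRIPT, name=dataset_config["CONFIG_NAME"],
                  data_dir=dataset_config["DATA_DIR"], cache_dir=dataset_config["CACHE_DIR"])
print(ds)
********* OUTPUT ********
DatasetDict({
    train: Dataset({
        features: ['file', 'label'],
        num_rows: 7442
    })
})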

P.S.: While we created a datasets.Dataset object for the CremaD dataset (to be passed to the Trainer class), it doesn’t necessarily have to be this way. We could also define and use a custom dataset class (similar to the CSVDataset we created in this tutorial).

Writing the model training script

The directory structure in the GitHub repo:

[Directory tree not recoverable; only a data folder entry survives.]

Let’s start writing our script.


Experiment Tracking (optional)

We are using Weights & Biases for experiment tracking, so make sure you have created an account and then update USER and WANDB_PROJECT as per your details.

Loading feature extractor

Question: Broadly speaking, what is a feature extractor?
Answer: A feature extractor is a class in charge of preparing input features for a model. For instance, for images this can include cropping and padding; for audio it can include converting raw audio to spectrogram features, applying normalization, padding, etc.

Example of feature extractor for image data:
>>> from transformers import ViTFeatureExtractor
>>> vit_extractor = ViTFeatureExtractor()
>>> print(vit_extractor)

ViTFeatureExtractor {
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "ViTFeatureExtractor",
  "image_mean": [0.5, 0.5, 0.5],
  "image_std": [0.5, 0.5, 0.5],
  "resample": 2,
  "size": 224
}

More specifically, we will be using Wav2Vec2FeatureExtractor. This is a derived class from SequenceFeatureExtractor which is a general-purpose feature extraction class for speech recognition made available by Huggingface.

There are three ways to use the Wav2Vec2FeatureExtractor:

  • Option 1 — Use the defaults.
from transformers import Wav2Vec2FeatureExtractor
feature_extractor = Wav2Vec2FeatureExtractor()
print(feature_extractor)
**** OUTPUT ****
Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}
  • Option 2 —Modify any of the Wav2Vec2FeatureExtractor parameters to create your custom feature extractor.
from transformers import Wav2Vec2FeatureExtractor
feature_extractor = Wav2Vec2FeatureExtractor(
    sampling_rate=24000,
    truncation=True,
)
print(feature_extractor)
**** OUTPUT ****
Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 24000,
  "truncation": true
}


Option 3: Because we aren’t looking for any customization, we can just use the from_pretrained() method to load a pretrained model’s default feature extractor parameters (usually stored in a file called preprocessor_config.json). Since we will be using the facebook/hubert-base-ls960 as our base model, we can get its feature extractor parameters (available for visual inspection here under preprocessor_config.json).

from transformers import Wav2Vec2FeatureExtractor
model = "facebook/hubert-base-ls960"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model)
print(feature_extractor)
*** OUTPUT ***
Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}

To see the feature extractor in action, let’s feed a dummy audio file as raw_speech to Wav2Vec2FeatureExtractor:

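import librosa
from transformers import Wav2Vec2FeatureExtractor
model_id = "facebook/hubert-base-ls960"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
audio_file = "dummy1.wav"
audio_array = librosa.load(audio_file, sr=16000, mono=False)[0]
input = feature_extractor(
    raw_speech=audio_array, sampling_rate=16000, padding=True, return_tensors="pt"
)
print(input)
print(input.input_values.shape)
print(audio_array.shape)
***** OUTPUT ******
{'input_values': tensor([[-0.0003, -0.0003, -0.0003, ..., 0.0006, -0.0003, -0.0003]])}
torch.Size([1, 36409])
(36409,)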
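
The do_normalize flag in the config above corresponds to scaling each waveform to zero mean and unit variance. Here is a minimal NumPy sketch of that step, assuming a small epsilon for numerical stability. The function name and the synthetic waveform are illustrative and not part of the library:

```python
import numpy as np


def zero_mean_unit_var(waveform: np.ndarray, eps: float = 1e-7) -> np.ndarray:
    """Scale a 1-D waveform to zero mean and unit variance."""
    return (waveform - waveform.mean()) / np.sqrt(waveform.var() + eps)


# Synthetic stand-in for the array returned by librosa.load (assumption: mono audio).
rng = np.random.default_rng(0)
audio_array = rng.normal(loc=0.1, scale=0.05, size=16000).astype(np.float32)

normalized = zero_mean_unit_var(audio_array)

# The shape is preserved; only the offset and scale of the samples change.
print(normalized.shape == audio_array.shape)
```

This is why the extractor output and the raw array share the same length: no samples are added or dropped at this stage.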

Few things to note:

  • The output of the feature extractor is a dictionary containing input_values. Its value is nothing but normalization applied to audio_array i.e. the output from the librosa library. In fact, both input.input_values and audio_array will have the same shape.
  • When calling the feature extractor, make sure you are using the same sampling_rate as the one used by the base model for its training dataset. We are using this facebook model for training and its model card explicitly states to sample speech input at 16 kHz.
  • return_tensors can take values “pt”, “tf”, and “np” for PyTorch tensors, TensorFlow objects, and NumPy arrays respectively.
  • padding doesn’t make much sense in the case of a single audio file but when we are batch processing, it does, as it pads the shorter audios (with extra 0s or -1s) to have the same length as the longest audio. Here’s an example of padding audio files with varying lengths:
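audio_file_1 = "dummy1.wav"
audio_file_2 = "dummy2.wav"
audio_array_1 = librosa.load(audio_file_1, sr=16000, mono=False)[0]
audio_array_2 = librosa.load(audio_file_2, sr=16000, mono=False)[0]
input_with_one_audio = feature_extractor(
    audio_array_1, sampling_rate=16000, padding=True, return_tensors="pt"
)
input_with_two_audio = feature_extractor(
    [audio_array_1, audio_array_2], sampling_rate=16000, padding=True, return_tensors="pt"
)
print(input_with_one_audio.input_values.shape)
print(input_with_two_audio.input_values.shape)
***** OUTPUT ****
torch.Size([1, 36409])
torch.Size([2, 37371])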

Now that we know the output from the feature extractor model can vary in shape — depending on the input audio — it might become clear why padding is important before pushing a batch of input to a model for training. When dealing with batches, we can either (a) pad all audios to the length of the longest audio in the train set or (b) truncate all audios to a maximum length. The problem with (a) is that we are unnecessarily increasing the memory overhead to store these extra padded values. The problem with (b) is that there might be some information loss due to truncation.

There exists a better alternative — apply dynamic padding during model training using data collators. We will see them in action shortly!

At the time of building batches (for training), data collators can apply preprocessing (such as padding) to only that particular batch of inputs.
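
To make the idea of per-batch padding concrete, here is a small NumPy sketch of what a collator does to the waveforms of one batch. The helper name pad_batch is hypothetical, and the dummy arrays simply reuse the two lengths from the output above:

```python
import numpy as np


def pad_batch(arrays, padding_value=0.0):
    """Right-pad 1-D arrays to the length of the longest array in this batch only."""
    max_len = max(len(a) for a in arrays)
    batch = np.full((len(arrays), max_len), padding_value, dtype=np.float32)
    for row, a in enumerate(arrays):
        batch[row, : len(a)] = a
    return batch


# Two dummy waveforms with the lengths seen in the output above.
short = np.ones(36409, dtype=np.float32)
long = np.ones(37371, dtype=np.float32)

print(pad_batch([short]).shape)        # (1, 36409)
print(pad_batch([short, long]).shape)  # (2, 37371)
```

Because the target length is computed per batch, a batch of short clips is never padded up to the longest clip in the whole training set.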

Loading Base Model for Classification

As mentioned previously, we will use Facebook’s Hubert model to classify audios. If you are curious about the inner workings of HuBERT, check out this awesome introductory tutorial to HuBERT by Jonathan Bgn.

The bare HubertModel is a stack of transformer encoder layers (24 in the large checkpoint used in the snippet below; the base checkpoint has 12) and outputs raw hidden states, without any specific head on top for classification.

bare_model = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft")
last_hidden_state = bare_model(input.input_values).last_hidden_state
print(last_hidden_state.shape)
*** OUTPUT ***
torch.Size([1, 113, 1024])  # 113 is the sequence length, which varies with the audio; 1024 is the hidden size
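
The 113 in the output above follows from the convolutional feature encoder that sits in front of the transformer layers. The sketch below assumes the default kernel sizes and strides from HubertConfig; other checkpoints may differ, so treat these constants as an assumption:

```python
# Kernel sizes and strides of the convolutional feature encoder
# (assumed to be the HubertConfig defaults).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Number of frames the conv stack produces for a waveform of num_samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        # Standard output-length formula for an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


print(num_frames(36409))  # 113, matching the sequence dimension in the output above
```

The strides multiply to 320, so at 16 kHz each frame covers roughly 20 ms of audio.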

We need some sort of classification head on top of this bare model which can take the last hidden layer’s output and feed it into a linear layer that finally outputs 6 values (one for each of the six emotion classes). This is exactly what HubertForSequenceClassification does. It has a classification head on top for tasks like audio classification.

However, similar to the feature extractor config explained above, if you get the default configuration for HubertForSequenceClassification from the pretrained model, you’ll notice that it will only work for binary classification tasks because of the way its default config is defined.

model_path = "facebook/hubert-large-ls960-ft"
hubert_model = HubertForSequenceClassification.from_pretrained(model_path)
hubert_model_config = hubert_model.config
print("Num of labels:", hubert_model_config.num_labels)
**** OUTPUT ****
Num of labels: 2

For our 6-class classification, we need to update the config to be passed to the Hubert model using PretrainedConfig (check out the section — Parameters for fine-tuning).

Few things to note:

  • from_pretrained() loads both the model architecture and the model weights (i.e. weights for all transformer encoder layers + linear classifier) from facebook/hubert-base-ls960.
    Note: If you instead instantiated the model directly from a config, e.g. hubert_model = HubertForSequenceClassification(config), the transformer encoder and classifier weights would be initialized randomly.
  • Setting ignore_mismatched_sizes argument to True is important because without it you’ll get an error (see image below) due to size mismatch — the classifier weights available as part of facebook/hubert-base-ls960 have shape (2, classifier_proj_size) whereas according to our newly defined config our weights should have a shape (6, classifier_proj_size). Given that we are going to retrain the linear classifier layer from scratch anyway, we can choose to ignore the mismatched sizes.
Error due to classifier size mismatch
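
The config update described above can be sketched roughly as follows. This is not the original snippet: AutoConfig is used here as a stand-in for the PretrainedConfig call mentioned in the text, and the helper name load_classifier is hypothetical.

```python
from transformers import AutoConfig, HubertForSequenceClassification

MODEL_ID = "facebook/hubert-base-ls960"
NUM_LABELS = 6  # one output per emotion class


def load_classifier(model_id: str = MODEL_ID, num_labels: int = NUM_LABELS):
    """Load pretrained HuBERT weights with a classification head sized for num_labels."""
    # Keyword arguments given to from_pretrained override fields of the stored config.
    config = AutoConfig.from_pretrained(model_id, num_labels=num_labels)
    return HubertForSequenceClassification.from_pretrained(
        model_id,
        config=config,
        ignore_mismatched_sizes=True,  # tolerate a stored head of a different shape
    )
```

Calling load_classifier() downloads the checkpoint on first use; afterwards, model.config.num_labels should report 6.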

Freezing layers for fine-tuning

As a general rule of thumb, if the underlying dataset on which the pre-trained model was trained is significantly different from the dataset you are working with, it’s better to unfreeze and retrain a few layers at the top.

To begin with, we are unfreezing the weights of the top two encoder layers (closest to the classification head) whilst keeping weights for all other layers frozen. To freeze/unfreeze weights, we set param.requires_grad to False/True where param refers to the model parameters.

Unfreezing weights during model training means these weights will be updated as usual so that they can reach their optimal value for the task at hand.

Note: While it might seem intuitive to begin unfreezing many layers right at the very beginning of the training process, this is not what I would recommend. I actually began my experiments by freezing all layers and training only the classifier head. Because the results were poor (no surprise there), I resumed training by unfreezing two layers.
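
The freezing pattern can be tried out on a toy PyTorch module before applying it to the real model. ToyEncoderClassifier below is a hypothetical stand-in (a few linear "encoder layers" plus a head), not the HuBERT architecture; only the requires_grad mechanics carry over.

```python
import torch.nn as nn


class ToyEncoderClassifier(nn.Module):
    """Hypothetical stand-in: a stack of encoder layers plus a classification head."""

    def __init__(self, num_layers: int = 4, hidden: int = 8, num_labels: int = 6):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_layers)])
        self.classifier = nn.Linear(hidden, num_labels)


def unfreeze_top_layers(model: ToyEncoderClassifier, num_unfrozen: int = 2) -> None:
    """Freeze everything, then unfreeze the last num_unfrozen layers and the head."""
    for param in model.parameters():
        param.requires_grad = False
    first_unfrozen = len(model.layers) - num_unfrozen
    for layer in model.layers[first_unfrozen:]:
        for param in layer.parameters():
            param.requires_grad = True
    for param in model.classifier.parameters():
        param.requires_grad = True


model = ToyEncoderClassifier()
unfreeze_top_layers(model)

# Only the top two layers and the classifier remain trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

On the Hugging Face model, the encoder layers are, as far as I know, reachable as model.hubert.encoder.layers; it is worth verifying that attribute path against your installed version.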

Loading dataset

Using our custom loading script, we can now load our dataset using load_dataset() method from the datasets library.

Next, we use map() to convert all raw audios (in .wav format) in our dataset into arrays.

Generally speaking, map applies a function repeatedly to all rows/samples in the dataset.

Here, the function (defined as a lambda function) takes a single parameter x (corresponding to a row in the dataset) and converts the audio file in that row into an array using librosa.load(). As explained above, make sure the sampling rate (sr) is appropriate.

Note: If you were to do print(ds) at this stage, you will notice three features in the dataset:

print(ds)
***** OUTPUT *****
DatasetDict({
    train: Dataset({
        features: ['file', 'label', 'array'],
        num_rows: 7442
    })
})

Once we have generated the arrays, we are going to use map once again — this time to prep the input using a helper function prepare_dataset().

prepare_dataset is a helper function that applies a processing function to each example (or set of examples, i.e. a batch) in a dataset. More specifically, the function does two things: (1) it reads the audio array present at batch["array"], extracts features from it using the feature_extractor discussed above, and stores them as a new feature called input_values (in addition to file, label, and array); and (2) it creates a new feature called labels whose value is the same as batch["label"].

Question: You may be wondering what’s the point of having both label and labels for each example especially when they have identical values.
Reason: The Trainer API will look for the column name labels by default, so we are simply obliging. If you wish, you can remove the other label column altogether at this step or, better yet, name the feature "labels" at the time of creating the loading script.

If you notice closely, unlike the previous map use case where the lambda function only took one input parameter, prepare_dataset() requires two parameters.

Remember: Anytime we need to pass more than one argument to the function inside map, we will have to pass fn_kwargs argument to map. This argument is a dictionary containing all parameters to be passed to the function.

Based on its function definition, we need two arguments for prepare_dataset(): (a) the row in the dataset and (b) the feature extractor. So we will have to make use of fn_kwargs as follows:

Next, we are going to convert all string labels into ids (0, 1, 2, 3, 4, 5) using class_encode_column().

Finally, we introduce train-test-validation splits by using train_test_split(). We need to split in this manner twice to get three non-overlapping datasets, all of which are then combined into a single DatasetDict.

Let the Training begin…

With all bits and pieces in place, we are now ready to start training using the Trainer class.

First, we need to specify the training arguments — this includes the number of epochs, batch size, directory for storing trained models, experiment logging, etc.

Few things to consider:

  • The gradient accumulation step is super useful in situations where you want to push a bigger batch size during training but your memory is limited. Setting gradient_accumulation_steps=4 allows us to update weights after every 4 steps — in each step, batch_size=32 number of samples gets processed and their gradients are accumulated. Only after 4 steps when enough gradients are accumulated, do weights get updated.
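
To see why accumulating gradients over several small batches behaves like one larger batch, here is a toy PyTorch check. It uses 4 micro-batches of 2 samples rather than the 4 x 32 from the text, and the linear model is purely illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 1)
accumulation_steps = 4  # 4 micro-batches of 2 samples each

# Reference: gradient from one pass over the full batch of 8.
model.zero_grad()
loss_fn(model(x), y).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulated: backward() adds into .grad, so gradients sum across micro-batches.
model.zero_grad()
for x_micro, y_micro in zip(x.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    loss = loss_fn(model(x_micro), y_micro) / accumulation_steps  # keep the mean scale
    loss.backward()
accumulated_grad = model.weight.grad.clone()

# An optimizer.step() would be called here, once per 4 micro-batches.
print(torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6))
```

Only one micro-batch has to fit in memory at a time, which is the whole point of the technique.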

Second, we instantiate the Trainer class with these training arguments, in addition to specifying the train and evaluation datasets.
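
A rough sketch of how the training arguments and the Trainer might be wired together, using the hyperparameters quoted elsewhere in this article (5 epochs, batch size 32, gradient accumulation 4, save_steps=100). The build_trainer helper and its parameter names are hypothetical, and this is not the original snippet:

```python
from transformers import Trainer, TrainingArguments


def build_trainer(model, train_ds, eval_ds, data_collator, compute_metrics,
                  output_dir="results"):
    """Bundle training arguments, datasets, collator, and metrics into a Trainer."""
    args = TrainingArguments(
        output_dir=output_dir,            # checkpoints are written here
        num_train_epochs=5,
        per_device_train_batch_size=32,
        gradient_accumulation_steps=4,    # effective batch size of 32 * 4 = 128
        save_strategy="steps",
        save_steps=100,                   # save more often than the default of 500
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )
```

With a trainer built this way, training itself is the single call trainer.train().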

Few things to consider:

  • We pass a data_collator to the Trainer. We talked about this briefly at the beginning of the tutorial as a means of dynamically padding the input audio arrays. The data collator is initialized as follows:

DataCollatorCTCWithPadding is a dataclass that has been adapted from this tutorial. I highly recommend giving the Setup Trainer section from the tutorial a quick read to understand what’s happening in this class.

Without going into too much detail, the __call__ method within this class is in charge of prepping the input received. It takes a batch of examples from the dataset (remember each example has 5 features: file, labels, label, array, input_values) and returns the same batch but with padding applied to input_values using processor.pad. Also, labels in the batch are converted into PyTorch tensors.

  • We have also defined compute_metrics(), which tells the Trainer which metrics (accuracy, precision, F1, recall, etc.) must be calculated during evaluation. It takes as input the evaluation predictions (eval_pred) and compares actual labels vs. predicted labels using metric.compute(predictions=.., references=...). Again, the boilerplate code for compute_metrics() has been adapted from here.

Note: If you feel like being creative and displaying custom metrics (for instance, log of the absolute difference between actual and predicted values), you can modify compute_metrics(). All you need to know before doing so is what is returned by eval_pred. This can be found in advance by running trainer.predict on your eval/test dataset before actually training the model. In our case, it returns actual labels and predictions (i.e. logits, on which you apply the argmax function to get the predicted class):

trainer = Trainer(model=..., args=...,...)
output = trainer.predict(ds["test"])
print(output)
**** OUTPUT *****
PredictionOutput(predictions=array([
[ 0.0331, -0.0193, -0.98767, 0.0229, 0.01693, -0.0745],
[-0.0445, 0.0020, 0.13196, 0.2219, 0.94693, -0.0614],
], dtype=float32),
label_ids=array([0, 5, ......]),
metrics={'test_loss': 1.780486822128296, 'test_accuracy': 0.0, 'test_runtime': 1.6074, 'test_samples_per_second': 1.244, 'test_steps_per_second': 0.622})
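
Given predictions shaped like the ones above, a bare-bones compute_metrics() could look like the following. This sketch computes plain accuracy with NumPy instead of the metric.compute(...) call used in the article, so treat it as a swapped-in simplification:

```python
import numpy as np


def compute_metrics(eval_pred):
    """Turn logits into class ids with argmax and report plain accuracy."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# The two logit rows shown above, with their label ids.
logits = np.array([
    [0.0331, -0.0193, -0.98767, 0.0229, 0.01693, -0.0745],
    [-0.0445, 0.0020, 0.13196, 0.2219, 0.94693, -0.0614],
], dtype=np.float32)
labels = np.array([0, 5])

print(compute_metrics((logits, labels)))  # {'accuracy': 0.5}
```

The first row peaks at index 0 (a match) and the second at index 4 (a miss against label 5), hence the accuracy of 0.5.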

And now the actual training with one line of code

Quick detour: The training command can also continue training from a checkpoint. But first, what’s a checkpoint?

During the course of training, the Trainer will create snapshots of the model weights and store them in the output_dir defined in TrainingArguments(output_dir="results"). These folders are usually named as checkpoint-XXXX and contain model weights, training args, etc.
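
To illustrate the checkpoint-XXXX naming convention, here is a standard-library sketch that finds the most recent checkpoint folder inside an output directory. The helper name latest_checkpoint is hypothetical; transformers ships a similar utility (get_last_checkpoint in transformers.trainer_utils).

```python
import re
import tempfile
from pathlib import Path

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint-XXXX folder with the highest step number, or None."""
    candidates = []
    for path in Path(output_dir).iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if path.is_dir() and match:
            # Compare step numbers as integers so that 1000 ranks above 200.
            candidates.append((int(match.group(1)), path))
    return max(candidates)[1] if candidates else None


# Demo on a throwaway directory that mimics an output_dir.
with tempfile.TemporaryDirectory() as tmp:
    for step in (100, 200, 1000):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    newest_name = latest_checkpoint(tmp).name

print(newest_name)  # checkpoint-1000
```

The resulting path can be passed as trainer.train(resume_from_checkpoint=...). Passing resume_from_checkpoint=True instead lets the Trainer pick up the last checkpoint in output_dir on its own.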


You can specify when and how often you’d like to create these checkpoints, using save_strategy and save_steps, respectively. By default, checkpoints will be saved after every 500 steps (save_steps=500). The reason I am mentioning this is because I wasn’t aware of these default values and during one of the training sessions (that ran for 7 hours), I saw that none of the checkpoints were getting created in the output directory. This was the config I was working with:

  • Training samples: 6697
  • Num epochs = 5
  • Batch size = 32
  • Gradient Accumulation Step = 4

After hours and hours of debugging, I found that the total steps in my case were only 260 whereas the default saving happened only after the 500th step. 🤦‍♀ Setting save_steps = 100 as part of TrainingArguments() fixed this for me.

At the bottom, one can find the total number of optimization steps

Note: In case you’re wondering how to calculate the total steps (i.e. 260 in this case):

Total train batch size = Batch size * Gradient Accumulation Step = 32 * 4 = 128
Total optimization steps = (Training samples / Total train batch size) * epochs = (6697 / 128) * 5 ≈ 260.
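
The same arithmetic as a small helper. It assumes a single device and that a trailing, incomplete accumulation group does not count as an optimizer step; the exact rounding may differ between library versions, so this is an estimate and not the Trainer's own formula.

```python
import math


def total_optimization_steps(num_samples, batch_size, grad_accum_steps, epochs):
    """Estimate optimizer updates: batches per epoch, grouped by accumulation steps."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * epochs


print(total_optimization_steps(6697, 32, 4, 5))  # 260
```

Running this before a long training session makes it easy to pick a save_steps value smaller than the total step count.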

Making predictions and logging results on the test set

Few things to consider:

  • To log any additional metrics/variables to Weights & Biases, we can use wandb.log(), for instance to log the test set accuracy.
  • By default, wandb does not log the trained model, so it’s only available on your local machine after training finishes. To explicitly store the model artifacts, we need to save them with policy="end", meaning the file is only synced when the run finishes.

Results and Reflections

Results from all the different model runs with varying hyperparameter combinations are logged in my Weights and Biases dashboard here.

WandB Dashboard

Looking at the learning curves, it looks like our most recent run (faithful-planet-28 with test accuracy = 68% — not too bad considering it took only 4 hours of training) might benefit from additional epochs as both train and eval losses are still decreasing and haven’t plateaued out (or worse, started diverging). Depending on whether or not that works out, more encoder layers might need to be unfrozen.

Learning curves

Few reflections and learnings:

  • If we are increasing epochs, it might be worth considering early-stopping the training via a callback. See this Stack Overflow discussion for details.
callbacks=[EarlyStoppingCallback(early_stopping_patience = 10)]
  • In addition to CUDA, Trainer has recently added support for working with new Mac M1 GPUs (simply set args = TrainingArguments(use_mps_device=True)). If you are working with them, please note that some people have reported a drop in metrics (this is a known bug; see this issue).


Hopefully, you are now feeling a bit more confident fine-tuning deep learning models using the transformers library. If you do take this project forward, please share your results (and steps for improving accuracy) with me and the wider community.

As with any ML project, attention to responsible AI development is crucial to assess the impact of the work going forward. This becomes even more important given recent works have suggested emotion detection approaches can have built-in gender/racial biases and might cause real-world harm. Moreover, if you’re dealing with sensitive audio data (say customer support calls containing credit card details), please apply desensitization techniques to protect personally identifiable information, sensitive personal data, or business data.

As always if there’s an easier way to do/explain some of the things mentioned in this article, do let me know. In general, refrain from unsolicited destructive/trash/hostile comments!

Until next time ✨

