Audio Clustering with Deep Learning



1. Introduction

Deep neural networks are popular for image processing and NLP tasks. Recently, however, research on audio tasks using deep learning has surged. Some of these techniques have been adapted from image processing, but audio is quite different: it is a one-dimensional time-series signal rather than a two-dimensional image. Deep learning methods that take audio as input matter because audio is a prevalent medium in our daily lives.

In this project, the main objective was to train a deep neural network for feature extraction, clustering, or both. The aim was to use this network to extract features from audio samples and cluster them into 20 clusters.

For this project, I converted the audio samples into spectrograms and saved them as images. A sample spectrogram can be seen below.

2. Architecture

I conducted a total of 50 experiments with various learning rates, batch sizes, model structures, layers, types of layers, etc. The final model structure is shown in Figure 1. I converted the audio samples into spectrograms, saved them as images, and used an autoencoder to compress and reconstruct the images.

The autoencoder consists of an encoder and a decoder. Both are block-based, i.e., I split each original image (spectrogram) into equal-sized blocks of shape (144, 144, 3) and fed those to the encoder. The encoder compresses each block to shape (9, 9, 1), a compression ratio of about 0.0013. Since there were 671 samples, the dataset consisted of 4026 blocks in total, i.e., six blocks per spectrogram.
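For illustration, here is a minimal sketch of the block-splitting step, assuming the saved spectrogram images have dimensions that are exact multiples of 144; the function name and NumPy-based implementation are my own, not from the original notebook:

```python
import numpy as np

def split_into_blocks(image, block_size=144):
    """Split an (H, W, 3) spectrogram image into equal (144, 144, 3) blocks.

    Assumes H and W are exact multiples of block_size, consistent with
    671 images producing 4026 blocks (six per image).
    """
    h, w, _ = image.shape
    blocks = []
    for i in range(0, h, block_size):
        for j in range(0, w, block_size):
            blocks.append(image[i:i + block_size, j:j + block_size, :])
    return np.stack(blocks)  # shape: (num_blocks, 144, 144, 3)
```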

I then trained the autoencoder to compress and reconstruct the spectrograms, and used the trained encoder to predict on the original spectrograms. Its output is the compressed version of each sample, which serves as the extracted features. I clustered these features into 20 possible clusters using the K-Means algorithm.
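A sketch of this feature-extraction-plus-clustering step is below, assuming a trained Keras `encoder` and a `blocks` array ordered sample by sample (six blocks per sample); the variable names are hypothetical:

```python
from sklearn.cluster import KMeans

# `encoder` is the trained encoder half of the autoencoder;
# `blocks` is the (4026, 144, 144, 3) array of spectrogram blocks.
codes = encoder.predict(blocks)            # (4026, 9, 9, 1) compressed features
codes = codes.reshape(671, -1)             # six 81-dim block codes per sample -> (671, 486)

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0)
labels = kmeans.fit_predict(codes)         # one of 20 cluster ids per audio sample
```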

Figure 1: Model Structure

3.1 Dataset, Implementation Details & Spectrograms

The dataset consisted of 671 audio samples, each 5 seconds long. No labels were given, as the audios had to be clustered into 20 possible clusters. Training was done as showcased in the notebook on a GPU and took about twenty minutes on average per experiment. Each model trained for 180 to 200 epochs and stopped based on early stopping, which monitored the mean squared error loss with a minimum delta of 0.009 and a patience of 15.
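In Keras, the early-stopping setup described above might look like the following sketch. The optimizer and `restore_best_weights` flag are assumptions; the loss, delta, patience, epoch range, and batch size come from the text (the batch size is discussed in Section 3.2):

```python
from tensorflow import keras

# Assumes a compiled Keras `autoencoder` and the (4026, 144, 144, 3)
# array of blocks used as both input and reconstruction target.
autoencoder.compile(optimizer="adam", loss="mse")

early_stop = keras.callbacks.EarlyStopping(
    monitor="loss",      # mean squared error reconstruction loss
    min_delta=0.009,     # minimum improvement reported in the text
    patience=15,         # patience reported in the text
    restore_best_weights=True,
)

history = autoencoder.fit(
    blocks, blocks,
    epochs=200,          # runs typically stopped after 180-200 epochs
    batch_size=4,        # batch size that worked best (see Figure 6)
    callbacks=[early_stop],
)
```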

There are many ways to import and use audio data in machine learning. I chose to convert each audio sample into a spectrogram and save it as an image, then used those images to train the autoencoder to compress and reconstruct the spectrograms.

I used the encoder to predict on the original spectrograms, producing the compressed version of the samples and thereby extracting important features, which I clustered using the K-Means algorithm. I tried a linear spectrogram, a log spectrogram, and a Mel spectrogram, as can be seen in Figure 2.
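The three variants can be produced with librosa, for example; the file name and parameters below are illustrative, not the author's exact settings:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load a 5-second audio sample (file name is a placeholder).
y, sr = librosa.load("sample.wav", duration=5.0)

spec = np.abs(librosa.stft(y)) ** 2                   # linear power spectrogram
log_spec = librosa.power_to_db(spec)                  # log (dB) spectrogram
mel_spec = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr))       # Mel spectrogram (dB)

# Save each variant as an image, as done for the training data.
for name, s in [("linear", spec), ("log", log_spec), ("mel", mel_spec)]:
    fig, ax = plt.subplots()
    librosa.display.specshow(s, sr=sr, ax=ax)
    fig.savefig(f"{name}_spectrogram.png")
    plt.close(fig)
```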

The Mel spectrogram seemed like the best choice, as its NMI score was the highest. Figure 3 showcases the clustering NMI score for each set of spectrograms used. Since the Mel spectrograms gave the highest score, I chose to conduct the rest of the experiments with them.
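Since NMI needs a reference labeling, computing it presumably relied on ground-truth labels for an evaluation subset. A minimal sketch with scikit-learn, where `true_labels` and the predicted `labels` are hypothetical arrays:

```python
from sklearn.metrics import normalized_mutual_info_score

# Compare predicted cluster ids against reference labels for the
# evaluation subset; NMI is 1.0 for a perfect match, 0.0 for none.
nmi = normalized_mutual_info_score(true_labels, labels)
print(f"NMI: {nmi:.3f}")
```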

Figure 2: Spectrograms
Figure 3: NMI score for each spectrogram type

3.2 Hyperparameter Tuning & Results

I tried various model structures for both the encoder and the decoder, with three main cases for each. Figures 4 and 5 showcase the NMI score (on 50% of the set) for each layer choice for the encoder and decoder respectively; each case is described below the respective figure. Since case 3 worked best, I chose max-pooling and upsampling layers for compression and reconstruction, and convolutional and deconvolutional layers for increasing and decreasing the number of feature maps.
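A minimal Keras sketch of this "case 3" layout is below. Only the (144, 144, 3) input and (9, 9, 1) code shapes come from the text; the filter counts, activations, and exact layer ordering are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

inp = keras.Input(shape=(144, 144, 3))

# Encoder: Conv2D grows feature maps, MaxPooling2D compresses spatially.
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
x = layers.MaxPooling2D(2)(x)                       # 72x72
x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling2D(2)(x)                       # 36x36
x = layers.Conv2D(4, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling2D(2)(x)                       # 18x18
x = layers.Conv2D(1, 3, activation="relu", padding="same")(x)
code = layers.MaxPooling2D(2)(x)                    # (9, 9, 1) compressed code

# Decoder: Conv2DTranspose shrinks feature maps back, UpSampling2D reconstructs.
x = layers.Conv2DTranspose(4, 3, activation="relu", padding="same")(code)
x = layers.UpSampling2D(2)(x)                       # 18x18
x = layers.Conv2DTranspose(8, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)                       # 36x36
x = layers.Conv2DTranspose(16, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)                       # 72x72
x = layers.Conv2DTranspose(16, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)                       # 144x144
out = layers.Conv2DTranspose(3, 3, activation="sigmoid", padding="same")(x)

encoder = keras.Model(inp, code)
autoencoder = keras.Model(inp, out)
```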

I also tried various batch sizes; a batch size of 4 worked best, as shown in Figure 6. Figure 7 shows the training loss for the autoencoder model used. Figure 8 showcases the progressive improvement as I conducted more experiments, tuned the parameters, and altered the structure based on the performance of the previous experiments.

Figures 4 and 5: NMI scores for the encoder and decoder layer choices
