What is sound?

Machine Learning and data analysis on sound is a growing domain with a huge potential for Data Science use cases. In this article, you will learn what sound is and how it can be represented on a computer.

You will also learn what a Fourier Transform is and how it works. You will then see how you can use the Fourier Transform to prepare Audio Data for Machine Learning or data analysis, with a fully worked-out example in Python.

What is Audio Data?

You are probably familiar with tabular data: the standard form of storing data in rows and columns. This is the traditional way of storing data and it is very suitable for Machine Learning problems.

The next step of difficulty is image data. Image data is more difficult to deal with because it does not come as rows and columns of features. However, computers are able to show images by giving values to each small square (pixel) inside an image. By giving each pixel a Red, Green, and Blue value, computers can show color images.

For Machine Learning on images, we replicate this logic: we store the pixel values (Red, Green, and Blue) in a three-dimensional array whose dimensions are image height, image width, and the three color channels. If you want to learn more about machine learning on images, object detection and image segmentation are good topics to look into.
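As a small illustration of that layout, here is a sketch that builds a tiny image by hand. It assumes NumPy (the article does not name a library for images), and the 2×3 size and pixel colors are made up for the example.

```python
import numpy as np

# A hypothetical 2x3 image: the shape is (height, width, color channels).
image = np.zeros((2, 3, 3), dtype=np.uint8)

image[0, 0] = [255, 0, 0]      # top-left pixel: pure red
image[0, 1] = [0, 255, 0]      # next pixel on the top row: pure green
image[1, 2] = [255, 255, 255]  # bottom-right pixel: white

height, width, channels = image.shape
print(height, width, channels)  # 2 3 3
```

Every pixel is addressed by its row and column, and holds three numbers between 0 and 255.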

The next step of difficulty is sound (or audio) data. As a very basic description, let’s say that Audio Data is a numerical way to store sounds.

Sound is Air Pressure

Audio Data is hard to work with because, unlike tabular data and image data, it does not follow a very clear and organized structure.

Sound, in its rawest (non-digital) form, is a variation in air pressure that the human ear can detect.

From Analog Sound to Digital Audio: Microphone

The digital form of sound is audio. To store sound in a computer, we need to convert it into something digital: something that a computer can store.

There are two steps in converting sound to numbers:

  • A Microphone converts air pressure variations into voltage. You may have never realized it, but this is actually all that a microphone does.
  • An Analog-to-Digital Converter receives the voltage (see it as different ‘intensities’ of the electrical signal) and measures it many times per second, converting each measurement to a number. These numbers are the digital values of the wave and, unlike air pressure, they can be stored in a computer.
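To get a feeling for the second step, here is a rough sketch of what an Analog-to-Digital Converter does, assuming NumPy. The "voltage" is a made-up smooth signal, and the sampling rate of 8000 measurements per second and the 16-bit resolution are illustrative choices, not values from the article.

```python
import numpy as np

sample_rate = 8000   # measurements per second (assumed for illustration)
n_samples = 80       # 80 measurements = 0.01 seconds at this rate

# Stand-in for the microphone's voltage: a smooth oscillation between -1 and 1.
t = np.arange(n_samples) / sample_rate
voltage = np.sin(2 * np.pi * 440 * t)

# The converter rounds each measurement to one of 2**16 integer levels (16-bit audio).
samples = np.round(voltage * 32767).astype(np.int16)

print(len(samples))   # 80
print(samples[:5])    # the first five stored numbers
```

The array `samples` is the kind of data an audio file stores: a long list of integers, one per measurement.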

Audio Data is Amplitude and Frequency

You now have a first understanding of how to convert sound to numbers, but the next real question to ask is: what are those numbers that represent sound?

It turns out that there are two important types of numbers: amplitude and frequency.

  • The amplitude of a sound wave determines its volume (loudness).
  • The frequency of a sound wave represents its pitch.

You now understand that a single (pure) sound wave is described by two fundamental values: an amplitude and a frequency. Together, they can make tones from loud to quiet, and from low to high pitch.
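The sketch below, assuming NumPy, generates two such pure tones. The helper name `pure_tone` and the chosen amplitudes and frequencies are hypothetical; they only illustrate how the two values shape the wave.

```python
import numpy as np

def pure_tone(amplitude, frequency, duration=1.0, sample_rate=22050):
    """Return the samples of a sine wave with the given amplitude and frequency (in Hz)."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    return amplitude * np.sin(2 * np.pi * frequency * t)

quiet_low = pure_tone(amplitude=0.2, frequency=220)   # quiet tone, low pitch
loud_high = pure_tone(amplitude=0.9, frequency=880)   # loud tone, high pitch

# The amplitude is the peak height of the wave.
print(np.abs(quiet_low).max(), np.abs(loud_high).max())
```

Changing `amplitude` only stretches the wave vertically, while changing `frequency` changes how many oscillations fit in one second.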

Real-life sounds are ‘composite’ waves

Now that you understand what one sound wave is, let’s move on to a more complex and more realistic situation. In reality, there are often many sounds at the same time in an audio file. For example, in music, you generally hear multiple tones at the same time. In nature recordings, you will generally also hear all sorts of animals, wind, and more at the same time.

Audio files that contain just one frequency at a time are very rare. Therefore, if we want to summarize an audio file as data, we will need to describe more than one wave.

A complete sound is a mix of waves, and therefore a mix of frequencies at different amplitudes. In easier terms: sound is a mix of high and low tones at different volumes.

In order to work with such a complex wave, it has to be split into the amplitude of each frequency at each point in time. This is what a spectrogram shows.
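Here is a minimal sketch of both directions, assuming NumPy: three made-up tones are added into one composite wave, and NumPy's fast Fourier transform (an implementation of the Fourier Transform discussed later in this article) recovers the amplitude per frequency. Time is ignored here, since the three tones play for the whole second.

```python
import numpy as np

sample_rate = 22050
t = np.arange(sample_rate) / sample_rate   # one second of time stamps

# Three hypothetical tones, each given as (frequency in Hz, amplitude).
tones = [(220, 0.5), (440, 0.3), (880, 0.1)]

# A composite wave is simply the sum of the individual waves.
wave = sum(a * np.sin(2 * np.pi * f * t) for f, a in tones)

# Split the composite wave back into an amplitude per frequency.
spectrum = np.fft.rfft(wave)
freqs = np.fft.rfftfreq(len(wave), d=1 / sample_rate)
amplitudes = 2 * np.abs(spectrum) / len(wave)

# Report the three strongest frequencies, loudest first.
strongest = np.argsort(amplitudes)[-3:][::-1]
for i in strongest:
    print(f"{freqs[i]:.0f} Hz: amplitude {amplitudes[i]:.2f}")
# 220 Hz: amplitude 0.50
# 440 Hz: amplitude 0.30
# 880 Hz: amplitude 0.10
```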

Spectrograms show all the waves in a sound

Spectrograms are graphs that allow you to depict a sound over time. The graph shows time on the x-axis and frequency on the y-axis. The color indicates the amplitude of the specific frequency at a specific point in time.

Spectrograms are created from a digital “complex” sound using Fourier Transform

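The Fourier Transform is a mathematical method that decomposes a “complex” sound into the frequencies it contains. Applying it to short, successive slices of the sound (the Short-Time Fourier Transform) yields a spectrogram that shows the volume (amplitude) of each frequency throughout time.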
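To show the idea of transforming slice by slice, here is a simplified sketch assuming NumPy. The signal, the slice length of 400 samples, and the use of non-overlapping slices are illustrative choices; real implementations usually let the slices overlap.

```python
import numpy as np

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate   # one second of time stamps

# Stand-in signal: a 200 Hz tone for the first half second, then a 1000 Hz tone.
wave = np.where(t < 0.5,
                np.sin(2 * np.pi * 200 * t),
                np.sin(2 * np.pi * 1000 * t))

frame_length = 400                  # 0.05 seconds per slice
window = np.hanning(frame_length)   # fades each slice in and out

# Cut the signal into consecutive slices, one row per slice.
frames = wave.reshape(-1, frame_length)

# Fourier Transform of every slice; transpose so rows are frequencies, columns are time.
spectrogram = np.abs(np.fft.rfft(frames * window, axis=1)).T
freqs = np.fft.rfftfreq(frame_length, d=1 / sample_rate)

print(spectrogram.shape)   # (201, 20): 201 frequencies, 20 time slices

# The loudest frequency in the first and in the last slice.
dominant = freqs[spectrogram.argmax(axis=0)]
print(f"{dominant[0]:.0f} Hz, then {dominant[-1]:.0f} Hz")   # 200 Hz, then 1000 Hz
```

Each column of `spectrogram` describes one moment in time, which is what the colors of a spectrogram plot encode.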

I don’t want to go into too much detail here, but I encourage you to do some further reading on the Fourier Transform.

Representing sound in Python using librosa

Let’s now make this practical by doing a few manipulations in Python with a real music file. We will use librosa, a popular Python package for working with sound.

To install librosa in Python, you can simply run !pip install librosa in a Jupyter Notebook. This would also be a good use case to try out a Google Colab Notebook.

Note that, when working on sound data, you will generally use .WAV files, as WAV usually stores sound uncompressed. Other familiar sound formats compress the sound: .FLAC does so losslessly, so the decoded values are unchanged, whereas .MP3 is lossy and discards part of the information. Lossy compression may negatively affect the numerical representation, although you can still work with such files.
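The following sketch illustrates what "uncompressed" means in practice. It assumes NumPy and Python's standard-library `wave` module; the file name `tone.wav` and the generated tone are made up for the example.

```python
import os
import tempfile
import wave

import numpy as np

sample_rate = 22050
t = np.arange(sample_rate) / sample_rate

# One second of a 440 Hz tone, stored as 16-bit little-endian integers.
tone = np.round(0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype("<i2")

path = os.path.join(tempfile.gettempdir(), "tone.wav")  # hypothetical file name

# Write the samples as uncompressed 16-bit mono audio.
with wave.open(path, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)            # 2 bytes = 16 bits per sample
    f.setframerate(sample_rate)
    f.writeframes(tone.tobytes())

# Read the file back: the numbers are exactly the ones that were written.
with wave.open(path, "rb") as f:
    restored = np.frombuffer(f.readframes(f.getnframes()), dtype="<i2")

print(np.array_equal(tone, restored))  # True
```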

Librosa provides example music files, which it fetches the first time you ask for them. You can import the example sound file as follows:

What is sound? Loading a music file into Python.

Once you’ve imported this music, you can use a Jupyter Notebook function to listen to the sound. You can use the following code to do so:

What is sound? Playing a sound file in a Jupyter Notebook
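One way to do this, sketched here under the assumption that you are in a Jupyter environment where `IPython.display.Audio` is available. A generated tone stands in for the music so that the example runs on its own; with the music file loaded, you would pass its `y` and `sr` instead.

```python
import numpy as np
from IPython.display import Audio

# Stand-in signal: one second of a 440 Hz tone.
sr = 22050
y = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)

# Audio takes the samples and their sampling rate.
player = Audio(data=y, rate=sr)

# In a notebook, an Audio object as the last expression of a cell renders a player.
player
```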

You will see the following soundbar in your notebook (and unlike in this image, you’ll actually be able to listen to the music):

What is sound? Playing a wav file in a Jupyter Notebook

As a next step, let’s display the music as a wave, thereby making the sound “visible to the eye”. Of course, as discussed before, real sounds are very complex, and they do not look anything like a simple wave.

You can use the following code to plot the wave:

What is sound? Showing the waveplot.

You will obtain the following plot:

What is sound? Printing the wave of the nutcracker.

As a next step, let’s move on to the most useful visualization of sound data, which is the spectrogram. You can obtain the spectrogram input data by applying a Fourier Transform to the wave data y.

As explained earlier, Fourier Transforms are relatively advanced mathematics. Luckily for us, we can use a function in librosa to do the heavy lifting for us, as shown in the code below:

What is sound? Generating the spectrogram data using a Fourier Transform.

You will obtain an array that looks as follows:

What is sound? Raw spectrogram data (amplitude)

One last step remains before creating the final spectrogram, and that is converting the spectrogram data into decibels. Decibels are a logarithmic scale, which is closer to how we perceive loudness. On the raw (amplitude) scale, a few loud frequencies dominate the plot and most of the detail stays invisible.

You can convert the data as follows:

What is sound? Converting the amplitude spectrogram data into decibel spectrogram data.

Now, we can finally use the spectrogram function to show us the spectrogram. This is done as follows:

What is sound? Printing the spectrogram.

You will obtain the following graph, which is a spectrogram of the Nutcracker clip. We have successfully imported a music file and converted it into a visually accessible data format. The spectrogram gives an overview of the volume at each frequency over the course of the clip, which makes it a very informative visual representation of music.

The spectrogram of the nutcracker.


In this article, you have learned the basics of working with sound and audio in a digital format. You have converted audio data from a raw waveform that can only be listened to into a visual representation: a spectrogram.

Now that you know how to import and prepare audio data, you are ready to move on to more advanced use cases like using machine learning for music genre classification or sound detection.

I hope this article has been useful to you. Thanks for reading, and do not hesitate to stay tuned for more stats, maths, and data content!

