# What is NMF?

NMF is a matrix factorization technique, much like Singular Value Decomposition (SVD), but we constrain the resulting matrices to be non-negative instead of orthogonal.

Given a matrix `X`, we want to decompose it into two matrices `W` and `H`, so that `X ≈ W x H`. We use the approximately equal sign `≈` because, unlike SVD, NMF yields an approximation of the original matrix. Moreover, every element of `W` and `H` is non-negative.

If matrix `X` holds images of faces, `W` would capture the facial features of those faces and `H` the relative importance of those features in each image.

Intuitively, we can think of the two matrices as follows: Assume that we have our matrix `X` in which every column is a vectorized image of a face. `W` expresses the facial features (e.g. noses, eyebrows, facial hair, etc.) and `H` captures the relative importance of features in each image.

Now that we have a perspective about what NMF accomplishes, we are ready to get our hands dirty. But first, we take another brief diversion into word normalization and TF-idf.

# Density vs Significance

Term Frequency — inverse document frequency (TF-idf) is a weight measure often used in information retrieval. Its role is to measure how important is a word to a document in a corpus.

TF-idf is composed of two terms. The first one, Term Frequency, computes the normalized frequency of a word appearing in a document. Thus, the number of times a word appears in a document, divided by the total number of words in that document.

TF measures how frequently a word occurs in a document, while idf computes how important a word is.

The second term is the Inverse Document Frequency (idf). This is computed as the logarithm of the number of documents in the corpus divided by the number of documents where the specific term appears. Thus, from the first term, we get the frequency of words, while the second one provides us with the importance of each word by weighting down the frequent words and scaling up the rare ones.

Now that we have a solid understanding of NMF and TF-idf, we are ready to apply ourselves to the problem at hand; how do we uncover more from just a movie plot?

# The dataset

For this experiment, we use the Wikipedia Movie Plots dataset created by JustinR under the CC BY-SA 4.0 creative commons license. This dataset contains information on 34,886 movies from around the world, such as title, genre, release year, director, plot, etc. The dimensions that we are going to use though are just title and plot.

In addition, we use a dataset of English common first names by Philippe Rémy for reasons we explain further down.

# Topic modeling with NMF

Let us now dive into the code and reap the benefits of our work. First, we need to load the data. We load the Wikipedia Movie Plots data in memory, sample 50% of the movies to avoid memory issues, and keep only the columns we care about (i.e. movie title and plot). We also load the English first names dataset.

Next, we prepare the movie plots, store them in a list, normalize and vectorize them using TF-idf. The aim is to build a matrix `X`, where rows resemble movies and the columns are the unique words appearing in all the plots combined. Thus, the cell `ij` in the matrix `X` captures how many times the word `j` appears in the plot of the movie `i`. For instance, in the matrix below we can see that word `2` appears `7` times in the plot of the movie `1`.

Let us now see that in python code. We use the handy scikit-learn’s `TfidfVectorizer` class to achieve our purpose. This class takes as an argument something called `stop_words`. These are common words, such as “a”, “is”, “the”, etc., that convey no actual meaning and offer us no value whatsoever.

In addition to these words, we also pass the English first names, so we can also exclude them from the process. After the vectorizer has completed processing the words, we can get the resulting vocabulary, to use later.

Now that we have our matrix `X`, we want to factorize it into two matrices `W` and `H` like we have seen before. For that, we can use the `NMF` factorizer provided by scikit-learn. The decomposition takes an argument named `n_components`. These are the number of topics, or in our case movie themes, we would like to get back.

We are ready now to look into the soul of each genre. We want to find what are the most important words for each topic. That process extracts the themes from the plots and models the film’s subject. We have everything we need apart from two helper functions to achieve that. Let us define them below.

The results are the top `10` words for each of the `18` topics, because we set `n_components` to `18`. So, what are they?

Some topics do not make sense at first sight, but most of them are quite representative of particular genres. For example, the second to last is about animation and cartoons, the second one is about crime, we have a western theme on the third row from the end and a medical theme somewhere in the middle. This is quite an impressive result for 10 to 15 lines of code!

# Conclusion

In this story, we explored Non-negative Matrix Factorization (NMF) and how we can use it for topic modeling. NMF has various applications in face decomposition, collaborative filtering, chemistry, gene expression, and more. You have now what you need to examine it further and use it in your projects.

This story is inspired by Rachel Thomas’ course on Computational Linear Algebra.