Exploring the script of a YouTube video through 2D projections of its bags of words


The main pipeline I explore in this article. Figure by author Luciano Abriata.



Applying PCA on the bags of words computed from the script to spread the video’s information on a 2D view.

Recently I started playing with the scripts of YouTube videos, which are either uploaded by creators along with their videos or transcribed automatically by the website through speech recognition. Currently, I’m looking for ways to display the content of a video graphically, so that I can quickly explore it without having to watch it all. In the longer term, my goal is to set up a full “video script explorer” that you can use online to quickly get an overview of what the different sections of a video talk about. Stay tuned, because this promises to be a fun project, and maybe a useful one too!

For the moment I have made some interesting progress that I will share here. I carried out all the steps manually, so I have not written any code for this yet. Briefly, here I show you how to (1) get the script of a YouTube video, (2) clean up the content of the sentences by removing stop words, symbols, etc., (3) reformat the data to have sentences of sizes reasonable for analysis, (4) convert the sentences into numbers, and (5) finally apply PCA on these numbers to display the results. It’s quite a simple approach, but the results make sense, at least on the script of a 50-minute video that presents 3 different stories on a related topic.

I hope this article teaches you a few basic things, and serves me as a stepping stone toward more advanced analyses and my own future video script-browsing tool.

1. Retrieving the script of a YouTube video

Not all YouTube videos have scripts available. If they do, then you will see “Open transcript” when you click the three dots in the bottom right of the video:

How to retrieve the script of a YouTube video, if it’s available (that’s when the content creator uploads it or when the system can transcribe it automatically). Figure by author Luciano Abriata.

You can select all the transcript text and paste it into your favorite program. When you paste the text, you’ll see that it results in a single column with alternating rows of time indexes and text, and also that it contains quite a lot of junk. Therefore you will need to perform some cleanup.
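This step was done by hand here. As a minimal sketch of how the pasted text could be split into time indexes and text in Python, assuming the rows strictly alternate and that time indexes look like "0:05" or "1:02:33" (both assumptions about the pasted format, and `parse_transcript` is a hypothetical helper name):

```python
import re

# Assumed shape of a time-index row, e.g. "0:05" or "1:02:33".
TIMESTAMP = re.compile(r"^\d{1,2}:\d{2}(?::\d{2})?$")

def parse_transcript(pasted_text):
    """Return (time_index, text) pairs from alternating rows of pasted text."""
    rows = [row.strip() for row in pasted_text.splitlines() if row.strip()]
    pairs = []
    current_time = None
    for row in rows:
        if TIMESTAMP.match(row):
            current_time = row            # remember the time index
        elif current_time is not None:
            pairs.append((current_time, row))
            current_time = None           # text rows without a time index are skipped
    return pairs

# Made-up sample, only to illustrate the assumed format.
sample = """0:00
[Music]
0:05
welcome to the program
0:09
today we visit three painters"""

pairs = parse_transcript(sample)
raw_lines = [text for _, text in pairs]
```

Half of the pasted rows are time indexes, so `raw_lines` ends up with half as many entries as the pasted column.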

Note: there are several programmatic ways to get the scripts of YouTube videos, but none of the methods I found worked consistently on all videos.

2. Cleaning up the content of the script by removing stop words, symbols, etc.

You can see that the script is split into very short “sentences”. In the video I analyzed (which is not the one shown in the figure above) I got 2032 lines, which actually means 1016 lines of raw text because every other line is a time index. That’s from a 50-minute video from a regular TV program in my country.

Many sentences don’t have any content at all and just indicate that a passage of the video is “[Music]” or carry other kinds of tags. I removed these lines as well as all symbols, numbers, and words of 3 or fewer characters, which are mostly conjunctions and noise from the automatic script generation process with not much content.
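The cleanup described above was manual. A rough Python equivalent, as a sketch only, could look like the following. It assumes tags always come in square brackets and uses word length as the only filter (no explicit stop-word list); `clean_line` and `clean_lines` are hypothetical helper names.

```python
import re

TAG = re.compile(r"\[[^\]]*\]")      # bracketed tags such as [Music]
LETTERS = re.compile(r"[^\W\d_]+")   # runs of letters only (Unicode-aware)

def clean_line(line, min_length=4):
    """Drop tags, symbols, numbers, and words shorter than min_length."""
    without_tags = TAG.sub(" ", line)
    words = LETTERS.findall(without_tags.lower())
    return " ".join(w for w in words if len(w) >= min_length)

def clean_lines(lines):
    """Clean every line and discard those left with no content."""
    cleaned = (clean_line(line) for line in lines)
    return [line for line in cleaned if line]
```

With `min_length=4`, words of 3 or fewer characters are removed, and the letter-only pattern keeps accented characters so that it also works on scripts that are not in English.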

3. Reformatting a script to get sentences of sizes reasonable for analysis

At this point, I had extracted 996 lines of text from the script. You can see that each line is quite short, containing between 1 and 10 words. (I suspect the true limit is given by the number of characters, as the system wants to ensure that the whole text fits the screen, but gaps of silence or music also produce shorter sentences.)

As they are, these raw lines of text are too short to be analyzed. Therefore I rearranged my 996 lines by merging every 12 consecutive lines into a single “sentence”. That means I now have 83 lines (996 / 12), each containing between 30 and 40 words.
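As a small sketch of this regrouping in Python (the helper name `merge_lines` and the placeholder lines are illustrative, not data from the actual script):

```python
def merge_lines(lines, group_size=12):
    """Join every `group_size` consecutive lines into one longer sentence."""
    return [
        " ".join(lines[start:start + group_size])
        for start in range(0, len(lines), group_size)
    ]

# Placeholder lines standing in for the 996 cleaned lines of the script.
lines = [f"line{i}" for i in range(996)]
sentences = merge_lines(lines)
print(len(sentences))
```

If the number of lines is not a multiple of `group_size`, the last sentence is simply shorter than the others.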

These 83 lines involve 847 distinct words (remember I already cleaned up all the stop words, short words, symbols, numbers, etc.). Of them, 75% appear only once in the whole bag of words, 15% appear twice, 4% appear three times, and 6% appear between 4 and 11 times.
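Percentages like these can be obtained by counting each word over the whole text and then binning the totals. A minimal sketch with Python's `collections.Counter`, using made-up toy sentences rather than the real script:

```python
from collections import Counter

def word_totals(sentences):
    """Total number of occurrences of each word over all sentences."""
    return Counter(word for s in sentences for word in s.split())

def frequency_shares(totals):
    """Fraction of distinct words that appear 1, 2, 3, or 4+ times in total."""
    n_words = len(totals)
    bins = Counter(min(count, 4) for count in totals.values())
    labels = {1: "1", 2: "2", 3: "3", 4: "4+"}
    return {label: bins[k] / n_words for k, label in labels.items()}

# Toy sentences, already cleaned, only for illustration.
toy = ["hotel painting hotel", "painting yellow hotel", "light hotel"]
totals = word_totals(toy)
shares = frequency_shares(totals)
```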

4. Converting the sentences into numbers

At this point I move from words to numbers. For this, I take the words I compiled and count how many times each one shows up in each of the 83 sentences. That gives me the “bag of words” for each sentence. In what follows I include only those words that appear 2 or more times in total, which means about 25% of the 847 words (i.e. 214 words).

Of course, most words do not appear in most sentences; however, the way I prepared the words and sentences implies that each word appears at least once in at least one sentence. Therefore I get a matrix that looks dominated by zeros but actually has at least one number > 0 in all its rows and columns.

Having kept only the words with 2 or more total counts, and having 83 sentences, at this point I have a matrix of 214 rows (words) and 83 columns (sentences). The following is a representation of that matrix where all 0s were removed and any number > 0 is shown as a black dot:

Non-zero elements of the matrix of counts for each word (row, only words with 2 or more total counts included) in each sentence (column). Figure by author Luciano Abriata.

You can see that the density of points increases as you go down. That’s because the rows (words) are ordered by increasing total occurrences.
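A matrix like this one could be built with a few lines of Python. The sketch below assumes the sentences are already cleaned and whitespace-separated; `count_matrix` is a hypothetical helper, and the toy sentences are made up.

```python
from collections import Counter

def count_matrix(sentences, min_total=2):
    """Return (vocab, matrix): one row per word, one column per sentence."""
    tokenized = [s.split() for s in sentences]
    totals = Counter(word for words in tokenized for word in words)
    # Keep words with enough total counts, ordered by increasing total
    # occurrences (ties broken alphabetically).
    vocab = sorted(
        (w for w, c in totals.items() if c >= min_total),
        key=lambda w: (totals[w], w),
    )
    per_sentence = [Counter(words) for words in tokenized]
    # A Counter returns 0 for words that are absent from a sentence.
    matrix = [[counts[w] for counts in per_sentence] for w in vocab]
    return vocab, matrix

toy = ["hotel painting hotel", "painting yellow hotel", "light hotel"]
vocab, matrix = count_matrix(toy)
```

Here "yellow" and "light" are dropped because they appear only once, and "hotel" ends up in the last row because it has the highest total count.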

Now that we have a numerical representation of the data (yes, I know it’s very simple and for sure has many problems, but it’s a start) we can begin the fun part of crunching it.

5. Applying PCA on the data and interpreting the results

First PCA attempt

Applying a simple PCA procedure to the matrix above already gives some meaningful results. To aid interpretation, I took advantage of the fact that the video presents 3 separate stories on a common topic: all 3 are about painting, but each one focuses on a different painter who is interviewed separately. In the next picture you can see the input matrix prepared above but colored by story number, and then the PCA plot where each dot (sentence) is colored according to the color of the story it was extracted from.

Left: same matrix as above but coloring the objects (sentences) by types (stories). Right: result of the PCA procedure, coloring each dot by the color that corresponds to the story they belong to. I ran PCA with this online tool that I explained in this TDS article. Figure by author Luciano Abriata.

You can see how stories 1 and 3 are rather separated, especially along PC2. Story 2, instead, remains roughly in the middle.
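The plots here come from the online tool mentioned in the caption, not from the code below. For readers who prefer to reproduce this kind of analysis locally, here is a minimal PCA sketch with NumPy based on the singular value decomposition. It assumes that sentences are the objects and words the variables, and that the counts are only mean-centered (not scaled); the online tool may preprocess the data differently.

```python
import numpy as np

def pca(matrix, n_components=2):
    """PCA of a words x sentences count matrix.

    Returns (scores, loadings, explained): sentence coordinates,
    word coefficients, and fraction of variance per component.
    """
    X = np.asarray(matrix, dtype=float).T     # sentences x words
    X = X - X.mean(axis=0)                    # center each word (column)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    loadings = Vt[:n_components].T
    # Assumes the data are not constant (otherwise the sum below is zero).
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, loadings, explained[:n_components]

# Toy matrix: 3 words (rows) x 4 sentences (columns), made-up counts.
toy_matrix = [[1, 1, 0, 0],
              [2, 1, 1, 0],
              [0, 0, 3, 1]]
scores, loadings, explained = pca(toy_matrix)
```

Each row of `scores` is one sentence, so plotting its first column against its second gives a PC1 vs PC2 view that can be colored by story.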

Playing with different PCA runs

What’s the effect of choosing words of higher or lower frequency? In my tests, PCA considering only the words that appear twice in the whole text does not produce any clear spreading of the data. Meanwhile, running PCA only on words that appear a total of 5 or more times (26 words) produces a better separation of the red dots from the green + blue dots (story 1 against stories 2 + 3):

Plot of the principal components computed on a matrix with only data for words that appear a total of 5 or more times in the story. I ran PCA with this online tool that I explained in this TDS article. Figure by author Luciano Abriata.
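Selecting words by their total frequency before running PCA amounts to filtering the rows of the count matrix. A small sketch, with a hypothetical helper `filter_by_total` and made-up counts:

```python
def filter_by_total(vocab, matrix, min_total=1, max_total=None):
    """Keep only the words (rows) whose total count lies in the given range."""
    kept = [
        (word, row)
        for word, row in zip(vocab, matrix)
        if sum(row) >= min_total and (max_total is None or sum(row) <= max_total)
    ]
    return [word for word, _ in kept], [row for _, row in kept]

# Toy data: one row of per-sentence counts for each word.
toy_vocab = ["light", "painting", "hotel"]
toy_matrix = [[1, 0, 1], [1, 1, 1], [3, 2, 0]]

frequent = filter_by_total(toy_vocab, toy_matrix, min_total=5)
only_twice = filter_by_total(toy_vocab, toy_matrix, min_total=2, max_total=2)
```

Setting `min_total=5` corresponds to the "5 or more times" selection, while `min_total=2, max_total=2` keeps only the words that appear exactly twice.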

Most interestingly, the loadings plot explains which words weigh most in the separation of points in the PC plot:

Coefficients (“loadings”) for each of the 26 words along principal components 1, 2 and 3. I ran PCA with this online tool that I explained in this TDS article. Figure by author Luciano Abriata

In such a plot, where we have one number per input variable (here words) per principal component (here the first three are shown), both positive and negative values matter. The positive and negative peaks in the last variable correspond to the word “hotel”, which appears a total of 11 times, all of them in story 1. By watching the video one understands that the whole of story 1 revolves around pieces of art that are currently exhibited in a hotel, and that the interview itself takes place in the hotel and even covers how the hotel was restored from ruins.

The negative peak at position 6 corresponds to the word “yellow”. This word is mentioned a total of 5 times, all of them in a single sentence of story 1 talking about the colors of the autumn, when the story was filmed. Such strong sensitivity to a single sentence is probably something to dampen, especially because the story is not particularly centered on the fall and its colors.
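Instead of reading the peaks off the plot, the words with the largest loadings can also be listed by sorting on absolute value, since both signs matter. A minimal sketch, where the loading values are made up for illustration and are not the ones in the figure:

```python
def top_loading_words(vocab, component_loadings, n=3):
    """Return the n (word, loading) pairs with the largest absolute loading."""
    ranked = sorted(
        zip(vocab, component_loadings),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    return ranked[:n]

# Made-up loadings along one principal component.
toy_vocab = ["look", "yellow", "hotel"]
toy_loadings = [0.1, -0.6, 0.8]
top = top_loading_words(toy_vocab, toy_loadings, n=2)
```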

Removing the word “yellow” widens the spread a bit, although it is still dominated by “hotel”:

Same analysis as above but leaving out the word “yellow”. I ran PCA with this online tool that I explained in this TDS article. Figure by author Luciano Abriata

Last, removing “hotel” to shift the focus of the PCA procedure to other words leads to less spread of the sentences and stresses the words “look”, “painting”, and “art”:

Same analysis as above but also leaving out the word “hotel”. I ran PCA with this online tool that I explained in this TDS article. Figure by author Luciano Abriata
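Leaving words out before rerunning PCA means deleting their rows from the count matrix. The sketch below does that and also reports which sentences are left without any counted word, since those no longer carry information. `drop_words` is a hypothetical helper and the counts are made up.

```python
def drop_words(vocab, matrix, excluded):
    """Remove the rows of excluded words; also list sentences left empty."""
    kept = [(w, row) for w, row in zip(vocab, matrix) if w not in excluded]
    new_vocab = [w for w, _ in kept]
    new_matrix = [row for _, row in kept]
    n_sentences = len(matrix[0]) if matrix else 0
    # Indices of columns (sentences) whose remaining counts are all zero.
    empty = [
        j for j in range(n_sentences)
        if all(row[j] == 0 for row in new_matrix)
    ]
    return new_vocab, new_matrix, empty

toy_vocab = ["look", "yellow", "hotel"]
toy_matrix = [[1, 0, 1], [0, 5, 0], [2, 0, 0]]
new_vocab, new_matrix, empty = drop_words(toy_vocab, toy_matrix, {"yellow"})
```

In this toy case the second sentence contained only “yellow”, so it is reported as empty once that word is removed.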


The procedure might not be the best, but it is very easy and it does have some power to spread out the contents of a script, being especially sensitive to words that are very frequent in only one of the stories. If made into an interactive web app where users can see the results update dynamically as words are removed or included, and possibly see the full sentences when they hover over the data points, I think this could be quite a powerful tool.

What do you think? What would you expect from a tool supposed to facilitate inspection of the contents in a video script (or any text for that matter)?


