Creating a Movie Rating Model Part 1: Data Gathering!




Starting the process of creating a movie rating predictive model by gathering the data we need

Hello there, friends! Welcome to a new series I’m starting with this post on creating a movie rating predictive model. Before jumping in, I do want to quickly mention that I definitely have not abandoned the Terraform + SageMaker series; I’ve just had a super busy summer and took a break from blogging in general for a while. More to come on that front, so stay tuned!

The reason I’m starting this new series is a fun one. I’m an avid listener of podcasts, and one of my favorites is called The Watercooler. It’s a light-hearted podcast where five coworkers get together at the end of the day to talk about, well, pretty much everything under the sun! They also have some recurring bits, including one where one of the hosts, Caelan Biehn, provides ratings on different movies. His ratings actually consist of two scores: one is a simple “approve / disapprove” while the other is a single decimal float score between 0 and 10, which he refers to as the “Biehn Scale.” (By the way, Caelan is the son of the actor Michael Biehn, who has starred in such movies as The Terminator and Aliens.)

The Watercooler guys are really engaged with their audience, so one time, as a joke, I threw out in their Facebook group the idea that I should create a movie rating predictive model around Caelan’s ratings. They all thought it was funny, and I thought it would be fun, so here we are!

Before delving into the primary purpose of this post — data gathering — let’s lay a foundation of what we’ll be doing across this whole series as we continue to build out our model along the way. And as always, you are more than welcome to follow along with the code in my GitHub repository.

Project Scope and Flow

Flow artwork created by the author

As mentioned above, Caelan provides two scores with each movie rating: a binary “approve / disapprove” and a 0 to 10 float score. Naturally, this means we are going to need to create two predictive models: one that handles the binary classification and another that handles the regression. I am intentionally not committing to a particular algorithm yet, as we will use a future post to evaluate a handful of algorithms and see which one performs best.

To that end, the flowchart above shows how we will progress throughout this series. In the first phase (which is this post), we’ll need to gather the data from various sources to support our model. Once we have gathered our data, we’ll need to perform all the appropriate feature engineering to prepare our data to be put through our respective algorithms. Next, we’ll try out a couple different algorithms along with hyperparameter tuning to see which one performs the best. Once we settle on which algorithms we want to leverage, we’ll create a full model training pipeline to actually create the predictive models. And finally, we’ll take our trained models and wrap them up in a nice inference API!

One thing to note before moving forward: I’m not necessarily trying to create a perfect model here. In my day job, I am a machine learning engineer, and the scope of that role is far more focused on model deployment than model creation. As a result, I am not as well versed in model creation, but I’m hoping to grow in this area with this series! And if you find along the way that there might be a better way to do something, please feel free to reach out by leaving a comment on any of the blog posts.

Okay, let’s jump into our first phase of this effort: data gathering!

Gathering the Movie Ratings Themselves

Naturally, if we want to create a movie rating model, we have to have the movie ratings to train our supervised models against! To my knowledge, nobody was saving Caelan’s movie ratings anywhere, so I had to do something that many people might dread: gather them myself. This meant going back through The Watercooler’s full back catalog and listening to every episode to note down any ratings Caelan gave. To support this, I created a Google Sheets spreadsheet where I collect and input the data via my iPhone. At the time this post is published, I’m only about 3/4 of the way through the full back catalog, but I think we have enough data to at least start this series.

(Side note: Caelan also rates other things besides movies, so you’ll find a lot more than just movies in that spreadsheet. I captured those other things just for fun, but they will get filtered out later to only focus on the movies. Also, The Watercooler can be NSFW as can be some of those non-movie ratings so… just be aware of that if you look at the full spreadsheet. 😂)

The nice thing is that I found a way to programmatically pull the data from a Google Sheet with Python, so long as you know the sheet’s sheet_id. After finding the ID for my own Google Sheet, I was able to use the script below to pull the updated movie reviews whenever I need to:

# Importing the Pandas library used throughout this script
import pandas as pd

# Defining the ID of the Google Sheet with the movie ratings
sheet_id = '1-8tdDUtm0iBrCdCRAsYCw2KOimecrHcmsnL-aqG-l0E'

# Creating a small function to load the data sheet by ID and sheet name
def load_google_sheet(sheet_id, sheet_name):
    url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
    df = pd.read_csv(url)
    return df

# Loading all the sheets and joining them together
df_main = load_google_sheet(sheet_id, 'main')
df_patreon = load_google_sheet(sheet_id, 'patreon')
df_mnight = load_google_sheet(sheet_id, 'movie_night')
df = pd.concat([df_main, df_patreon, df_mnight], axis = 0)

This little script gathers the data from the three respective tabs within the Google Sheet and loads them all into a unified Pandas DataFrame. Obviously, this Pandas DataFrame contains a lot more than just the movie reviews. To help filter it down, the Google Sheet contains a column called “Category” that we can use to keep just the movies. Here’s what the script to do that looks like:

# Keeping only the movies
df_movies = df[df['Category'] == 'Movie']

Finally, we can get rid of the columns we don’t need for our project: “Episode Number”, “Category”, and “Notes”. The code to do that is pretty simple:

# Removing the columns we do not need for the model
df_movies.drop(columns = ['Category', 'Episode Number', 'Notes'], inplace = True)

And what we are left with is something that looks like this!

Screenshot captured by the author

Alright! We can save this out with the Pandas DataFrame to_csv function and are ready to move on.
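For reference, that save is just a one-liner; the filename here is only an example:

# Saving the cleaned movie ratings out to a CSV file (filename is just an example)
df_movies.to_csv('raw_movie_ratings.csv', index = False)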

Gathering the Supporting Data

For those of you familiar with Kaggle datasets like the infamous Titanic project, you’ll know that Kaggle basically hands you all the data you need on a silver platter. This is obviously not the case with our project here, and gathering data to support any model like this can be a challenge. Some of the things a data scientist has to consider in this position include questions like the following:

  • Does the data to support a potential model like this exist?
  • What sort of features do I need to gather when creating my model?
  • By what means do I need to curate the data in its raw form?
  • Am I going to have to do a lot of feature engineering against the raw data?
  • Is the raw data actually representative of what the model will be making inferences about?
  • Does the data create an unfair / unethical bias toward one direction on inference?

The good news is that I can answer some of these questions pretty easily. Does the data exist? Well, I know IMDb exists, so I’m certain that movie data exists in some capacity! I also don’t necessarily have to worry about creating an unethical model, since the nature of this effort doesn’t create poor outcomes for anybody down the line. In case you’re wondering what an unethical effort might look like, I always use the example of creating a predictive model to pick the best CEO candidate based on 20th century data. Most 20th century CEOs were middle-aged white males, largely due to oppressive systems that held back women and minorities. Creating a model on top of that 20th century CEO information skews all inferences toward selecting a middle-aged white male, which is clearly bad and unethical.

The challenge here is going to be actually getting data from different sources to support our model. Truth be told, I had no idea at the outset of all the features I should be looking for. I had an intuition that comparing Caelan’s scores to other critic scores was probably a good idea, which might specifically manifest as gathering the Rotten Tomatoes scores for each movie. But at the end of the day, I had no idea what I was going to find. Moreover, how can I make sure that the movie data from one source matches another? For example, how can I ensure that when one source gives me data for Avengers: Endgame, another source doesn’t accidentally give me data for Avengers: Infinity War?

Fortunately, the data gathering wasn’t as difficult as I thought it would be thanks to the power of Google! Let’s now analyze the three sources I gathered my movie data from and how the first one in particular helped to maintain that consistency from one source to another.

(Quick note before moving on, everything discussed in this post from here on out is also captured in this singular Jupyter notebook. Feel free to reference it as we continue along!)

Data Source #1: The Movie Database (TMDb)

On Googling sources that might look promising for our project, the top choice that kept coming up was The Movie Database, often abbreviated as TMDb. While it has a very nice UI that can be browsed via the web, it also offers an equally great API to interact with. The only catch is that they try to limit how much traffic hits their API, so they require users to register via their website to obtain an API key. (Note: Please don’t ask me for my TMDb API key. I want to ensure I can stay in the free tier, and that’s only possible if I keep my API key to myself.)

Registering with TMDb is pretty straightforward if you follow the instructions behind this link. With your registration complete, you should see a screen that looks like this with your particular API key.

Screenshot captured by the author

As mentioned before, this API key is sensitive, so if you use it with your own code, I suggest you find a way to obscure it if you upload your code to GitHub. There are many ways to do this, but to keep things simple, I loaded my key into a file called keys.yml and then used a .gitignore file to ensure that file is not uploaded to GitHub. In case you’re not aware of what a .gitignore file is, it is a place where you can delineate anything you don’t want to push up to GitHub, whether that be a single specific file, a specific directory, or anything that ends with a specific extension. The .gitignore file itself doesn’t store anything sensitive, so you’re more than welcome to copy the one I used for this project.
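Just to illustrate the idea (this mirrors the kinds of entries in a typical .gitignore rather than the exact contents of mine), a .gitignore can be as simple as:

# Ignore the specific file holding the API keys
keys.yml
# Ignore an entire directory
data/
# Ignore anything that ends with a specific extension
*.log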

So how do you create and then use this keys.yml file? Let me create a separate test.yml file to show you how it works with non-sensitive information. First, let’s create that test.yml file and populate it as such:

test_keys:
  key1: s3cretz
  key2: p@ssw0rd!

Now we can use this little script to import the yaml library (which comes from the third-party PyYAML package) and parse through the test.yml file. Down below you’ll see a screenshot where I purposefully print the variable that contains the “sensitive” information itself, but note that you do NOT want to do this with your actual API key.

# Importing the PyYAML library
import yaml

# Loading the test keys from the separate, secret YAML file
with open('test.yml', 'r') as f:
    keys_yaml = yaml.safe_load(f)

# Extracting the test keys from the loaded YAML
key1 = keys_yaml['test_keys']['key1']
key2 = keys_yaml['test_keys']['key2']
Screenshot captured by the author

Okay, so that’s how we keep our API key safe, but how do we actually use it? Fortunately, there are several Python wrappers we can use to easily interact with TMDb using Python, and the one I chose to use is called tmdbv3api. The documentation for tmdbv3api can be found at this link, and we’ll walk through the basics of what I did to interact with TMDb’s API via this Python wrapper.

After installing with pip3 install tmdbv3api, the first thing we’ll need to do is instantiate some TMDb Python objects that will allow us to get the data that we need. Let me show you the code first, and I’ll explain what each of these objects does:

# Importing the tmdbv3api wrapper library
import tmdbv3api

# Instantiating the TMDb objects and setting the API key
# (tmdb_key was loaded from my keys.yml file, just like the test keys above)
tmdb = tmdbv3api.TMDb()
tmdb_search = tmdbv3api.Search()
tmdb_movies = tmdbv3api.Movie()
tmdb.api_key = tmdb_key

The first tmdb object is sort of a “master” object that instantiates our connection to the API using our provided API key. The tmdb_search object is what we’ll use to pass the movie’s name as a string into the API to get some basic information about the movie, particularly the movie’s unique identifier. We can then pass that unique identifier into the tmdb_movies object, which will give us a treasure trove of data to work with.

Let’s demonstrate this by searching for one of my favorite movies, The Matrix. Here’s how we can use tmdb_search to use a string to search for movies:

# Performing a preliminary search for the movie "The Matrix"
matrix_search_results = tmdb_search.movies({'query': 'The Matrix'})

If you were to look at those search results, you would find an array of results, each containing limited details about a movie. Now, just like any search engine, TMDb is going to attempt to return the most relevant results for your string first. As an example, let’s show what happens when I iterate through these results and pull out just the title of the movie from each search result entry.
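Here is roughly the loop behind that screenshot (a minimal sketch, assuming each search result entry can be indexed by its 'title' key, just like we index by 'id' a little further down):

# Iterating through the preliminary search results and printing out each movie's title
for search_result in matrix_search_results:
    print(search_result['title'])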

Screenshot captured by the author

As you can see, it did return the original Matrix movie first, followed by other subsequent Matrix movies, followed by some other random movies that have “The Matrix” in the title.

Here’s where we’re going to have to accept a risk with our search results… You’ll see down below that I’m basically going to create a Python for loop to iterate through every movie Caelan has rated. Unfortunately, I can’t feasibly verify that every movie I put through this search is going to have the correct one show up as the first result. For example, I know Caelan reviewed the newer remake of the older horror movie Pet Sematary, and without looking, I don’t know if TMDb is going to give me that newer version or the older classic. That’s just a risk we’re going to have to accept.

Doing this search alone isn’t going to cut it with TMDb. While these preliminary results do return some information about each movie, there’s actually another piece of TMDb that will give us all the details we want. Before we get those extra details, let’s look at the keys from the original search results so we can see just how many more details we’ll extract in the next step.
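Here is one quick way to peek at those keys yourself (a minimal sketch, assuming a search result entry can be converted to a plain dictionary, the same trick we use on the detailed results below):

# Printing out the keys available on the first preliminary search result
print(dict(matrix_search_results[0]).keys())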

Screenshot captured by the author

In order to do the more detailed search, we’ll need to pass the TMDb ID into the tmdb_movies object. We can get the TMDb ID from the original search results by looking at matrix_search_results[0]['id']. This is how we use that TMDb ID to extract a lot more details about the movie:

# Getting details about "The Matrix" by passing in the TMDb ID from the original search results
matrix_details = dict(tmdb_movies.details(matrix_search_results[0]['id']))

Let’s now look at the keys of this more detailed result as compared to the preliminary results:
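If you want to make that comparison yourself, here is a minimal sketch of one way to see which keys are new:

# Comparing the detailed result keys against the preliminary search result keys
preliminary_keys = set(dict(matrix_search_results[0]).keys())
detailed_keys = set(matrix_details.keys())
print(detailed_keys - preliminary_keys)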

Screenshot captured by the author

As you can see, we get significantly more details out of this search than before! Now, for brevity’s sake, I’m not going to spend much more time here showing you which features I’m going to keep. I will create a data dictionary as part of my GitHub repo’s README with all those details if you’d like to know more. But there is one thing I do want to point out before moving on to the next source. Remember when I said I was concerned about keeping the movie results consistent from one data source to another? Well, friends, I have some good news! As part of the TMDb results, they also include the imdb_id. This is great, as a unique identifier like this will definitely help us nail down the correct IMDb search results. And speaking of IMDb…

Data Source #2: IMDb

So you might be wondering: why did I opt for TMDb as the primary data source over IMDb? The answer is that while IMDb does have an official API, I was not able to use it. They seem to require some sort of special business case to request access, and I honestly didn’t feel like going through the hassle of getting that worked out. Fortunately, I did find an alternative that will do just fine for us: IMDbPy. IMDbPy (link to docs) is a Python library that somebody built to interact with data from IMDb. Now, I honestly don’t know how the author is sourcing this data. It could be that they have access to the official IMDb API, or maybe they’re doing some fancy screen scraping. In either case, this Python library gives us just the data we need.

By extracting the IMDb ID from the TMDb results, we can use that as the basis for our IMDb search. One thing to note, though: TMDb formats the IMDb ID a little oddly. Specifically, they append the characters “tt” to the beginning of every ID. No worries, however, as we can simply use a Python string slice to peel off those leading characters before putting the ID into the IMDb search, as sketched below.
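Here is a minimal sketch of that trimming, using the imdb_id field from the TMDb details we pulled above:

# TMDb gives us the IMDb ID with a leading "tt" (e.g. "tt0133093" for The Matrix)
raw_imdb_id = matrix_details['imdb_id']
# Slicing off the first two characters to get the bare ID that IMDbPy expects
matrix_imdb_id = raw_imdb_id[2:]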

After installing IMDbPy by running pip3 install imdbpy, we’re ready to instantiate our imdb_search object by running the following code:

# Importing the IMDb Python library
from imdb import IMDb
# Instantiating the IMDb search object
imdb_search = IMDb()

Now we can perform the search by popping in the IMDb ID as such:

# Obtaining IMDb search results about the movie "The Matrix"
matrix_imdb_results = dict(imdb_search.get_movie(matrix_imdb_id))

The only downside is that while IMDb returns a TON of information, most of it isn’t going to be useful for our model. Most of what IMDb returns is about the supporting cast and crew of the movie, which unfortunately doesn’t make for good features in a predictive model. In fact, the only bits of information we can extract from these results are imdb_rating and imdb_votes. That’s a little disappointing, but since TMDb returned so much information for us in the first pass, I’m not too worried about it.
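For completeness, here is a tiny sketch of pulling those two values out; the 'rating' and 'votes' keys are what IMDbPy used in my case, so treat the key names as something to double-check on your end:

# Extracting the IMDb rating and vote count from the IMDbPy results
imdb_rating = matrix_imdb_results.get('rating')
imdb_votes = matrix_imdb_results.get('votes')
print(imdb_rating, imdb_votes)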

With IMDb now complete, let’s move on to our third and final data source.

Data Source #3: The Open Movie Database (OMDb)

From the outset of this project, I knew one thing I absolutely wanted as features to support these models: the respective critic and audience scores from Rotten Tomatoes. So far, we haven’t obtained that data from either TMDb or IMDb.

Fortunately, the Open Movie Database (OMDb) shows some promise on this front! Actually using OMDb is going to be very similar to how we used TMDb, except that this time we can use the IMDb ID to search for movies in OMDb. Like TMDb, OMDb requires that you sign up for your own API key. You can do so by submitting your email on this form here, and they will email you the API key. As you’ll notice on that form, you are limited to 1,000 calls to the API per day, which is totally fine for our purposes. (Again, you would run into issues if you started sharing this API key with the world, so be sure to keep it to yourself!)

Because this blog post is already getting really long, I’m just going to paste in my script, which gets both the Rotten Tomatoes critic score and the Metacritic metascore. This script behaves very similarly to the TMDb one:

# Importing the OMDb client class from the omdb package, plus NumPy for handling missing values
import numpy as np
from omdb import OMDBClient

# Instantiating the OMDb client
omdb_client = OMDBClient(apikey = omdb_key)

# Iterating through all the movies to extract the proper OMDb information
# (df_all_data is the DataFrame of movies with the TMDb / IMDb features gathered above)
for index, row in df_all_data.iterrows():
    # Extracting the movie name from the row
    movie_name = row['movie_name']

    # Using the OMDb client to search for the movie results using the IMDb ID
    omdb_details = omdb_client.imdbid(row['imdb_id'])

    # Resetting the Rotten Tomatoes critic score variable
    rt_critic_score = None

    # Checking if the movie has any ratings populated under 'ratings'
    omdb_ratings_len = len(omdb_details['ratings'])

    if omdb_ratings_len == 0:
        print(f'{movie_name} has no Rotten Tomatoes critic score.')
    else:
        # Extracting out the Rotten Tomatoes score if available
        for rater in omdb_details['ratings']:
            if rater['source'] == 'Rotten Tomatoes':
                rt_critic_score = rater['value']

    # Populating the Rotten Tomatoes critic score appropriately
    if rt_critic_score:
        df_all_data.loc[index, 'rt_critic_score'] = rt_critic_score
    else:
        df_all_data.loc[index, 'rt_critic_score'] = np.nan

    # Populating the Metacritic metascore appropriately
    df_all_data.loc[index, 'metascore'] = omdb_details['metascore']

When all is said and done, these are the 18 features you should see associated with all our movies:

Screenshot captured by the author

Let’s go ahead and save this dataset as a CSV and wrap up this post!

The End of Data Gathering…?

Phew, we came a long way in this post! We obtained 18 different features across 3 different data sources, which I hope will give us a good foundation for generating new features in our next post on feature engineering, but… we’re not quite done with data gathering. One feature that I really wanted but was unable to obtain from any of these sources is the Rotten Tomatoes audience score. Because the official Rotten Tomatoes API is so difficult to get access to, I’m not even going to try.

Instead, I’m going to see if we can use a special technique called web scraping (or screen scraping) to get that information directly from the Rotten Tomatoes website. That is sort of a long process in and of itself, and this post here is already long enough as is. So that’s what we’ll do in the next post!

Thank you all for checking out this post! Hope you found it fun and informative. I’m also starting to live stream the coding process for this project on YouTube, and you can find all my happenings at my LinkTree. I actually already did the live streaming for this particular post and the next one on screen scraping, and replays are available on YouTube. Once again, thank you for reading! See you in the next post. 🎬
