Simple and hands-on machine learning project using sci-kit learn
In this post, I will show you how to build a movie recommender program using Python. This will be a simple project where we will be able to see how machine learning can be used in our daily life. If you check my other articles, you will see that I like to demonstrate hands-on projects. I think this is the best way to practice our coding skills and improve ourselves. Building a movie recommender program is a great way to get started with machine recommenders. After sharing the contents table, I would like to introduce you to recommendation systems.
- The Movie Data
- Data Preparation
- Content-Based Recommender
- Show Time
- Video Demonstration
Recommendation systems have been around with us for a while now, and they are so powerful. They do have a strong influence on our decisions these days. From movie streaming services to online shopping stores, they are almost everywhere we look. If you are wondering how do they know what you might buy after adding an “x” item to your cart, the answer is simple: Power of Data.
We may look very different from each other, but our habits can be very similar. And the companies love to find similar habits of their customers. Since they know that many people who bought “x” item also bought “y” item, they recommend you to add “y” item to your cart. And guess what, you are training your own recommender the more you buy, which means the machine will know more about you.
Recommendation systems are a very interesting field of machine learning, and the cool part about them is that they are all around us. There is a lot to learn about this topic, to keep things simple I will stop here. Let’s begin building our own movie recommender system!
The Movie Data
I’ve found great movie data on Kaggle. If you haven’t heard about Kaggle, Kaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals.
Here is the link to download the dataset.
- The data folder contains the metadata for all 45,000 movies listed in the Full MovieLens Dataset.
- The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts, and vote averages.
- This data folder also contains a rating file with 26 million ratings from 270,000 users for all 45,000 movies.
First things first, let’s start by importing and exploring our data. After you download the data folder, you will multiple dataset files. For this project, we will be using the movies_metadata.csv dataset. This dataset has all we need to create a movie recommender.
import pandas as pd#load the data
movie_data = pd.read_csv('data/movie_data/movies_metadata.csv', low_memory=False)movie_data.head()
Content Based Recommender
Content based recommender is a recommendation model that returns a list of items based on a specific item. A nice example of this recommenders are Netflix, YouTube, Disney+ and more. For example, Netflix recommends similar shows that you watched before and liked more. With this project, you will have a better understanding of how these online streaming services’ algorithms work.
Back to the project, as an input to train our model we will use the overview of the movies that we checked earlier. And then we will use some sci-kit learn ready functions to build our model. Our recommender will be ready in four simple steps. Let’s begin!
1. Define Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizertfidf_vector = TfidfVectorizer(stop_words='english')movie_data['overview'] = movie_data['overview'].fillna('')tfidf_matrix = tfidf_vector.fit_transform(movie_data['overview'])
Understanding the above code
- Importing the vectorizer from sci-kit learn module. Learn more here.
- Tf-idf Vectorizer Object removes all English stop words such as ‘the’, ‘a’ etc.
- We are replacing the Null(empty) values with an empty string so that it doesn’t return an error message when training them.
- Lastly, we are constructing the required Tf-idf matrix by fitting and transforming the data
2. Linear Kernel
We will start by importing the linear kernel function from sci-kit learn module. The linear kernel will help us to create a similarity matrix. These lines take a bit longer to execute, don’t worry it’s normal. Calculating the dot product of two huge matrixes is not easy, even for machines 🙂
from sklearn.metrics.pairwise import linear_kernelsim_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)
Now, we have to construct a reverse map of the indices and movie titles. And in the second part of the Series function, we are cleaning the movie titles that are repeating with a simple function called drop_duplicates.
indices = pd.Series(movie_data.index, index=movie_data['title']).drop_duplicates()indices[:10]
4. Finally — Recommender Function
def content_based_recommender(title, sim_scores=sim_matrix):
idx = indices[title] sim_scores = list(enumerate(sim_matrix[idx])) sim_scores = sorted(sim_scores, key=lambda x: x, reverse=True) sim_scores = sim_scores[1:11] movie_indices = [i for i in sim_scores] return movie_data['title'].iloc[movie_indices]
Well done! It’s time to test our recommender. Let’s see in action and how powerful it really is. We will run the function by adding the movie name as a string as a parameter.
Similar Movies to “Toy Story”
Congrats!! You have created a program that recommends movies to you. Now, you have a program to run when you want to choose your next Netflix show. Hoping that you enjoyed reading this hands-on project. I would be glad if you learned something new today. Working on hands-on programming projects like this one is the best way to sharpen your coding skills.
Feel free to contact me if you have any questions while implementing the code.