Predictive CA2
Abstract
The movie recommender system is designed to suggest movies to users based on their preferences,
addressing the problem of information overload in online streaming platforms. The primary objective
of this project is to develop a recommendation model that delivers personalized movie suggestions
by analyzing user interests and movie features. The system employs several techniques, including
feature selection, feature extraction, stemming, Term Frequency-Inverse Document Frequency (TF-
IDF), and cosine similarity, to process and analyze the data.
Feature selection and extraction techniques are used to identify relevant attributes from the movie
dataset, while stemming helps in normalizing textual data by reducing words to their root forms. The
TF-IDF method is applied to convert the textual information, such as movie descriptions or genres,
into numerical values that represent the importance of each term. Cosine similarity is then used to
measure the similarity between users’ preferences and movie features, allowing the system to
recommend movies that are most aligned with the users’ tastes.
The expected outcome is a functional movie recommender system that can provide accurate and
personalized suggestions, enhancing user experience on streaming platforms.
Introduction
• Problem Statement:
With the vast number of movies available on streaming platforms, users often struggle to find films
that match their tastes. Traditional recommendation systems, which rely on user ratings, face
challenges like the cold start problem. This project addresses these issues by developing a content-
based filtering movie recommendation system that focuses on movie attributes—such as genre, cast,
and plot summaries—rather than user ratings.
• Objective:
The objective is to create a machine learning model that accurately recommends movies based on
user preferences and past interactions. The project aims to enhance user experience by providing
relevant suggestions, utilize advanced feature engineering techniques, evaluate model performance
with metrics like precision and recall, and establish a feedback mechanism for continuous
improvement. Ultimately, the goal is to deliver personalized recommendations that increase user
engagement and satisfaction.
Literature Review
Movie recommendation systems have become increasingly popular in recent years, addressing the
challenge of providing personalized content to users. Various approaches and methods have been
developed to solve this problem, with the most common techniques being collaborative filtering,
content-based filtering, and hybrid methods.
Existing Approaches
1. Collaborative Filtering
Collaborative filtering is a popular technique that recommends movies based on the
Cini-gram Precision Index
preferences of users with similar tastes. It can be further divided into user-based and item-
based approaches. In user-based collaborative filtering, recommendations are generated
based on the ratings given by users who share similar preferences. In item-based
collaborative filtering, the system recommends movies that are similar to the ones a user has
liked in the past. Although collaborative filtering can be effective, it suffers from issues like
the cold-start problem, where recommendations are difficult to generate for new users or
items.
2. Content-Based Filtering
Content-based filtering methods recommend movies based on the similarity between the
content features of the movies and the user's preferences. These techniques often use
metadata such as movie genres, plot summaries, and actors. The main limitation of content-
based filtering is that it may not capture user preferences effectively over time, as it relies on
predefined content features rather than user behavior patterns.
3. Hybrid Methods
Hybrid recommendation systems combine collaborative filtering and content-based
approaches to leverage the strengths of both techniques while minimizing their weaknesses.
These systems can provide more accurate recommendations by incorporating user behavior
and content features. However, hybrid methods can be computationally expensive and
complex to implement.
Several research papers and projects have contributed to the development of this movie
recommender system:
• "Content-Based Recommendation Systems" (Pazzani & Billsus, 2007) describes how TF-IDF
term weighting can represent text data in recommendation systems, which influenced the
feature extraction approach used in this project.
While existing methods provide a solid foundation, they often encounter challenges such as limited
accuracy due to sparse datasets, scalability issues, and the cold-start problem. Additionally, many
systems do not perform effective preprocessing of text data, which can lead to suboptimal
recommendations. This project addresses these gaps by incorporating techniques such as:
• Text preprocessing (stemming) to normalize textual data for better similarity calculations.
• TF-IDF with cosine similarity for a more refined measurement of similarity between user
preferences and movie attributes, resulting in more precise recommendations.
Data Collection
Dataset
2. tmdb_5000_credits.csv: This dataset includes cast and crew information for the same set of
movies, detailing the actors, directors, producers, and other crew members involved in the
production.
These datasets were chosen because they provide a comprehensive set of features required for
building a robust recommendation system, including both text-based data (e.g., movie overview) and
categorical data (e.g., genres, cast). The datasets were merged to form a unified dataset for analysis,
using the "movie_id" column as a common identifier.
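The merge described above can be sketched as an inner join on "movie_id". This is a minimal, standard-library illustration rather than the notebook's actual code: the toy rows, the titles, and the helper name merge_on_movie_id are assumptions made for the example. In a pandas workflow the same step would typically be a DataFrame merge on that column.

```python
# Toy stand-ins for the two CSV files; the real files have many more columns.
movies = [
    {"movie_id": 1, "title": "Movie A", "overview": "Heroes defend a space colony."},
    {"movie_id": 2, "title": "Movie B", "overview": "A spy follows a cryptic message."},
    {"movie_id": 3, "title": "Movie C", "overview": "No credits row exists for this one."},
]
credits = [
    {"movie_id": 1, "cast": ["Actor One", "Actor Two"]},
    {"movie_id": 2, "cast": ["Actor Three"]},
]


def merge_on_movie_id(left, right):
    """Inner join: keep only movies that appear in both tables (hypothetical helper)."""
    right_by_id = {row["movie_id"]: row for row in right}
    merged = []
    for row in left:
        match = right_by_id.get(row["movie_id"])
        if match is not None:
            # Combine the columns of both rows into one record.
            merged.append({**row, **match})
    return merged


merged = merge_on_movie_id(movies, credits)
print([m["title"] for m in merged])  # ['Movie A', 'Movie B']
```

An inner join silently drops movies without a credits row, so it is worth comparing row counts before and after the merge.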
Data Preprocessing
The preprocessing steps undertaken to prepare the data for the recommendation model include:
1. Merging Datasets
2. Handling Missing Values
o Missing or null values were identified across the merged dataset. Rows missing
critical fields (e.g., movie title or genre) were removed, because those fields are
essential for building the model.
o For non-critical features, missing values were filled using placeholder values or by
applying statistical imputation techniques (e.g., filling missing numerical data with
the mean or median).
3. Removing Duplicates
o Duplicate rows were checked and removed to ensure that each movie entry was
unique, avoiding redundancy in recommendations.
o Text Data Extraction: Text-based fields (e.g., movie "overview") were extracted to
create feature vectors that can represent the movie content in a meaningful way for
similarity calculation.
o Genres and Keywords Parsing: The "genres" and "keywords" columns contained
JSON-like structures, which were parsed and transformed into lists of relevant terms.
5. Text Preprocessing
o Stemming: Applied stemming to normalize the text by reducing words to their base
or root forms (e.g., "playing" becomes "play").
o Lowercasing: Converted all text to lowercase to standardize the data and avoid case-
sensitive mismatches.
o Categorical data such as genres, cast, and crew were transformed into a suitable
format for model training. For example, the names of genres and cast members were
converted into vectors using custom encoding techniques.
o Combining Features: The "genres," "keywords," "overview," and other features were
combined into a single feature string to create a consolidated representation of each
movie.
7. Feature Vectorization
o The TF-IDF vectors enable the system to measure the weight of each term (word)
within the context of movie descriptions, accounting for the frequency of the term in
individual documents and across the dataset.
8. Similarity Calculation
o Cosine Similarity: Used to calculate the similarity between movies based on the TF-
IDF vectors. Cosine similarity measures the cosine of the angle between two vectors,
providing a score that indicates how similar the two movies are.
o This similarity score was used to rank movies and recommend those with the highest
similarity to a user's selected movie.
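The cleaning, parsing, and feature-combination steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the notebook's code: the helper names (clean, parse_names, toy_stem, build_tags) and the toy rows are hypothetical, the suffix-stripping stemmer is a deliberately crude stand-in for a real stemmer such as a Porter stemmer, and removing spaces inside multi-word names is an assumed design choice.

```python
import json


def clean(rows):
    """Drop rows missing a title, fill missing overviews, remove duplicate titles."""
    seen, kept = set(), []
    for row in rows:
        if not row["title"] or row["title"] in seen:
            continue
        seen.add(row["title"])
        kept.append({**row, "overview": row["overview"] or ""})
    return kept


def parse_names(raw):
    """Turn a JSON-like string such as '[{"id": 28, "name": "Action"}]' into ['Action']."""
    return [item["name"] for item in json.loads(raw)]


def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_tags(row):
    """Combine overview, genres and keywords into one lowercased, stemmed string."""
    tokens = row["overview"].split()
    # Assumption: drop spaces inside multi-word names so each name stays one token.
    tokens += [name.replace(" ", "") for name in parse_names(row["genres"])]
    tokens += [name.replace(" ", "") for name in parse_names(row["keywords"])]
    return " ".join(toy_stem(tok.lower()) for tok in tokens)


rows = [
    {"title": "Movie A", "overview": "Heroes playing with fire",
     "genres": '[{"id": 28, "name": "Action"}]',
     "keywords": '[{"id": 1, "name": "space war"}]'},
    {"title": "Movie A", "overview": "Heroes playing with fire",  # duplicate row
     "genres": '[{"id": 28, "name": "Action"}]',
     "keywords": '[{"id": 1, "name": "space war"}]'},
    {"title": None, "overview": "Row without a title", "genres": "[]", "keywords": "[]"},
    {"title": "Movie B", "overview": None,
     "genres": '[{"id": 18, "name": "Drama"}]', "keywords": "[]"},
]

cleaned = clean(rows)
tags = {row["title"]: build_tags(row) for row in cleaned}
print(tags["Movie A"])  # hero play with fire action spacewar
```

The resulting tag strings are the input to the vectorization step; real overviews would also need punctuation handling, which this sketch omits.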
Methodology
❖ Algorithm Selection:
• Text Vectorization:
While the exact method is not explicitly mentioned, content-based filtering typically employs
one of the following vectorization techniques:
❖ Model Building
After vectorization, the algorithm calculates the similarity between movies using cosine
similarity. This measures the cosine of the angle between two vectors, representing how similar
two sets of text data are based on their feature vectors.
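As a small worked illustration of this measure, the sketch below computes cosine similarity directly from its definition (dot product divided by the product of the vector lengths). The function name and the three example vectors are assumptions chosen for the example. Returning 0.0 for an all-zero vector is a convention adopted here, not something the report specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector (assumption)
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, twice the length
c = [0.0, 0.0, 3.0]  # shares no non-zero dimension with a

print(cosine_similarity(a, b))  # approximately 1.0
print(cosine_similarity(a, c))  # 0.0
```

Vectors a and b differ only in length, so their score is at the maximum. Vectors a and c share no terms, so their score is zero.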
❖ Feature Engineering
• Feature Selection:
o The model uses the 'tags' column from the dataset. This column is the single combined
feature string built during preprocessing from genres, keywords, the overview, and
other fields such as cast. These are treated as the most relevant descriptors for each movie.
• Text Preprocessing:
o The tags undergo several preprocessing steps before being used for similarity
calculation:
• Dimensionality Reduction:
o Stemming collapses inflected forms of a word into a single term, which shrinks the
vocabulary and therefore the dimensionality of the vectorized text features.
o Similarity Matrix:
▪ After vectorizing the tags, the next step is computing the similarity between
different movies. This is done using cosine similarity, which measures
the cosine of the angle between two vectors. Cosine similarity works well in
text-based models because it compares the direction of two feature
vectors rather than their length, so long and short descriptions remain comparable.
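The vectorization and similarity-matrix steps can be sketched without external libraries as shown below. This is an assumption-laden illustration, not the notebook's implementation: the function names and the three tag strings are hypothetical, tokens are split on whitespace only, and the IDF uses a smoothed form, ln((1 + N) / (1 + df)) + 1. Because each vector is scaled to unit length, the pairwise cosine similarity reduces to a dot product.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def similarity_matrix(vectors):
    """Pairwise cosine similarity; a dot product suffices for unit-length vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


tags = [
    "space alien war action",
    "space alien rescue action",
    "romance paris love drama",
]
vocab, vectors = tfidf_vectors(tags)
sim = similarity_matrix(vectors)

# The first two tag strings share three of four terms; the third shares none.
print(sim[0][1] > sim[0][2])  # True
```

The matrix is symmetric with ones on the diagonal, so a full implementation only needs to compute one triangle. With thousands of movies, a sparse representation would be more practical than these dense lists.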
Unlike classification or regression models, content-based filtering systems like this don’t follow the
conventional split of training and test datasets. Instead, the entire dataset is used to compute
similarities.
• Dataset Splitting:
o The notebook does not explicitly mention splitting the dataset into training and test
sets. In content-based recommendation systems, training-testing splits are
uncommon because recommendations are typically made for all items based on
their content. However, evaluation can still be done using alternative techniques like
user-based evaluation or offline metrics.
• Evaluation Process:
o The primary objective of this model is to recommend movies based on content
similarity. Recommendations are provided by finding the top 'n' most similar movies
to the target movie. For instance, in the notebook, a function is used to recommend
similar movies when a user inputs a particular movie title (e.g., "Harry Potter and the
Half-Blood Prince"). The system returns the top 5 movies that are most similar based
on content.
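A top-n lookup of this kind can be sketched as below. The function signature, the placeholder titles, and the hand-written similarity values are assumptions for illustration. The notebook's actual function may differ in name, arguments, and output format.

```python
def recommend(title, titles, sim, n=5):
    """Return the n titles most similar to `title`, best match first."""
    if title not in titles:
        raise KeyError(f"Unknown title: {title!r}")
    idx = titles.index(title)
    # Rank every other movie by its similarity to the selected one.
    ranked = sorted(
        (i for i in range(len(titles)) if i != idx),
        key=lambda i: sim[idx][i],
        reverse=True,
    )
    return [titles[i] for i in ranked[:n]]


# Hypothetical titles and a hand-written symmetric similarity matrix.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
sim = [
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.6],
    [0.4, 0.3, 0.6, 1.0],
]

print(recommend("Movie A", titles, sim, n=2))  # ['Movie B', 'Movie D']
```

The selected movie is excluded from its own results. Otherwise it would always rank first with a score of 1.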
Model Evaluation
Results
The movie recommender system's performance was evaluated based on the accuracy of the
recommendations provided by the content-based filtering model. The system relies on cosine
similarity to measure how closely a movie matches the user's preferences.
To visualize the performance, a table of recommended movies based on a sample user's input is
presented:
These similarity scores reflect how well each movie aligns with the selected movie's features, based
on the TF-IDF vector representations of the content (genres, keywords, and overview). The higher
the score, the more relevant the recommendation is.
To further assess performance, a graph showing the distribution of similarity scores for the top 10
recommended movies can be plotted.
Evaluation Metrics
In this project, cosine similarity was the primary measure used to assess how similar movies
are to each other. It is the cosine of the angle between two vectors and in general ranges
from -1 to 1. Because TF-IDF weights are non-negative, the scores here fall between 0 and 1, where:
• A score of 1 indicates that the two movies are identical in terms of their feature vectors.
The higher the cosine similarity score, the more likely the recommended movie will match the user's
preferences.
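The project objective also names precision and recall as evaluation metrics. A minimal sketch of how they could be computed for one recommendation list is given below. It assumes a set of titles judged relevant for the user is available. That set, the placeholder titles, and the function name are hypothetical, since the report does not describe such labels.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one ranked recommendation list."""
    top_k = recommended[:k]
    hits = sum(1 for title in top_k if title in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked output and hypothetical relevance judgments.
recommended = ["Movie B", "Movie D", "Movie C", "Movie E", "Movie F"]
relevant = {"Movie B", "Movie C", "Movie G"}

precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(precision)  # 0.4
```

Two of the five recommendations are relevant, giving a precision of 0.4. Two of the three relevant titles were retrieved, giving a recall of about 0.67.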
Error Analysis
While the movie recommender system performed well on most recommendations, a few limitations
can be identified:
1. Cold-Start Problem:
o The model struggled to provide accurate recommendations for new movies or users
with limited interaction history. Since the system relies on existing movie features
and similarities, the absence of sufficient user data or movie attributes (genres,
keywords) made it difficult to recommend relevant movies.
o The model tended to favor popular movies with extensive metadata, often
recommending well-known blockbusters over niche or less popular films. This is due
to the abundance of descriptive features (e.g., keywords, cast) available for popular
movies, which results in higher similarity scores.
o Cosine similarity measures the angle between two feature vectors and ignores their
magnitude. The vectors also carry no information about a movie's popularity. As a result,
two movies might have a high similarity score even if one is a minor independent film and
the other is a major release.
4. Feature Overlap:
o The model sometimes recommended movies that shared genres or keywords but
were not thematically similar. For instance, two movies with the same genre (e.g.,
"action") but with different tones or narratives (e.g., a superhero movie vs. a spy
thriller) were sometimes incorrectly recommended together.
The movie recommender system, based on content-based filtering, showed promising results by
providing personalized movie recommendations using cosine similarity on TF-IDF vectorized features.
The model effectively captured the relationships between movies by analyzing their descriptions,
genres, and keywords, and then recommending similar movies.
One of the key findings was that the model was particularly effective at recommending movies with
overlapping genres and themes. For example, if a user liked a sci-fi movie such as Inception, the
system would recommend other popular sci-fi or mind-bending films like Interstellar and The Matrix.
This shows the model's strength in identifying content similarities.
However, the model exhibited some limitations, particularly when it came to movies that belong to a
broad genre or share generic keywords. For instance, action movies were sometimes clustered
together, despite varying dramatically in tone or content (e.g., superhero movies vs. war dramas).
The similarity scores were heavily influenced by shared terms in the descriptions, which can
sometimes lead to recommendations that lack depth in their thematic matching.
Significance of Findings
Despite its strengths, the model's bias toward popular and feature-rich movies highlights the need
for more advanced techniques to improve recommendation quality, especially for niche or
underrepresented content.
When compared to more advanced recommendation techniques like collaborative filtering and
hybrid models, the content-based approach shows some trade-offs:
• Collaborative Filtering: This approach requires large amounts of user interaction data
(ratings, views), which makes it less effective in cold-start situations (new users or
movies with few ratings). Collaborative filtering tends to outperform content-based systems
when user behavior data is available because it captures user taste patterns better.
• Hybrid Models: These combine both collaborative filtering and content-based filtering to
balance the strengths of both. For example, Netflix's recommendation system is a well-
known hybrid model, leveraging user ratings as well as content similarities. In general, hybrid
models achieve the highest accuracy but are computationally more complex and resource-
intensive to implement.
While our content-based model performs reasonably well on movie data, hybrid models are likely to
provide more accurate recommendations by capturing both user behavior and content features.
The choice of content-based filtering with TF-IDF and cosine similarity worked well because it allows
the system to leverage the detailed information available in the movie descriptions, genres, and
keywords. TF-IDF effectively weighted the terms that were most relevant to each movie, making the
similarity scores more meaningful. Cosine similarity, being a simple and interpretable metric, allowed
us to quantify the degree of similarity between movies in a straightforward way.
For users interested in specific genres, actors, or plot types, this approach is highly effective, as it
focuses on the content of the movies rather than relying on other users’ ratings.
2. Hybrid Models: Hybrid models often outperform both content-based and collaborative
filtering alone because they leverage multiple data sources. These systems can identify users
with similar tastes while also considering movie features. This ensures that even movies
without extensive user ratings can still be recommended based on their content. Hybrid
models would address the cold-start issue and improve recommendations for niche content
by combining both user behavior and movie features.
Conclusion
In this project, a content-based movie recommender system was successfully implemented using
Python. The model relies on TF-IDF vectorization to represent movie content (genres, keywords, and
overviews) and cosine similarity to recommend movies that are similar in terms of features. The
system was able to suggest relevant movies based on the content features of a given movie,
providing personalized recommendations to users.
Summary of Outcomes
The recommender system performed well in suggesting movies that share common themes, genres,
and other textual attributes with the selected movie. It was particularly useful in providing
recommendations for movies with rich metadata (e.g., well-defined genres, popular cast and crew).
Cosine similarity gave the system a simple, consistent way to quantify the similarity between
movies based on their content.
In a real-world scenario, this movie recommender system can be useful for platforms like streaming
services (Netflix, Amazon Prime) and movie databases (IMDb, TMDb) where users seek personalized
movie suggestions. The content-based approach allows the system to provide recommendations
even when user interaction data (such as ratings) is limited or unavailable, making it an effective
solution in scenarios where user behavior data is sparse. Additionally, this system can help users
discover new movies based on their preferences for certain genres or themes, thereby enhancing
user engagement.
Although the content-based filtering approach was effective, there are several areas for
improvement and potential future work:
1. Ensemble Methods: Future work could explore the use of ensemble methods to improve
recommendation accuracy. By combining multiple recommendation algorithms, the system
can achieve better performance across a wider range of movies and users.
2. Sentiment Analysis: Incorporating sentiment analysis into the model could further improve
recommendations by analyzing the sentiment expressed in user reviews or movie
descriptions, helping to match movies that align with the user's emotional preferences.
3. Handling Cold-Start Issues: Developing a method to handle cold-start problems (i.e., when
there is little to no data on new users or movies) could significantly improve the model’s
performance, especially for new users or obscure movies.
4. Incorporating Advanced NLP Techniques: Using more advanced natural language processing
techniques, such as word embeddings (e.g., Word2Vec or BERT), could capture deeper
semantic relationships between movies, leading to more accurate recommendations.
Appendices
Notebook link : https://pdf.ac/3uOXyZ (PDF)
Streamlit code :