Movie Recommender System (Final Report)
MACHINE LEARNING
By
M. Harika Reddy(19STUCHH010116)
Pavan Katne(19STUCHH010178)
S. Vinay kumar(19STUCHH010289)
We would like to express our gratitude to our director, Dr. K. L. Narayana, of the Faculty of Science and Technology (ICFAI Foundation for Higher Education, Hyderabad) for giving us this opportunity.
We are very thankful to Dr. P. Pavan Kumar for his constant supervision, guidance, and cooperation throughout the project. His useful suggestions for this work and cooperative behavior are sincerely acknowledged.
We would also like to extend our sincere thanks to everyone who helped us complete this project.
ABSTRACT
A recommendation engine filters data using different algorithms and recommends the most relevant items to users. It first captures the past behavior of a customer and, based on that, recommends products the user is likely to buy.
If a completely new user visits an e-commerce site, the site will not have any past history for that user. So how does the site recommend products to the user in such a scenario? One possible solution is to recommend the best-selling products, i.e. the products that are in high demand. Another possible solution is to recommend the products that would bring the maximum profit to the business.
Three main approaches are used in our recommender system. The first is demographic filtering: such systems offer generalized recommendations to every user, based on movie popularity and/or genre, recommending the same movies to users with similar demographic features. Since each user is different, this approach is considered too simple. The basic idea behind it is that movies that are more popular and critically acclaimed will have a higher probability of being liked by the average audience. The second is content-based filtering, where we try to profile the user's interests using the information collected, and recommend items based on that profile. The third is collaborative filtering, where we try to group similar users together and use information about the group to make recommendations to the user.
In this fast-paced world, entertainment is a necessity for each of us to refresh our mood and energy. Entertainment restores our confidence for work, so we can work more enthusiastically. To revitalize ourselves, we can listen to our preferred music or watch movies of our choice. For watching favourite movies online we can use movie recommendation systems, which are more reliable, since manually searching for preferred movies requires more time than one can afford to waste. In this report, to improve the quality of a movie recommendation system, content-based filtering is presented in the proposed methodology, and comparative results show that the proposed approach improves the accuracy, quality and scalability of the movie recommendation system over the pure approaches on three different datasets.
TABLE OF CONTENTS
1. Introduction
1.1 Motivation
1.4 Methodology
2. Literature Review
3. Types of Recommender System
4. Collaborative Filtering Based Systems
4.4 Comparison
5. Famous Recommender System
5.1 E-Commerce
6. Data Preprocessing
7. Data Normalization
8. Vectorization
9. Tokenization
10. Stop Words
11. Lemmatization
13. Conclusion
14. References
15. Appendices
1. INTRODUCTION
A recommendation system, or recommendation engine, is a model used for information filtering: it tries to predict the preferences of a user and provide suggestions based on those preferences. These systems have become increasingly popular and are widely used today in areas such as movies, music, books, videos, clothing, restaurants, food and places. They collect information about a user's preferences and behaviour, and then use this information to improve their suggestions in the future.
Movies are part and parcel of life. There are different types of movies: some for entertainment, some for educational purposes, some animated for children, and some horror movies or action films. Movies can easily be differentiated by genre, such as comedy, thriller, animation or action; another way to distinguish among movies is by release year, language, director and so on. When watching movies online, there is a huge number of movies to search through for the ones we like. Movie recommendation systems help us find our preferred movies among all these different types, and hence save us the trouble of spending a lot of time searching for them. This requires that the movie recommendation system be very reliable and recommend movies that exactly or most closely match our preferences. Such recommendations also enrich a user's shopping experience. Recommendation systems have several benefits, the most important being customer satisfaction and revenue. A movie recommendation system is a very powerful and important system, but, due to the problems associated with the pure collaborative approach, movie recommendation systems also suffer from poor recommendation quality and scalability issues.
1.1 Motivation
Nowadays, a recommender system can be found in almost every information-intensive website. For example, a list of likely preferred products is recommended to a customer browsing a product on Amazon. Moreover, when watching a video clip on YouTube, a recommender system suggests relevant videos to users by learning from the behaviour the users generated previously. So to speak, recommender systems have deeply changed the way we obtain information. Recommender systems not only make it easier and more convenient for people to receive information, but also hold great potential for economic growth. As more and more people realise the importance and power of recommender systems, the exploration of designing high-quality recommender systems has remained an active topic in the community over the past decade. Thanks to continuous efforts in the field, many recommender systems have been developed and used in a variety of domains. A key question arising from this is how to assess the performance of recommender systems, so that the most suitable ones can be found and applied in certain contexts or domains.
Motivated by the importance of evaluating recommender systems, and by the emphasis on comprehensive metric considerations in evaluation experiments, the project aims to explore the most scientific possible method of evaluating recommender systems. The implementation of the project is presented in the form of a web application.
The objective of this project is to provide accurate movie recommendations to users. The goal is to improve the quality of the movie recommendation system (its accuracy, quality and scalability) over the pure approaches. This is done using a hybrid approach that combines content-based filtering and collaborative filtering. To reduce data overload, the recommendation system is used as an information filtering tool in social networking sites. Hence, there is huge scope for exploration in this field for improving the scalability, accuracy and quality of movie recommendation systems.
1.4 Methodology
The hybrid approach proposes an integrative method, merging the fuzzy k-means clustering method and a genetic-algorithm-based weighted similarity measure to construct a movie recommendation system. The proposed movie recommendation system gives finer similarity metrics and better quality than the existing movie recommendation system, but its computation time is higher than that of the existing system. This problem can be fixed by taking the clustered data points as the input dataset. The proposed approach thus aims to improve the scalability and quality of the movie recommendation system.
We use a hybrid approach, unifying content-based filtering and collaborative filtering, so that the two approaches can profit from each other. To compute the similarity between the different movies in the given dataset efficiently and in the least time, and to reduce the computation time of the movie recommender engine, we use the cosine similarity measure.
2. LITERATURE REVIEW
When building a recommender system from scratch, we face several different problems. Many current recommender systems are based on user information, so what should we do if the website has not yet acquired enough users? Beyond that, we must solve the representation of a movie, that is, how a system can understand a movie; this is the precondition for comparing the similarity between two movies. Movie features such as genre, actor and director are one way to categorize movies, but each feature should carry a different weight, and each plays a different role in recommendation. So we arrive at this question:
● What kinds of movie features can be used for the recommender system?
The goal of this project is to research recommender systems and find a suitable way to implement one for Vionel.com. There are many kinds of recommender systems, but not all of them are suitable for a specific problem and situation. Our goal is to find a new way to improve the classification of movies, which is a requirement for improving content-based recommender systems.
To achieve the goal of the project, the first step is sufficient background study, so a literature study will be conducted. The whole project is based on a large amount of movie data, so we choose a quantitative research method. For the philosophical assumption, positivism is selected because the project is experimental and of a testing character. The research approach is deductive, as the improvement of our research will be tested by deducing and testing a theory. Ex post facto research is our research strategy: the movie data is already collected and we do not change the independent variables. We use experiments to collect movie data. Computational mathematics is used for data analysis because the result is based on the improvement of an algorithm. For quality assurance, we give a detailed explanation of the algorithm to ensure test validity. Similar results will be generated when we run the same data multiple times, which covers reliability, and we ensure that the same data leads to the same result for different researchers.
This analysis mainly focuses on a machine learning example, a movie recommendation system, with an approach based on finding the similarity scores between two pieces of content in content-based filtering. The cosine similarity formula helps us find the angle between two vectors and, together with their magnitudes, their relative similarity scores. In this model, we consider two texts and plot them as points in a two-dimensional x-y plane.
In today's computing world we have lots of content on our Internet sources to watch, but not every item matches our tastes. We sometimes get feeds of videos, movies, news, clothing, etc. that do not match our liking and interests. This lowers the customer's interest in the application, and he or she does not want to go through the same application again. The need of the hour is to develop software that can tell, even at a basic level, the matching pattern of customer behavior and recommend the items that best fit the customer's interests. This helps make the customer experience satisfying and helps the application achieve good ratings and popularity.
Many recommendation systems can be seen today in our environment. On YouTube, for example, if I watch a lot of news regarding GK and current affairs, it offers me related videos accordingly. Such a system gains popularity through application ratings and at the same time enhances the customer experience. This recommendation policy is really helpful in giving optimum results for an application's profitability and in making the organisation more connected. We can also see recommendation at work in online food applications such as Zomato, Food Panda and Swiggy, which offer their customers restaurants that match their taste in food. They learn from the customer's previous orders and try to impress the customer with the latest add-ons of their favourite cuisines.
3. Types of Recommender System
3.1 Demographic Filtering
There are various types of recommender systems with different approaches and some of them are
classified as below:
1. Demographic Filtering: These systems offer generalized recommendations to every user, based on movie popularity and/or genre. The system recommends the same movies to users with similar demographic features. Since each user is different, this approach is considered too simple. The basic idea behind this system is that movies that are more popular and critically acclaimed will have a higher probability of being liked by the average audience. Before getting started with this:
● We need a metric to score or rate movies
● Calculate the score for every movie
● Sort the scores and recommend the best-rated movies to the users
We can use the average rating of the movie as the score, but this will not be fair enough, since a movie with an 8.9 average rating from only 3 votes cannot be considered better than a movie with a 7.8 average rating from 40 votes. So, we will use IMDB's weighted rating (WR), which is given as:
Weighted Rating (WR) = (v / (v + m)) · R + (m / (v + m)) · C
where,
● v is the number of votes for the movie;
● m is the minimum votes required to be listed in the chart;
● R is the average rating of the movie; and
● C is the mean vote across the whole report.
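As an illustration, here is a minimal sketch of this scoring in Python with pandas, assuming the TMDB 5000 movies file from the dataset referenced in this report (the column names 'vote_count', 'vote_average' and 'title', and the 90th-percentile cutoff for m, are assumptions, not fixed by the formula):

import pandas as pd

# Sketch: IMDB weighted rating over a TMDB-style movie table.
movies = pd.read_csv('tmdb_5000_movies.csv')

C = movies['vote_average'].mean()        # C: mean vote across the whole report
m = movies['vote_count'].quantile(0.90)  # m: minimum votes required to be listed

def weighted_rating(row, m=m, C=C):
    v, R = row['vote_count'], row['vote_average']
    return (v / (v + m)) * R + (m / (v + m)) * C

# Keep only qualified movies, score them, and recommend the best rated.
qualified = movies[movies['vote_count'] >= m].copy()
qualified['score'] = qualified.apply(weighted_rating, axis=1)
print(qualified.sort_values('score', ascending=False)[['title', 'score']].head(10))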
3.2 Content Based Filtering
In content-based filtering, items are recommended based on comparisons between an item profile and a user profile. A user profile is content found to be relevant to the user, in the form of keywords (or features). A user profile might be seen as a set of assigned keywords (terms, features) collected by an algorithm from items found relevant (or interesting) by the user. A set of keywords (or features) of an item is the item profile. For example, consider a scenario in which a person goes to a pastry shop to buy his favorite cake 'X'. Unfortunately, cake 'X' has been sold out, and as a result the shopkeeper recommends that the person buy cake 'Y', which is made of ingredients similar to cake 'X'. This is an instance of content-based filtering.
We will use cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. We use the cosine similarity score since it is independent of magnitude and is relatively easy and fast to calculate. Mathematically, it is defined as follows:
cos(A, B) = (A · B) / (||A|| ||B||) = Σᵢ AᵢBᵢ / ( √(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²) )
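A minimal NumPy sketch of this formula, using two hypothetical count vectors for illustration:

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

movie_1 = [2, 1, 0, 3]  # e.g. term counts drawn from one movie's overview
movie_2 = [1, 1, 1, 2]
print(cosine_similarity(movie_1, movie_2))  # in [0, 1] for non-negative vectors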
3.3 Credits, Genres and Keywords Based Recommender
It goes without saying that the quality of our recommender would increase with better metadata, and that is exactly what we do in this section. We build a recommender based on the following metadata: the 3 top actors, the director, related genres and the movie plot keywords. From the cast, crew and keywords features, we need to extract the three most important actors, the director and the keywords associated with each movie. Right now, our data is present in the form of "stringified" lists; we need to convert it into a safe and usable structure, as sketched below.
4. Collaborative Filtering Based Systems
Our content-based engine suffers from some severe limitations. It is only capable of suggesting movies which are close to a certain movie; that is, it is not capable of capturing tastes and providing recommendations across genres. Also, the engine we built is not really personal, in that it does not capture the personal tastes and biases of a user: anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who he or she is. Therefore, in this section we use a technique called collaborative filtering to make recommendations to movie watchers. It is basically of two types: user-based filtering and item-based filtering.
4.1 User Based Filtering
In user-based filtering, we measure the similarity between users from the ratings they gave to the same movies, and recommend to the target user what similar users liked. Consider a small user-item rating matrix:
[Table: a user-item rating matrix for users A-F; NA marks movies a user has not rated.]
Since users A and F do not share any movie ratings in common with user E, their similarities with user E are not defined by Pearson correlation. Therefore, we only need to consider users B, C and D. Based on Pearson correlation, we can compute the following similarities:
[Table: the same rating matrix extended with each user's Pearson similarity to user E, e.g. B: 0.87, D: -1; A and F: NA.]
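As a sketch of the computation, pandas can produce such a Pearson similarity table directly; the small rating matrix below is hypothetical, standing in for the table above:

import numpy as np
import pandas as pd

# Hypothetical user-item rating matrix (rows: users, columns: movies, NaN: unrated)
ratings = pd.DataFrame(
    {'Movie 1': [4.0, 5.0, np.nan],
     'Movie 2': [2.0, 4.0, 4.0],
     'Movie 3': [5.0, 1.0, 2.0]},
    index=['B', 'D', 'E'])

# DataFrame.corr() computes pairwise Pearson correlation between columns,
# skipping missing values, so we transpose to correlate users; user pairs
# with no co-rated movies come out as NaN, as noted above.
user_similarity = ratings.T.corr(method='pearson')
print(user_similarity.loc['E'])  # similarity of every user to user E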
Although computing user-based CF is very simple, it suffers from several problems. One main issue is that users' preferences can change over time, which means that precomputing the matrix based on neighbouring users may lead to bad performance. To tackle this problem, we can apply item-based CF.
4.2 Item Based Filtering
Instead of measuring the similarity between users, item-based CF recommends items based on their similarity with the items that the target user has rated. Likewise, the similarity can be computed with Pearson correlation or cosine similarity. The major difference is that, with item-based collaborative filtering, we fill in the blanks vertically, as opposed to the horizontal manner of user-based CF. The following table (omitted here) shows how to do so for the movie Me Before You.
Item-based CF successfully avoids the problem posed by dynamic user preferences, as items are more static. However, several problems remain for this method. First, the main issue is scalability: the computation grows with both the number of customers and the number of products, and the worst-case complexity is O(mn) with m users and n items. In addition, sparsity is another concern. Consider the rating table again: although only one user rated both The Matrix and Titanic, the similarity between them is 1. In extreme cases, we can have millions of users, and the similarity between two fairly different movies could be very high simply because they received similar ranks from the only user who ranked them both.
One way to handle the scalability and sparsity issues created by CF is to leverage a latent factor model to capture the similarity between users and items. Essentially, we want to turn the recommendation problem into an optimization problem: how good are we at predicting the rating of items for a given user? One common metric is Root Mean Square Error (RMSE); the lower the RMSE, the better the performance.
Now, talking about latent factors, you might wonder what they are. A latent factor is a broad idea describing a property or concept that a user or an item has. For instance, for music, a latent factor can refer to the genre the music belongs to. SVD decreases the dimension of the utility matrix by extracting its latent factors: essentially, we map each user and each item into a latent space of dimension r. This helps us better understand the relationship between users and items, as they become directly comparable.
[Figure: users and items mapped into a shared latent space of dimension r.]
Now, enough said; let's see how to implement this. Since the dataset we used before does not have a userId column (which is necessary for collaborative filtering), we load another dataset. We'll be using the Surprise library to implement SVD, as in the sketch below.
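A minimal sketch with the Surprise library, assuming a MovieLens-style ratings.csv with userId, movieId and rating columns (the filename and column names are assumptions):

import pandas as pd
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate

ratings = pd.read_csv('ratings.csv')
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

algo = SVD()  # matrix factorization into r latent factors
# Report RMSE over 5-fold cross-validation; lower is better.
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)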
Advantages of collaborative filtering based systems:
● It depends on the relations between users, which implies that it is content-independent.
● CF recommender systems can suggest serendipitous items by observing the behavior of similar-minded people.
● They can make real quality assessments of items by considering other people's experience.
Disadvantages of collaborative filtering are:
● Early rater problem: collaborative filtering systems cannot provide recommendations for new items, since there are no user ratings on which to base a prediction.
● Gray sheep: for a CF-based system to work, groups with similar characteristics are needed. Even if such groups exist, it is very difficult to make recommendations for users who do not consistently agree or disagree with these groups.
● Sparsity problem: in most cases, the number of items exceeds the number of users by a great margin, which makes it difficult to find items that are rated by enough people.
4.4 Comparison
Each approach has its advantages and disadvantages, and the effects differ across datasets. An approach may not be suitable for all kinds of problems because of the algorithm itself. For example, it is hard to apply automated feature extraction to media data with the content-based filtering method, and the recommendation results are limited to items similar to those the user has already chosen, so diversity is not good. It is also very hard to recommend anything to users who have never chosen anything. The collaborative filtering method overcomes these disadvantages to some extent, but CF relies on a large amount of historical data, so it has sparsity and cold-start problems. In terms of cold start, as collaborative filtering is based on the similarity between the items chosen by users, there is not only a new-user problem [30] but also a new-item problem: it is hard for a new item to be recommended if it has never been recommended before [1]. The comparison is in Table 2.3 [11].
5. FAMOUS RECOMMENDER SYSTEM
What is the difference between recommender system and search engine is that recommender system is
based on the behaviors of user. There are a lot of websites using recommender system in the world.
Personalized recommender system analyzes a huge amount of user behavior data and provides
personalized content to different users, which improves the click rate and conversions of the website. The
fields that widely use recommender system are e-commerce, movie, video, music, social network,
reading, local based service, personalized email and advertisement.
5.1 E-Commerce
The most famous e-commerce website, Amazon, is an active user and promoter of recommender systems. Amazon's recommender system reaches deep into all kinds of products; Fig 5.2.1 shows Amazon's personalized recommendation list. Apart from the personalized recommendation list, another important application of recommender systems is the relevant recommendation list shown when you buy something on Amazon. Amazon has two kinds of relevant recommendations: "customers who bought this item also bought" and "what other items do customers buy after viewing this item". The difference between the two recommendations is the calculation over different user behaviors. The most important application of relevant recommendation is cross-selling: when you are buying something, Amazon will tell you what other customers who bought this item also bought and let you decide whether to buy it at the same time. If you do, the goods will be packed together and a certain discount provided.
Recommendation algorithms
(1) Content-Based Recommendation
Advantages:
Results are intuitive and easy to interpret
No need for users' access history data
No new-item problem and no sparsity problem
Supported by the mature technology of classification learning
Disadvantages:
Limited by the feature extraction methods
New-user problem
The training of the classifier needs massive data
Poor scalability
(2) Collaborative Filtering
Advantages:
No need for professional knowledge
Performance improves as the number of users increases
Automatic
Easy to find a user's new points of interest
Complex unstructured items can be processed, e.g. music, video
Disadvantages:
Sparsity problem
Poor scalability
New-user and new-item problems
Recommendation quality is limited by the historical data set
Fig 5.2.1 Personalized recommendation of Amazon
Fig 5.2.2 Relevant Recommendation, Customers Who Bought This Item Also Bought
Amazon and Netflix are the two most representative companies in recommender systems. Below is the recommendation page of Netflix. We can see that each recommendation result consists of the following parts:
• The title and poster of the movie.
• The feedback of the user, including Play, Rating and Not Interested.
• The recommendation reason.
Fig 5.2.3 Relevant Recommendation, What Other Items Do Customers Buy After Viewing This Item
From the recommendation reasons shown by Netflix, it can be seen that its recommendation algorithm is similar to Amazon's. Netflix has declared that 60% of its users find movies they are interested in through the recommender system.
As the biggest video website in America, YouTube has a huge number of videos uploaded by users, so information overload is a serious problem for it. In a paper from YouTube, in order to prove the usefulness of a personalized recommender system, YouTube researchers conducted an experiment comparing the click-through rates of a personalized recommendation list and a popularity list. The result showed that the click-through rate of the personalized recommendation list was twice that of the popularity list.
The successful application of a personalized recommender system has two requirements. One is information overload, because if users can easily find what they like, there is no reason to use a recommender system. The second is that the user does not have clear requirements, because if they do, they will use a search engine directly. Under these two requirements, recommender systems are very suitable for personalized Internet radio. First of all, people cannot listen to all the music in the world and find which pieces they like. Secondly, people often do not want to listen to specific music; they wish to listen to whatever music matches their mood at that moment. There are many personalized Internet radios, such as Pandora and Last.fm.
6. DATA PREPROCESSING
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
Steps Involved in Data Preprocessing:
6.1 Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done; it involves handling missing data, noisy data, etc.
Missing Data:
Missing values can be handled by ignoring the affected tuples or by filling in the missing values manually, with the attribute mean, or with the most probable value.
Noisy Data:
Noisy data is meaningless data that machines cannot interpret. It can be handled in the following ways:
Binning:
This method smooths sorted data by dividing it into equal-sized segments and replacing the values in each segment with the segment mean or boundary values.
Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
Clustering:
This approach groups similar data into clusters. Outliers either fall outside the clusters or go undetected.
6.2 Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following:
Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression models.
Dimensionality Reduction:
This reduces the size of the data via encoding mechanisms. It can be lossy or lossless: if the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
In more detail, data transformation involves the following steps:
1. Smoothing:
It is a process used to remove noise from the dataset using certain algorithms. It highlights the important features present in the dataset and helps in predicting patterns. When collecting data, it can be manipulated to eliminate or reduce any variance or other forms of noise. The concept behind data smoothing is that it identifies simple changes that help predict different trends and patterns. This helps analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not otherwise see.
2. Aggregation:
Data aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple data sources and integrated for a data analysis description. This is a crucial step, since the accuracy of data analysis insights is highly dependent on the quantity and quality of the data used; gathering accurate data of high quality in large enough quantities is necessary to produce relevant results. Aggregated data is useful for everything from decisions concerning financing or the business strategy of a product to pricing, operations, and marketing strategies. For example, sales data may be aggregated to compute monthly and annual totals.
3. Discretization:
It is a process of transforming continuous data into a set of small intervals. Most data mining activities in the real world involve continuous attributes, yet many of the existing data mining frameworks are unable to handle these attributes. Also, even if a data mining task can manage a continuous attribute, its efficiency can be significantly improved by replacing the continuous attribute with its discretized values. For example, values can be grouped into intervals (1-10, 11-20), or ages into categories (young, middle age, senior).
4. Attribute Construction:
New attributes are created and applied to assist the mining process from the given set of attributes. This simplifies the original data and makes mining more efficient.
5. Generalization:
It converts low-level data attributes to high-level data attributes using a concept hierarchy. For example, ages in numerical form (22, 25) are converted into categorical values (young, old).
6. Normalization: Data normalization involves scaling all data variables into a given range.
Techniques used for normalization are:
Min-Max Normalization:
This transforms the original data linearly. Suppose min_A is the minimum and max_A is the maximum of an attribute A; a value v of A is normalized into the new range [new_min_A, new_max_A] by computing:
v' = ((v - min_A) / (max_A - min_A)) · (new_max_A - new_min_A) + new_min_A
Z-Score Normalization:
In z-score normalization (or zero-mean normalization), the values of an attribute A are normalized based on the mean of A and its standard deviation. A value v of attribute A is normalized to v' by computing:
v' = (v - mean_A) / σ_A
Decimal Scaling:
It normalizes the values of an attribute by shifting their decimal point. The number of places the decimal point is moved is determined by the maximum absolute value of attribute A. A value v of attribute A is normalized to v' by computing:
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
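A short NumPy sketch of the three techniques above on a toy attribute (the values are hypothetical):

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # hypothetical attribute values

# Min-max normalization into the new range [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: subtract the mean, divide by the standard deviation
z_score = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
# (this j computation assumes values >= 1)
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10 ** j  # here j = 4, so 1000 -> 0.1

print(min_max)
print(z_score)
print(decimal)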
6.3 Data Reduction:
Data reduction techniques shrink the volume of the data while preserving its analytical value. One important technique is attribute subset selection, which can proceed in the following ways:
Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step we add the best of the remaining original attributes to the set, based on their relevance (known as the p-value in statistics).
Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data and at each step eliminates the worst remaining attribute in the set.
Combination of Forward and Backward Selection –
It allows us to remove the worst and select the best attributes, saving time and making the process faster.
3. Data Compression:
The data compression technique reduces the size of files using different encoding mechanisms (Huffman encoding and run-length encoding). We can divide it into two types based on the compression technique:
Lossless Compression –
Encoding techniques (such as run-length encoding) allow a simple and minimal reduction in data size. Lossless data compression uses algorithms to restore the precise original data from the compressed data.
Lossy Compression –
Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this kind of compression. For example, the JPEG image format uses lossy compression, but we can find content equivalent in meaning to the original image. In lossy compression, the decompressed data may differ from the original data but is still useful enough to retrieve information.
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a smaller representation of the data, so that only the model parameters need to be stored; alternatively, non-parametric methods such as clustering, histograms and sampling can be used.
5. Discretization & Concept Hierarchy Operation:
Techniques of data discretization are used to divide the attributes of the continuous nature into data with
intervals. We replace many constant values of the attributes by labels of small intervals. This means that
mining results are shown in a concise, and easily understandable way.
Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole set of attribute values, and repeat this method until the end, the process is known as top-down discretization, also known as splitting.
Bottom-up discretization –
If you first consider all the constant values as split points and discard some through a combination of neighbouring values in an interval, that process is called bottom-up discretization.
Concept Hierarchies:
They reduce the data size by collecting and then replacing low-level concepts (such as an age of 43) with high-level concepts (categorical variables such as middle age or senior).
For numeric data, the following techniques can be used:
Binning –
Binning is the process of changing numerical variables into categorical counterparts; the number of categorical counterparts depends on the number of bins specified by the user.
Histogram analysis –
Like binning, the histogram is used to partition the values of an attribute X into disjoint ranges called buckets. There are several partitioning rules:
Equal-frequency partitioning: partitioning the values based on their number of occurrences in the data set.
Equal-width partitioning: partitioning the values into bins of a fixed width, e.g. a set of values ranging from 0-20 per bin.
Clustering: grouping similar data together.
For example, step-wise forward selection over attributes {X1, ..., X6} might proceed as:
Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step 1: {X1}
Step 2: {X1, X2}
Step 3: {X1, X2, X5}
7. DATA NORMALIZATION
Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms.
Need for Normalization –
Normalization is generally required when we are dealing with attributes on different scales; otherwise it may dilute the effectiveness of an equally important attribute (on a lower scale) because another attribute has values on a larger scale. In simple words, when multiple attributes exist but have values on different scales, this may lead to poor data models when performing data mining operations, so the attributes are normalized to bring them all onto the same scale.
For example, decimal scaling (see above) normalizes each value as v'ᵢ = vᵢ / 10^j.
Min-Max Normalization –
In this technique of data normalization, a linear transformation is performed on the original data: the minimum and maximum values of the attribute are fetched and each value v is replaced by v' according to
v' = (v - min_A) / (max_A - min_A).
In z-score normalization, each value is instead normalized as v' = (v - Ā) / σ_A, where Ā and σ_A are the mean and standard deviation of attribute A respectively.
8. VECTORIZATION
Recommending movies to users can be done in multiple ways, using content-based filtering and collaborative filtering approaches. The content-based filtering approach primarily focuses on item similarity, i.e. the similarity between movies, whereas collaborative filtering focuses on drawing a relation between different users with similar movie-watching choices. Based on the plot of a movie that the user watched in the past, movies with a similar plot can be recommended to the user; this approach falls under content-based filtering, as the recommendations are based only on the user's past activity.
Vectorization is a technique for implementing array operations without using for loops. Instead, we use functions defined by various modules which are highly optimized, reducing the running and execution time of the code. Vectorized array operations are faster than their pure-Python equivalents, with the biggest impact in numerical computations. Python for-loops are slower than their C/C++ counterparts because Python is an interpreted language and much of its implementation is slow; the main reasons for this slow computation are the dynamic nature of Python and the lack of compiler-level optimizations, which incur memory overheads. NumPy, being a C implementation of arrays for Python, provides vectorized operations on NumPy arrays.
Vectorized Operations using NumPy
1. Add/Subtract/Multiply/Divide by a Scalar
Addition, subtraction, multiplication, and division of an array by a scalar quantity result in an array of the same dimensions, with every element updated by the given scalar. We apply these operations just as we do with variables; the code is both shorter and faster than a for-loop implementation. To measure execution time, we use the Timer class in the timeit module, which takes the statement to execute, and then call its timeit() method with the number of times to repeat the statement. Note that the measured computation time is not always exactly the same; it depends on the hardware and other factors.
2. Sum and Max of an Array
To find the sum and the maximum element of an array, we can use a for loop or the Python built-in functions sum() and max() respectively. Let's compare both of these ways with the NumPy operations.
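A small sketch of that comparison using timeit (the exact timings will vary by machine):

import timeit
import numpy as np

arr = np.random.rand(1_000_000)

# Python built-in sum() iterating element by element vs NumPy's vectorized sum
loop_time = timeit.timeit(lambda: sum(arr), number=10)
vec_time = timeit.timeit(lambda: np.sum(arr), number=10)

print(f'built-in sum: {loop_time:.4f}s  np.sum: {vec_time:.4f}s')
print('max of array:', np.max(arr))  # vectorized max, same idea as max()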
9. TOKENIZATION
To process text such as movie overviews we use NLTK (the Natural Language Toolkit). First, install it:
sudo pip install nltk
Then enter the Python shell in your terminal by simply typing python, and type:
import nltk
nltk.download('all')
The above download will take quite some time due to the massive number of tokenizers, chunkers, other algorithms, and all of the corpora to be downloaded.
Some terms that will be used frequently are:
Corpus – a body of text, singular; corpora is the plural.
Lexicon – words and their meanings.
Token – each "entity" that is a part of whatever was split up based on rules. For example, each word is a token when a sentence is "tokenized" into words, and each sentence can be a token if you tokenize the sentences out of a paragraph.
So basically tokenizing involves splitting sentences and words from the body of the text.
10. STOP WORDS
The process of converting data into something a computer can understand is referred to as pre-processing. One of the major forms of pre-processing is filtering out useless data. In natural language processing, useless words (data) are referred to as stop words.
What are Stop words?
Stop Words: A stop word is a commonly used word (such as "the", "a", "an", "in") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. We do not want these words to take up space in our database or valuable processing time, so we can remove them easily by storing a list of words that we consider stop words. NLTK (Natural Language Toolkit) in Python has lists of stopwords stored for 16 different languages. You can find them in the nltk_data directory; home/pratima/nltk_data/corpora/stopwords is the directory address (do not forget to change the home directory name to your own).
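A minimal sketch of stop word removal with NLTK, on a hypothetical sentence:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stop word lists for 16 languages

text = "This is a sample movie overview about a young wizard."
stop_words = set(stopwords.words('english'))

tokens = word_tokenize(text)                                   # split into word tokens
filtered = [w for w in tokens if w.lower() not in stop_words]  # drop stop words
print(filtered)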
11. LEMMATIZATION
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Lemmatization is similar to stemming, but it brings context to the words, linking words with similar meanings to one word.
Text preprocessing includes both stemming and lemmatization. Many people find these two terms confusing, and some treat them as the same; actually, lemmatization is preferred over stemming because lemmatization performs a morphological analysis of the words. One major difference from stemming is that lemmatization takes a part-of-speech parameter, "pos"; if it is not supplied, the default is "noun".
Stemming is the process of reducing words to their morphological root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. A stemming algorithm reduces the words "chocolates", "chocolatey" and "choco" to the root word "chocolate", and "retrieval", "retrieved" and "retrieves" to the stem "retrieve".
Errors in Stemming:
There are mainly two errors in stemming: over-stemming and under-stemming. Over-stemming occurs when two words with different stems are stemmed to the same root. Under-stemming occurs when two words that should be stemmed to the same root are not.
Applications of stemming are:
Stemming is used in information retrieval systems like search engines.
It is used to determine domain vocabularies in domain analysis.
Stemming is desirable because it reduces redundancy, as most of the time a word stem and its inflected or derived words mean the same thing.
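A short sketch contrasting NLTK's PorterStemmer with its WordNet lemmatizer on a few example words:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # lexical database used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['chocolates', 'retrieved', 'studies']:
    # stem() chops suffixes; lemmatize() maps to a dictionary form (noun by default)
    print(word, '->', stemmer.stem(word), '/', lemmatizer.lemmatize(word))

# Supplying the "pos" parameter changes the result for non-nouns:
print(lemmatizer.lemmatize('retrieved', pos='v'))  # 'retrieve' as a verb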
13. CONCLUSION
In this project we have implemented and learned the following:
• Building a movie recommendation system
• Finding similarity scores and indexes
• Computing the distance between two vectors
• Cosine similarity
• Many more ML-related concepts and techniques
Research paper recommender systems help library users find the most relevant research papers within a large volume of papers in a digital library. This work adopted the content-based filtering technique to provide recommendations to the intended users. Based on the results of the system, integrating recommendation features into digital libraries would be useful to library users. The solution to this problem came as a result of the availability of content describing the items and of users' profiles of interest. Content-based techniques are independent of users' ratings but depend on this content. This work also presents an algorithm to provide or suggest recommendations based on a user's query; the algorithm employs the cosine similarity measure.
The next step of our future work is to adopt a hybrid algorithm, to see how the combination of collaborative and content-based filtering techniques can give us better recommendations compared to the technique adopted here. The content-based technique is adopted here for the design of a recommender system for digital libraries; it is suitable in situations or domains where there are more items than users. Library users experience difficulties in finding favourite digital objects (e.g. research papers) in the large collections of digital libraries.
14. REFERENCES
1. https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
2. Peng, Xiao, Shao Liangshan, and Li Xiuran. "Improved Collaborative Filtering Algorithm in the Research and Application of Personalized Movie Recommendations." 2013 Fourth International Conference on Intelligent Systems Design and Engineering Applications, 2013.
7. Gediminas Adomavicius and Alexander Tuzhilin. "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions." IEEE Transactions on Knowledge and Data Engineering, 17(6):734-749, 2005.
8. Zhi-Dan Zhao and Ming-Sheng Shang. "User-Based Collaborative-Filtering Recommendation Algorithms on Hadoop." In Knowledge Discovery and Data Mining, 2010 (WKDD '10), Third International Conference on, pages 478-481. IEEE, 2010.
15. APPENDICES
The datasets contain movies and credits files. We are going to merge movies and credits into a single DataFrame.
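A minimal sketch of this merge with pandas, assuming the two Kaggle TMDB 5000 files referenced in this report:

import pandas as pd

movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

# Merge the two files on the shared movie title into a single DataFrame
movies = movies.merge(credits, on='title')
print(movies.shape)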
The dataset contains overview, genres, keywords, cast and crew, which we combine in the form of tags. In the implementation below we convert the stringified tags into lists.
Here we are going to keep the genre names (in the format Action, Adventure, Fantasy, ScienceFiction) and likewise change the stringified keywords into lists.
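A sketch of that conversion, continuing from the merged DataFrame above; the convert helper is a hypothetical name, and collapsing multi-word names ("Science Fiction" becomes "ScienceFiction") keeps each tag a single token:

import ast

def convert(text):
    # Parse the stringified list and keep each entry's name, without spaces
    return [d['name'].replace(' ', '') for d in ast.literal_eval(text)]

movies['genres'] = movies['genres'].apply(convert)
movies['keywords'] = movies['keywords'].apply(convert)
print(movies[['title', 'genres', 'keywords']].head())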
Vectorization and stemming
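A sketch of both steps, assuming a 'tags' column (an assumption: a lowercase string joining overview, genres, keywords, cast and crew) built on the merged DataFrame above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem(text):
    # Reduce every word in the tag string to its stem
    return ' '.join(ps.stem(word) for word in text.split())

movies['tags'] = movies['tags'].apply(stem)

# Bag-of-words vectors over the combined tags (vocabulary capped at 5000 terms)
cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(movies['tags']).toarray()

# Pairwise cosine similarity between every pair of movies
similarity = cosine_similarity(vectors)

def recommend(title, n=5):
    idx = movies[movies['title'] == title].index[0]
    # Rank all movies by similarity to the query, skipping the movie itself
    scores = sorted(enumerate(similarity[idx]), key=lambda x: x[1], reverse=True)
    return [movies.iloc[i]['title'] for i, _ in scores[1:n + 1]]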
Creating a website using PyCharm
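The report does not name the web framework used, so purely as an illustration, here is a minimal front-end sketch in Streamlit, assuming the DataFrame and similarity matrix were pickled by the code above (the .pkl filenames are hypothetical):

import pickle
import streamlit as st

# Hypothetical artifacts saved beforehand by the notebook code sketched above
movies = pickle.load(open('movies.pkl', 'rb'))
similarity = pickle.load(open('similarity.pkl', 'rb'))

st.title('Movie Recommender System')
selected = st.selectbox('Pick a movie', movies['title'].values)

if st.button('Recommend'):
    idx = movies[movies['title'] == selected].index[0]
    scores = sorted(enumerate(similarity[idx]), key=lambda x: x[1], reverse=True)
    for i, _ in scores[1:6]:
        st.write(movies.iloc[i]['title'])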