Movie Recommender System (Final Report)


MOVIE RECOMMENDER SYSTEM USING

MACHINE LEARNING
By

M. Harika Reddy (19STUCHH010116)
Pavan Katne (19STUCHH010178)
S. Vinay Kumar (19STUCHH010289)

Under the Guidance of Dr. P. Pavan Kumar

The ICFAI Foundation for Higher Education


Faculty of Science & Technology
(a Deemed University under Section 3 of UGC Act, 1956), Donthanapally,
Shankarapalli Road, Hyderabad – 501203
Acknowledgements

We would like to express our gratitude to our director, Dr. K. L. Narayana, of the Faculty of
Science and Technology (ICFAI Foundation for Higher Education, Hyderabad) for giving us this
opportunity.
We are very thankful to Dr. P. Pavan Kumar for his constant supervision, guidance, and cooperation
throughout the project. His useful suggestions for this work and cooperative behavior are
sincerely acknowledged.
We would like to extend our sincere thanks to all of them who helped us to complete this project.
ABSTRACT

A recommendation engine filters data using different algorithms and recommends the most relevant
items to users. It first captures the past behavior of a customer and, based on that, recommends products
the user is likely to buy.
If a completely new user visits an e-commerce site, that site will not have any past history of that user. So
how does the site go about recommending products to the user in such a scenario? One possible solution
could be to recommend the best selling products, i.e. the products which are high in demand. Another
possible solution could be to recommend the products which would bring the maximum profit to the
business.

Three main approaches are used in our recommender system. The first is demographic filtering: it
offers generalized recommendations to every user, based on movie popularity and/or genre. The system
recommends the same movies to users with similar demographic features. Since each user is different,
this approach is considered too simple.
The basic idea behind this system is that movies that are more popular and critically acclaimed will have
a higher probability of being liked by the average audience. Second is content-based filtering, where we
try to profile the user's interests using information collected, and recommend items based on that profile.
The other is collaborative filtering, where we try to group similar users together and use information
about the group to make recommendations to the user.

In this bustling world, entertainment is a necessity for each of us to refresh our mood and energy.
Entertainment restores our confidence for work so that we can work more enthusiastically. To revitalize
ourselves, we can listen to our preferred music or watch movies of our choice. For watching
favourable movies online we can utilize movie recommendation systems, which are more reliable, since
manually searching for preferred movies requires time that one cannot afford to waste. In this
report, to improve the quality of a movie recommendation system, a content-based filtering approach is
presented in the proposed methodology, and comparative results are shown which depict that the proposed
approach improves the accuracy, quality and scalability of the movie recommendation system over the
pure approaches on three different datasets.
TABLE OF CONTENTS
1. Introduction………………………………………………………………………..1

1.1 Motivation.............................................................................................................................1

1.2 Project Scope.......................................................................................................................2

1.3 Objectives and Implementation...........................................................................................3

1.4 Methodology..........................................................................................................................3

2. Literature Review…………………………………………………………………………………4

3. Types of Recommender System....................................................................................................6

3.1 Demographic Filtering.........................................................................................................6

3.2 Content Based Filtering.......................................................................................................7

3.3 Credits, Genres and Keywords Based Recommender.......................................9

4. Collaborative Filtering based system.....................................................................................…11

4.1 User Based Filtering…................................................................................................................11

4.2 Item Based filtering….................................................................................................................13

4.3 Singular Value Decomposition.....................................................................15

4.4 Comparison………………………………………………………………………………………15

5. Famous Recommender Systems………………………………………………………………..16

5.1 E-Commerce………………………………………………………………………………………16

5.2 Movie and Video Website…………………………………………………………………………16

5.3 Internet Radio……………………………………………………………………………………..20

6. Data Preprocessing………………………………………………………………………………..21

6.1 Data cleaning………………………………………………………………………………………21


6.2 Data Transformation………………………………………………………………………………23

6.3 Data Reduction……………………………………………………………………………………..25

7. Data Normalization……………………………………………………………...............................28

8. Vectorization………………………………………………………………………………………..30

9. Tokenize text Using NLTK…………………………………………………………………………31

10. Removing Stop Words using NLTK……………………………………………………………….32

11. Lemmatization……………………………………………………………………………………….33

12. Stemming Words…………………………………………………………………………………….34

13. Conclusion……………………………………………………………………………………………35

14. References……………………………………………………………………………………………36

15. Appendices……………………………………………………………………………………………37
1. INTRODUCTION

A recommendation system, or recommendation engine, is a model used for information filtering: it
tries to predict the preferences of a user and provide suggestions based on these preferences. These systems
have become increasingly popular and are widely used today in areas such as movies, music,
books, videos, clothing, restaurants, food, places and other utilities. These systems collect information
about a user's preferences and behaviour, and then use this information to improve their suggestions in
the future.

Movies are part and parcel of life. There are different types of movies: some for entertainment,
some for educational purposes, some animated movies for children, and some horror movies or
action films. Movies can be easily differentiated by genre, such as comedy, thriller, animation or
action. Another way to distinguish among movies is by release year, language, director,
and so on. When watching movies online, there are a huge number of movies to search through to find
the ones we like most. Movie recommendation systems help us find our preferred movies among all of
these different types, and hence reduce the trouble of spending a lot of time searching for favourable
movies. This requires that the movie recommendation system be very reliable and recommend the
movies that most closely match our preferences.

Recommendation systems enrich a user's shopping experience. They have several benefits, the most
important being customer satisfaction and revenue. A movie recommendation system is a very powerful
and important system. However, due to the problems associated with the pure collaborative approach,
movie recommendation systems also suffer from poor recommendation quality and scalability issues.

1.1 Motivation

Nowadays, a recommender system can be found in almost every information-intensive website. For
example, a list of likely preferred products is recommended to a customer browsing a product on
Amazon. Moreover, when watching a video clip on YouTube, a recommender system suggests relevant
videos to users by learning from the users' previously generated behaviour. So to speak, recommender
systems have deeply changed the way we obtain information. Recommender systems not only make it
easier and more convenient for people to receive information, but also provide great potential for
economic growth. As more and more people realise the importance and power of recommender systems,
the exploration of designing high-quality recommender systems has remained an active topic in the
community over the past decade. Thanks to continuous effort in the field, many recommender systems
have been developed and used in a variety of domains. A key question arising from this is how to
measure the performance of recommender systems, so that the most suitable ones can be found for
certain contexts or domains.

Motivated by the importance of evaluating recommender systems, and by the emphasis on comprehensive
metric considerations in evaluation experiments, this project aims to explore a scientific method for
evaluating recommender systems. The implementation of the project is presented in the form of a web
application.

1.2 Project Scope

The objective of this project is to provide accurate movie recommendations to users. The goal of the
project is to improve the quality of the movie recommendation system, i.e. the accuracy, quality and
scalability of the system, over the pure approaches. This is done using a hybrid approach that combines
content-based filtering and collaborative filtering. To reduce data overload, the recommendation system
is used as an information-filtering tool in social networking sites. Hence, there is huge scope for
exploration in this field for improving the scalability, accuracy and quality of movie recommendation
systems. A movie recommendation system is a very powerful and important system, but, due to the
problems associated with the pure collaborative approach, movie recommendation systems also suffer
from poor recommendation quality and scalability issues.

1.3 Objectives and Implementation

● Improving the accuracy of the recommendation system

● Improving the quality of the movie recommendation system

● Improving the scalability

● Enhancing the user experience

1.4 Methodology

The hybrid approach proposes an integrative method that merges the fuzzy k-means clustering method
and a genetic-algorithm-based weighted similarity measure to construct a movie recommendation
system. The proposed movie recommendation system gives finer similarity metrics and better quality
than the existing movie recommendation system, but it takes more computation time than the existing
system. This problem can be fixed by taking the clustered data points as the input dataset. The proposed
approach aims to improve the scalability and quality of the movie recommendation system.
We use a hybrid approach, unifying content-based filtering and collaborative filtering, so that the
approaches can profit from each other. For computing the similarity between the different movies in the
given dataset efficiently, in the least time, and to reduce the computation time of the movie recommender
engine, we use the cosine similarity measure.

2. LITERATURE REVIEW

In building a recommender system from scratch, we face several different problems. Many current
recommender systems are based on user information, so what should we do if the website has not yet
acquired enough users? After that, we must solve the representation of a movie, i.e. how a system can
understand a movie. That is the precondition for comparing the similarity between two movies. Movie
features such as genre, actor and director are one way to categorize movies. But each feature of a movie
should carry a different weight, and each plays a different role in recommendation. So we arrive at these
questions:

• How do we recommend movies when there is no user information?

• What kinds of movie features can be used for the recommender system?

• How do we calculate the similarity between two movies?

• Is it possible to set a weight for each feature?

The goal of this thesis project is to research recommender systems and find a suitable way to
implement one for Vionel.com. There are many kinds of recommender systems, but not all of them are
suitable for a specific problem and situation. Our goal is to find a new way to improve the classification
of movies, which is a requirement for improving content-based recommender systems.
In order to achieve the goal of the project, the first step is sufficient background study, so a literature
study will be conducted. The whole project is based on a large amount of movie data, so we choose a
quantitative research method. For the philosophical assumption, positivism is selected because the
project has an experimental and testing character. The research approach is deductive, as the
improvement in our research will be tested by deducing and testing a theory. Ex post facto research is
our research strategy: the movie data is already collected and we do not change the independent
variables. We use experiments to collect movie data. Computational mathematics is used for data
analysis because the result is based on the improvement of an algorithm. For quality assurance, we
provide a detailed explanation of the algorithm to ensure test validity. Similar results will be generated
when we run the same data multiple times, which ensures reliability. We ensure that the same data leads
to the same result for different researchers.
This analysis mainly focuses on a machine learning example, a movie recommendation system, with an
approach of finding similarity scores between two items of content in content-based filtering. The cosine
similarity formula helps us find the distance between two vectors, the angle between them, and the
magnitude of their relative scores. In this model, we have considered two texts and plotted them as
points in a two-dimensional x-y plane.
In today's computer world we have lots of content available from our Internet sources, but not every
item matches our liking. We sometimes get feeds of videos, movies, news, clothing, etc. that are not
according to our liking and interest. This lowers the customer's interest in the application, and he or she
no longer wants to go through the same application again. The need of the hour is to develop code
which can tell, at a basic level, the matching pattern of the customer's trend and recommend the best
items at his level of interest. This will help us make the customer experience satisfactory and achieve
good ratings and popularity as well.
We see many recommendation systems in our environment today. On YouTube, for example, if I watch
lots of news regarding GK and current affairs, it offers me related videos from different channels. It
gains popularity through application ratings and at the same time enhances the customer experience.
This policy of recommendation systems is really helpful in giving optimum results for an application's
profitability and in making the organisation more connected. We can also see recommendation at work
in online food applications such as Zomato, Food Panda and Swiggy, which offer their customers
restaurants that supply food to their taste. They learn from the behaviour of the customer in previous
orders and try to impress them with the latest add-ons of their favourite cuisines.

3. Types of Recommender System

3.1 Demographic Filtering

There are various types of recommender systems with different approaches, and some of them are
classified as below:
1. Demographic Filtering: These offer generalized recommendations to every user, based on movie
popularity and/or genre. The system recommends the same movies to users with similar demographic
features. Since each user is different, this approach is considered too simple. The basic idea behind
this system is that movies that are more popular and critically acclaimed will have a higher probability of
being liked by the average audience. Before getting started with this:
● We need a metric to score or rate the movies
● Calculate the score for every movie
● Sort the scores and recommend the best-rated movies to the users
We could use the average rating of a movie as the score, but this won't be fair enough, since a movie
with an 8.9 average rating from only 3 votes cannot be considered better than a movie with a 7.8 average
rating from 40 votes. So, we'll use IMDB's weighted rating (WR), which is given as:

Weighted Rating (WR) = (v / (v + m)) · R + (m / (v + m)) · C

where,
● v is the number of votes for the movie;
● m is the minimum votes required to be listed in the chart;
● R is the average rating of the movie; and
● C is the mean vote across the whole report.
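As a small sketch of this scoring step (the movie data, the m cutoff of 30 votes, and the field names below are illustrative assumptions, not the report's actual dataset):

```python
# Toy movie list; titles, ratings and vote counts are made-up values.
movies = [
    {"title": "Movie A", "vote_average": 8.9, "vote_count": 3},
    {"title": "Movie B", "vote_average": 7.8, "vote_count": 40},
    {"title": "Movie C", "vote_average": 5.0, "vote_count": 50},
]

# C is the mean vote across the whole report; m is an assumed vote cutoff.
C = sum(mv["vote_average"] for mv in movies) / len(movies)
m = 30

def weighted_rating(v, R):
    """IMDB-style weighted rating: (v/(v+m))*R + (m/(v+m))*C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

for mv in movies:
    mv["score"] = weighted_rating(mv["vote_count"], mv["vote_average"])

# Rank by the weighted score instead of the raw average.
ranked = sorted(movies, key=lambda mv: mv["score"], reverse=True)
```

Sorting by this score rather than the raw average lets the 7.8-rated movie with 40 votes outrank the 8.9-rated movie that has only 3 votes.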

3.2 Content Based Filtering

In content-based filtering, items are recommended based on comparisons between the item profile and
the user profile. A user profile is content found to be relevant to the user, in the form of keywords (or
features). A user profile might be seen as a set of assigned keywords (terms, features) collected by an
algorithm from items found relevant (or interesting) by the user. A set of keywords (or features) of an
item is the item profile. For example, consider a scenario in which a person goes to a pastry shop to buy
his favorite cake 'X'. Unfortunately, cake 'X' has been sold out, and as a result the shopkeeper
recommends that the person buy cake 'Y', which is made of ingredients similar to cake 'X'. This is an
instance of content-based filtering.

Fig 3.2.1 Content-based filtering architecture

We will be using cosine similarity to calculate a numeric quantity that denotes the similarity
between two movies. We use the cosine similarity score since it is independent of magnitude and is
relatively easy and fast to calculate. Mathematically, it is defined as follows:

similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖) = Σᵢ₌₁ⁿ AᵢBᵢ / ( √(Σᵢ₌₁ⁿ Aᵢ²) · √(Σᵢ₌₁ⁿ Bᵢ²) )

We are now in a good position to define our recommendation function. These are the steps we'll follow:
● Get the index of the movie given its title.
● Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list
of tuples where the first element is its position and the second is the similarity score.
● Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.
● Get the top 10 elements of this list. Ignore the first element, as it refers to the movie itself (the movie
most similar to a particular movie is the movie itself).
● Return the titles corresponding to the indices of the top elements.
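The steps above can be sketched in a few lines. The movie overviews here are made-up stand-ins, and a real system would vectorize the actual plot descriptions (e.g. with TF-IDF) before computing cosine similarity:

```python
import math
from collections import Counter

# Made-up plot keywords standing in for real overviews (assumption).
overviews = {
    "The Dark Knight Rises": "batman bane gotham crime hero",
    "Batman Begins": "batman gotham crime fear training",
    "Titanic": "ship ocean love disaster",
}

# Bag-of-words vector for each movie.
vectors = {title: Counter(text.split()) for title, text in overviews.items()}

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def get_recommendations(title, top_n=10):
    # Score every movie against the query, as (title, score) pairs.
    scores = [(other, cosine(vectors[title], vectors[other]))
              for other in vectors]
    # Sort by the similarity score (the second element of each pair).
    scores.sort(key=lambda pair: pair[1], reverse=True)
    # Drop the query movie itself and return the top titles.
    return [other for other, _ in scores if other != title][:top_n]
```

On this toy data, `get_recommendations("The Dark Knight Rises")` puts "Batman Begins" first, since the two overviews share the most words.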
While our system has done a decent job of finding movies with similar plot descriptions, the quality of
recommendations is not that great. "The Dark Knight Rises" returns all Batman movies while it is more
likely that the people who liked that movie are more inclined to enjoy other Christopher Nolan movies.
This is something that cannot be captured by the present system.

3.3 Credits, Genres and Keywords Based Recommender

It goes without saying that the quality of our recommender would be increased with the usage of better
metadata. That is exactly what we are going to do in this section. We are going to build a recommender
based on the following metadata: the top 3 actors, the director, related genres, and the movie plot
keywords. From the cast, crew and keywords features, we need to extract the three most important
actors, the director and the keywords associated with each movie. Right now, our data is present in the
form of "stringified" lists; we need to convert it into a safe and usable structure.

Advantages of content-based filtering are:

● They are capable of recommending unrated items.
● We can easily explain the workings of the recommender system by listing the content features of an
item.
● Content-based recommender systems need only the ratings of the concerned user, and not those of
any other user of the system.
Disadvantages of content-based filtering are:
● It does not work for a new user who has not rated any items yet, since enough ratings are required
before the content-based recommender can evaluate the user's preferences and provide accurate
recommendations.
● No recommendation of serendipitous items.
● Limited content analysis: the recommender does not work if the system fails to distinguish the items
that a user likes from the items that he does not like.

4. Collaborative Filtering Based Systems

Our content based engine suffers from some severe limitations. It is only capable of suggesting movies
which are close to a certain movie. That is, it is not capable of capturing tastes and providing
recommendations across genres. Also, the engine that we built is not really personal in that it doesn't
capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based
on a movie will receive the same recommendations for that movie, regardless of who she/he is.
Therefore, in this section, we will use a technique called Collaborative Filtering to make
recommendations to Movie Watchers. It is basically of two types:-

4.1 User-Based Filtering


These systems recommend products to a user that similar users have liked. To measure the similarity
between two users we can use either Pearson correlation or cosine similarity. This filtering technique
can be illustrated with an example. In the following matrices, each row represents a user, while the
columns correspond to different movies, except the last one, which records the similarity between that
user and the target user. Each cell represents the rating that the user gives to that movie. Assume user E
is the target.

            The Avengers  Sherlock  Transformers  Matrix  Titanic  Me Before You  Similarity
A                2                       2           4        5                       NA
B                             5                               4          1
C                             5                               2
D                             1          5                    4
E (target)                    4                               2          1
F                4                                   5                   1            NA
Since users A and F share at most one movie rating in common with user E, their Pearson correlations
with E are undefined. Therefore, we only need to consider users B, C and D. Based on Pearson
correlation, we can compute the following similarities:

            The Avengers  Sherlock  Transformers  Matrix  Titanic  Me Before You  Similarity
A                2                       2           4        5                       NA
B                             5                               4          1           0.87
C                             5                               2                       1
D                             1          5                    4                      -1
E (target)                    4                               2          1
F                4                                   5                   1            NA

Although computing user-based CF is very simple, it suffers from several problems. One main issue is
that users' preferences can change over time. This means that precomputing the matrix based on
neighbouring users may lead to bad performance. To tackle this problem, we can apply item-based CF.
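The Pearson computation behind the similarity column can be sketched as follows; the assignment of ratings to particular movies is an assumption about the table's layout:

```python
from math import sqrt

# Co-rated movies for users D and E from the example (layout assumed).
ratings = {
    "D": {"Sherlock": 1, "Transformers": 5, "Titanic": 4},
    "E": {"Sherlock": 4, "Titanic": 2, "Me Before You": 1},
}

def pearson(u, v):
    """Pearson correlation over co-rated items; None if fewer than two."""
    common = sorted(set(u) & set(v))
    if len(common) < 2:
        return None  # undefined, as for users A and F versus E
    xs = [u[movie] for movie in common]
    ys = [v[movie] for movie in common]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

sim_d_e = pearson(ratings["D"], ratings["E"])
```

Users D and E rate their two shared movies in opposite directions, so the correlation comes out as -1, matching the table.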

4.2 Item-Based Filtering

Instead of measuring the similarity between users, item-based CF recommends items based on their
similarity with the items that the target user has rated. Likewise, the similarity can be computed with
Pearson correlation or cosine similarity. The major difference is that, with item-based collaborative
filtering, we fill in the blanks vertically, as opposed to the horizontal manner of user-based CF. The
following table shows how to do so for the movie Me Before You:

            The Avengers  Sherlock  Transformers  Matrix  Titanic  Me Before You
A                2                       2           4        5        2.94*
B                             5                               4        1
C                             5                               2        2.48*
D                             1          5                    4
E                             4                               2
F                4                                   5        1        1.12*
Similarity      -1           -1        0.86          1        1

(* = rating predicted by item-based CF)

It successfully avoids the problem posed by dynamic user preferences, since item-based CF is more
static. However, several problems remain for this method. First, the main issue is scalability: the
computation grows with both the number of customers and the number of products, and the worst-case
complexity is O(mn) with m users and n items. In addition, sparsity is another concern. Look at the
above table again: although only one user rated both Matrix and Titanic, the similarity between them
is 1. In extreme cases, we can have millions of users, and the similarity between two fairly different
movies could be very high simply because they were ranked similarly by the only user who ranked them
both.
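The vertical fill-in can be sketched as a similarity-weighted average of the user's own ratings, using the similarities from the table's bottom row. The assignment of those similarities to movies is an assumption, and this simple aggregation is illustrative, so it will not reproduce the starred values exactly:

```python
# Similarity of each movie with "Me Before You" (assumed assignment).
sim_with_target = {"The Avengers": -1.0, "Sherlock": -1.0,
                   "Transformers": 0.86, "Matrix": 1.0, "Titanic": 1.0}

def predict_rating(user_ratings):
    """Similarity-weighted average of the user's ratings of other movies."""
    numerator = denominator = 0.0
    for movie, rating in user_ratings.items():
        s = sim_with_target.get(movie, 0.0)
        numerator += s * rating
        denominator += abs(s)
    return numerator / denominator if denominator else 0.0

# A user who rated Matrix 4 and Titanic 2 (both with similarity 1 to the
# target movie) gets their plain average as the prediction.
prediction = predict_rating({"Matrix": 4, "Titanic": 2})
```

Negative similarities pull the prediction down, which is why `abs(s)` appears in the denominator: it keeps the result on the rating scale.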

4.3 Singular Value Decomposition

One way to handle the scalability and sparsity issues created by CF is to leverage a latent factor model to
capture the similarity between users and items. Essentially, we want to turn the recommendation problem
into an optimization problem. We can view it as how good we are at predicting the rating of an item for
a given user. One common metric is Root Mean Square Error (RMSE): the lower the RMSE, the better
the performance.
Now, talking about latent factors, you might be wondering what they are. A latent factor is a broad idea
describing a property or concept that a user or an item has. For instance, for music, a latent factor can
refer to the genre that the music belongs to. SVD decreases the dimension of the utility matrix by
extracting its latent factors. Essentially, we map each user and each item into a latent space of dimension
r. Therefore, it helps us better understand the relationship between users and items, as they become
directly comparable. The figure below illustrates this idea.

Fig 4.3.1 Singular Value Decomposition

Now, enough said; let's see how to implement this. Since the dataset we used before did not have a
userId (which is necessary for collaborative filtering), let's load another dataset. We'll be using the
Surprise library to implement SVD.
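Since the Surprise API itself is not shown in this report, here is a library-free numpy sketch of the underlying latent-factor idea: keep only r singular components of a toy utility matrix and measure the reconstruction error with RMSE. Treating 0 as "unrated" is a simplification; real systems handle missing entries explicitly:

```python
import numpy as np

# Toy utility matrix: rows are users, columns are items (made-up ratings).
R = np.array([[5.0, 4.0, 0.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])

# Full SVD, then keep only r latent factors (truncated reconstruction).
U, s, Vt = np.linalg.svd(R, full_matrices=False)
r = 2
R_hat = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# RMSE between the rank-r reconstruction and the observed matrix;
# the lower the RMSE, the better the latent factors explain the ratings.
rmse = float(np.sqrt(np.mean((R - R_hat) ** 2)))
```

Here the first two users share a latent "taste" direction, so two factors already reconstruct the matrix almost exactly.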
Advantages of collaborative filtering based systems:
● They depend on the relations between users, which implies that they are content-independent.
● CF recommender systems can suggest serendipitous items by observing similar-minded people's
behavior.
● They can make real quality assessments of items by considering other people's experience.
Disadvantages of collaborative filtering are:
● Early-rater problem: collaborative filtering systems cannot provide recommendations for new items,
since there are no user ratings on which to base a prediction.
● Gray sheep: for a CF-based system to work, groups with similar characteristics are needed. Even
if such groups exist, it is very difficult to make recommendations for users who do not consistently
agree or disagree with these groups.
● Sparsity problem: in most cases the number of items exceeds the number of users by a great margin,
which makes it difficult to find items that are rated by enough people.
4.4 Comparison
Each approach has its advantages and disadvantages, and the effects differ across datasets. An approach
may not suit all kinds of problems because of the algorithm itself. For example, it is hard to apply
automated feature extraction to media data with the content-based filtering method, and the
recommendation results are limited to items similar to those the user has already chosen, which means
the diversity is not good. It is also very hard to recommend for users who have never chosen anything.
The collaborative filtering method overcomes these disadvantages to some extent, but CF is based on a
large amount of historical data, so it has sparsity and cold-start problems. In terms of cold start, as
collaborative filtering is based on the similarity between the items chosen by users, there is not only a
new-user problem [30] but also a new-item problem: it is hard for a new item to be recommended if it
has never been recommended before [1]. The comparison is in Table 2.3 [11].

5. FAMOUS RECOMMENDER SYSTEM

The difference between a recommender system and a search engine is that a recommender system is
based on the behaviour of users. Many websites in the world use recommender systems. A personalized
recommender system analyzes a huge amount of user behavior data and provides personalized content
to different users, which improves the click rate and conversions of the website. The fields that widely
use recommender systems are e-commerce, movies, video, music, social networks, reading,
location-based services, personalized email and advertisement.

5.1 E-Commerce
The most famous e-commerce website, Amazon, is an active adopter and promoter of recommender
systems. Amazon's recommender system reaches deep into all kinds of products. Fig 5.2.1 shows the
recommendation list of Amazon. Apart from the personalized recommendation list, another important
application of a recommender system is the relevant recommendation list shown when you buy
something on Amazon. Amazon has two kinds of relevant recommendation: one is "customers who
bought this item also bought", and the other is "what other items do customers buy after viewing this
item". The difference between the two recommendations lies in which user behaviors enter the
calculation. The most important application of relevant recommendation is cross-selling: when you are
buying something, Amazon will tell you what other customers who bought this item also bought and let
you decide whether to buy it at the same time. If you do, the goods will be packed together and a certain
discount provided.

5.2 Movie and Video website


A personalized recommender system is a very important application for movie and video websites; it
can help users find what they really like among vast numbers of videos. Netflix is the most successful
company in this field.

Recommendation algorithms
(1) Content-based recommendation
Advantages:
● The result is intuitive and easy to interpret
● No need for the users' access-history data
● No new-item problem and no sparsity problem
● Supported by the mature technology of classification learning
Disadvantages:
● Limited by the feature-extraction methods
● New-user problem
● The training of the classifier needs massive data
● Poor scalability
(2) Collaborative filtering
Advantages:
● No need for professional knowledge
● Performance improves as the number of users increases
● Automatic
● Easy to find a user's new points of interest
● Complex unstructured items can be processed, e.g. music, video, etc.
Disadvantages:
● Sparsity problem
● Poor scalability
● New-user and new-item problems
● The recommendation quality is limited by the historical data set

Fig 5.2.1 Personalized recommendation of Amazon

Fig 5.2.2 Relevant Recommendation, Customers Who Bought This Item Also
Bought

Amazon and Netflix are the two most representative companies in recommender systems. Below is the
recommendation page of Netflix. We can see that the recommendation result consists of the following
parts:
• The title and poster of the movie.
• The feedback of the user, including Play, Rating and Not Interested.
• The recommendation reason.

Fig 5.2.3 Relevant Recommendation, What Other Items Do Customers Buy After
Viewing This Item

It can be seen from the recommendation reasons of Netflix that its recommendation algorithm is similar
to Amazon's. Netflix has declared that 60% of its users find movies that they are interested in through
the recommender system.

Fig 5.2.4 Netflix Recommender System

As the biggest video website in America, YouTube has a huge number of videos uploaded by users, so
information overload is a serious problem for it. In a recent paper, in order to prove the usefulness of a
personalized recommender system, YouTube researchers carried out an experiment comparing the click
rates of a personalized recommendation list and a popularity-based list.
The result showed that the click rate of the personalized recommendation list was twice that of the
popular list.

5.3 Internet Radio

The successful application of a personalized recommender system has two requirements. The first is
information overload, because if users can easily find what they like, there is no reason to use a
recommender system. The second is that the user does not have clear requirements, because if they do,
they will use a search engine directly. Under these two requirements, a recommender system is very
suitable for personalized Internet radio. First of all, people cannot listen to all the music in the world to
find which pieces they like. Secondly, people often do not want to listen to one specific song; they wish
to listen to whatever music matches their mood at that moment. There are many personalized Internet
radios, such as Pandora and Last.fm. Fig 5.3.1 shows the main page of Last.fm.

Fig 5.3.1 Last.fm

6.DATA PREPROCESSING

Data preprocessing is a data mining technique used to transform raw data into a useful and
efficient format.

Steps Involved in Data Preprocessing:
6.1 Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It involves
handling of missing data, noisy data etc.

(a) Missing Data:

This situation arises when some values are missing from the data. It can be handled in various ways.
Some of them are:
Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a
tuple.

Fill the Missing Values:

There are various ways to do this task. You can choose to fill the missing values manually, by the
attribute mean, or with the most probable value.
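Filling missing values with the attribute mean can be sketched as follows (in practice pandas' fillna(df[col].mean()) does this in one call):

```python
# Fill missing entries (None) of an attribute with the attribute mean.
ages = [25, None, 30, 28, None, 35]

known = [a for a in ages if a is not None]   # ignore missing entries
mean_age = sum(known) / len(known)           # attribute mean = 29.5

filled = [a if a is not None else mean_age for a in ages]
print(filled)  # [25, 29.5, 30, 28, 29.5, 35]
```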

(b) Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data
collection, data entry errors, etc. It can be handled in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size,
and then various methods are applied to complete the task. Each segment is handled separately: one can
replace all data in a segment by its mean, or boundary values can be used to complete the task.
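Smoothing by bin means, as described above, can be sketched on a small sorted sample:

```python
# Smooth sorted data by equal-size bins, replacing each bin with its mean.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_ = data[i:i + bin_size]
    mean = sum(bin_) / len(bin_)
    smoothed.extend([mean] * len(bin_))   # every value replaced by its bin mean

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```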

Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having
one independent variable) or multiple (having multiple independent variables).

Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the
clusters.
6.2 Data Transformation:
This step is taken in order to transform the data into forms suitable for the mining process. It involves the
following ways:
Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).

Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.

Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual levels.

Concept Hierarchy Generation:


Here attributes are converted from lower level to higher level in hierarchy. For Example-The attribute “city”
can be converted to “country”.
6.3 Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder when working
with such volumes. To get around this, we use data reduction techniques, which
aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.

Attribute Subset Selection:


Only highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one
can use the significance level and p-value of the attribute: attributes with a p-value greater than the significance
level can be discarded.

Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: regression models.

Dimensionality Reduction:
This reduces the size of data using encoding mechanisms. It can be lossy or lossless. If the original data can be
retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called
lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
6.2 Data Transformation
The data are transformed into forms that are ideal for mining. Data transformation involves the following
steps:
1. Smoothing:
It is a process used to remove noise from the dataset using some algorithm. It allows highlighting the
important features present in the dataset and helps in predicting patterns. When collecting data, it can be
manipulated to eliminate or reduce variance or any other form of noise.
The concept behind data smoothing is that it identifies simple changes to help predict different
trends and patterns. This helps analysts or traders who need to look at a lot of data, which can often
be difficult to digest, to find patterns they would not see otherwise.
2. Aggregation:
Data collection or aggregation is the method of storing and presenting data in a summary format. The data may
be obtained from multiple data sources to integrate these data sources into a data analysis description. This is a
crucial step since the accuracy of data analysis insights is highly dependent on the quantity and quality of the
data used. Gathering accurate data of high quality and a large enough quantity is necessary to produce relevant
results.
The collection of data is useful for everything from decisions concerning financing or business strategy of the
product, pricing, operations, and marketing strategies.
For example, sales data may be aggregated to compute monthly and annual total amounts.
3. Discretization:
It is a process of transforming continuous data into a set of small intervals. Most data mining activities in the
real world involve continuous attributes, yet many existing data mining frameworks are unable to handle
them.
Also, even if a data mining task can manage a continuous attribute, its efficiency can be significantly improved
by replacing the continuous attribute with its discrete values.
For example, (1-10, 11-20), or (age: young, middle-aged, senior).
4. Attribute Construction:
Where new attributes are created & applied to assist the mining process from the given set of attributes. This
simplifies the original data & makes the mining more efficient.
5. Generalization:
It converts low-level data attributes to high-level data attributes using concept hierarchy. For Example Age
initially in Numerical form (22, 25) is converted into categorical value (young, old).
6. Normalization: Data normalization involves converting all data variable into a given range.
Techniques that are used for normalization are:
Min-Max Normalization:
This transforms the original data linearly.
Suppose that min_A is the minimum and max_A is the maximum of an attribute A. We have the formula:

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

where v is the value you want to map into the new range, and v' is the new value you get after
normalizing the old value.
Z-Score Normalization:
In z-score normalization (or zero-mean normalization) the values of an attribute A are normalized based on
the mean of A and its standard deviation.
A value, v, of attribute A is normalized to v' by computing:

    v' = (v - mean_A) / σ_A

Decimal Scaling:
It normalizes the values of an attribute by shifting their decimal points.
The number of places by which the decimal point is moved is determined by the maximum absolute value
of attribute A.
A value, v, of attribute A is normalized to v' by computing:

    v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.

6.3 Data Reduction


The method of data reduction may achieve a condensed description of the original data which is much smaller
in quantity but keeps the quality of the original data.
The methods of data reduction are explained below.
1. Data Cube Aggregation:
This technique is used to aggregate data in a simpler form. For example, imagine that the information you
gathered for your analysis covers the years 2012 to 2014 and includes your company's revenue every three
months. If you are interested in annual sales rather than quarterly totals, you can summarize the data
so that the result gives total sales per year instead of per quarter.
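The quarterly-to-annual aggregation described above can be sketched as follows (the figures are made up):

```python
# Aggregate quarterly sales into annual totals (data cube aggregation).
quarterly_sales = {
    (2012, "Q1"): 224, (2012, "Q2"): 408, (2012, "Q3"): 350, (2012, "Q4"): 586,
    (2013, "Q1"): 312, (2013, "Q2"): 290, (2013, "Q3"): 450, (2013, "Q4"): 500,
}

annual_sales = {}
for (year, _quarter), amount in quarterly_sales.items():
    annual_sales[year] = annual_sales.get(year, 0) + amount

print(annual_sales)  # {2012: 1568, 2013: 1552}
```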
2. Dimension Reduction:
Whenever we come across weakly important data, we keep only the attributes required for our
analysis. This reduces data size, as it eliminates outdated or redundant features.

Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step, the best of the remaining original attributes
is added to the set based on its relevance, which in statistics is assessed with a p-value.
Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data and, at each step, eliminates the
worst remaining attribute in the set.
Combination of Forward and Backward Selection –
It allows us to remove the worst and select the best attributes, saving time and making the process faster.
3. Data Compression:
The data compression technique reduces the size of the files using different encoding mechanisms (Huffman
Encoding & run-length Encoding). We can divide it into two types based on their compression techniques.
Lossless Compression –
Encoding techniques (such as Run Length Encoding) allow a simple and minimal reduction of data size. Lossless
data compression uses algorithms to restore the precise original data from the compressed data.
Lossy Compression –
Methods such as the Discrete Wavelet Transform and PCA (Principal Component Analysis) are examples of
this compression. For example, the JPEG image format is a lossy compression, but we can still recover an image
with meaning equivalent to the original. In lossy compression, the decompressed data may differ from the
original data but is useful enough to retrieve information from.
4. Numerosity Reduction:
In this technique, the actual data is replaced with a mathematical model or a smaller representation of
the data, so only the model parameters need to be stored; alternatively, non-parametric methods
such as clustering, histograms, or sampling can be used.
5. Discretization & Concept Hierarchy Operation:
Techniques of data discretization are used to divide attributes of a continuous nature into data with
intervals. We replace many constant values of the attributes with labels of small intervals. This means that
mining results are shown in a concise, easily understandable way.
Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole set of
attributes and repeat this method until the end, the process is known as top-down discretization, also
known as splitting.

Bottom-up discretization –
If you first consider all the constant values as split points and then discard some by combining
neighbouring values into intervals, that process is called bottom-up discretization.
Concept Hierarchies:
It reduces the data size by collecting and then replacing low-level concepts (such as 43 for age) with high-
level concepts (categorical variables such as middle-aged or senior).
For numeric data following techniques can be followed:
Binning–
Binning is the process of changing numerical variables into categorical counterparts. The number of categorical
counterparts depends on the number of bins specified by the user.
Histogram analysis –
Like binning, a histogram is used to partition the values of an attribute X into disjoint ranges
called buckets. There are several partitioning rules:
Equal Frequency partitioning: partitioning the values based on their number of occurrences in the data set.
Equal Width partitioning: partitioning the values into fixed-width ranges based on the number of bins, e.g. a set of
values ranging from 0-20.
Clustering: grouping similar data together.
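Equal-width and equal-frequency partitioning can be contrasted on a small sample (the values and the choice of three bins are illustrative):

```python
values = [5, 10, 11, 13, 15, 35, 50, 55, 72]

# Equal width: three fixed-size ranges over [min, max].
width = (max(values) - min(values)) / 3        # (72 - 5) / 3
def eq_width_bin(v):
    # Map a value to its bin index 0..2, clamping the maximum into the last bin.
    return min(int((v - min(values)) // width), 2)

# Equal frequency: each bin gets the same number of sorted values.
n = len(values) // 3
eq_freq_bins = [sorted(values)[i:i + n] for i in range(0, len(values), n)]

print([eq_width_bin(v) for v in values])  # [0, 0, 0, 0, 0, 1, 2, 2, 2]
print(eq_freq_bins)                       # [[5, 10, 11], [13, 15, 35], [50, 55, 72]]
```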
As an example of step-wise forward selection:
Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }

Step 1: {X1}
Step 2: {X1, X2}
Step 3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

7.DATA NORMALIZATION

It is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is
generally useful for classification algorithms.

Need for Normalization –
Normalization is generally required when we are dealing with attributes on different scales; otherwise,
it may dilute the effectiveness of an equally important attribute (on a lower scale)
because other attributes have values on a larger scale.
In simple words, when multiple attributes have values on different scales, this
may lead to poor data models while performing data mining operations, so they are normalized to
bring all the attributes onto the same scale.

Person_name   Salary    Years of experience   Expected position level
Aman          100000    10                    2
Abhinav        78000     7                    4
Ashutosh       32000     5                    8
Dishi          55000     6                    7
Abhishek       92000     8                    3
Avantika     1200000    15                    1
Ayushi         65750     7                    5

Methods of Data Normalization –

Decimal Scaling
Min-Max Normalization
Z-Score Normalization (zero-mean normalization)

Decimal Scaling Method for Normalization –
It normalizes by moving the decimal point of the data values. To normalize by this technique, we
divide each value by the maximum absolute value of the data. A data value, v_i, is normalized
to v_i' using the formula below:

    v_i' = v_i / 10^j

where j is the smallest integer such that max(|v_i'|) < 1.

Min-Max Normalization –
In this technique of data normalization, a linear transformation is performed on the original data. The minimum
and maximum values are fetched from the data, and each value is replaced according to the following formula:

    v' = ((v - min(A)) / (max(A) - min(A))) * (new_max(A) - new_min(A)) + new_min(A)

where A is the attribute data,
min(A), max(A) are the minimum and maximum values of A respectively,
v' is the new value of each entry in the data,
v is the old value of each entry in the data, and
new_max(A), new_min(A) are the maximum and minimum values of the required range (i.e. its boundary values)
respectively.
Z-score normalization –
In this technique, values are normalized based on the mean and standard deviation of the data A. The
formula used is:

    v' = (v - Ā) / σ_A

where v' and v are the new and old values of each entry in the data respectively, and σ_A and Ā are the
standard deviation and mean of A respectively.
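The three normalization methods can be sketched on the salary column of the table above (the population standard deviation is used for the z-score):

```python
salaries = [100000, 78000, 32000, 55000, 92000]

# Min-max normalization: map values linearly into [0, 1].
lo, hi = min(salaries), max(salaries)
min_max = [(v - lo) / (hi - lo) for v in salaries]

# Z-score normalization: center on the mean, scale by the standard deviation.
mean = sum(salaries) / len(salaries)
std = (sum((v - mean) ** 2 for v in salaries) / len(salaries)) ** 0.5
z_score = [(v - mean) / std for v in salaries]

# Decimal scaling: divide by 10^j, where j is the digit count of max|v|.
j = len(str(int(max(abs(v) for v in salaries))))
decimal_scaled = [v / 10 ** j for v in salaries]

print(min_max)         # maximum maps to 1.0, minimum to 0.0
print(decimal_scaled)  # [0.1, 0.078, 0.032, 0.055, 0.092]
```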

8.VECTORIZATION

Recommending movies to users can be done in multiple ways using content-based filtering and collaborative
filtering approaches. The content-based filtering approach primarily focuses on item similarity, i.e., the
similarity between movies, whereas collaborative filtering focuses on drawing a relation between different users of
similar tastes in watching movies. Based on the plot of a movie that was watched by the user in the past,
movies with a similar plot can be recommended to the user. This approach comes under content-based filtering,
as the recommendations are made only based on the user's past activity.
Vectorization is a technique of implementing array operations without using for loops. Instead, we use
functions defined by various modules which are highly optimized that reduces the running and execution time
of code. Vectorized array operations will be faster than their pure Python equivalents, with the biggest impact
in any kind of numerical computations.
Python for-loops are slower than their C/C++ counterparts because Python is an interpreted language. The
main reason for this slow computation comes down to the dynamic nature of
Python and the lack of compiler-level optimizations, which incur memory overheads. NumPy, being a C
implementation of arrays in Python, provides vectorized operations on NumPy arrays.
Vectorized Operations using NumPy
1. Add/Subtract/Multiply/Divide by Scalar
Addition, Subtraction, Multiplication, and Division of an array by a scalar quantity result in an array of the
same dimensions while updating all the elements of the array with a given scalar. We apply this operation just
like we do with variables. The code is both small and fast as compared to for-loop implementation.
To calculate the execution time, we will use timer class present in timeit module which takes the statement to
execute, and then call timeit() method that takes how many times to repeat the statement. Note that the output
computation time is not exactly the same always and depends on the hardware and other factors.
2. Sum and Max of an array
For finding the sum and the maximum element of an array, we can use a for loop as well as Python's built-in
functions sum() and max() respectively. Let's compare both of these with NumPy operations.
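A sketch of the comparison described above, timing a pure-Python loop against NumPy's arr.sum() (assumes NumPy is installed; absolute timings vary by hardware):

```python
import timeit
import numpy as np

arr = np.arange(100_000)

def loop_sum():
    total = 0
    for x in arr:          # pure-Python loop: one interpreter step per element
        total += x
    return int(total)

def vector_sum():
    return int(arr.sum())  # single call into NumPy's optimized C loop

t_loop = timeit.timeit(loop_sum, number=1)
t_vec = timeit.timeit(vector_sum, number=1)
print(loop_sum() == vector_sum())  # True: both give the same result
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```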

9.Tokenize text using NLTK

Natural language toolkit(NLTK) has to be installed in your system.


The NLTK module is a massive tool kit, aimed at helping you with the entire Natural Language Processing
(NLP) methodology.
In order to install NLTK, run the following command in your terminal:
sudo pip install nltk
Then enter the Python shell by simply typing python, and run:
import nltk
nltk.download('all')
The above installation will take quite some time due to the massive number of tokenizers, chunkers, other
algorithms, and all of the corpora to be downloaded.
Some terms that will be frequently used are :
Corpus – Body of text, singular. Corpora is the plural of this.
Lexicon – Words and their meanings.
Token – Each "entity" that is part of whatever was split up based on rules. For example, each word is a
token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenize the
sentences out of a paragraph.
So, basically, tokenizing involves splitting sentences and words from the body of the text.
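NLTK's word_tokenize and sent_tokenize (backed by trained models) are the tools the text describes; the regex sketch below only illustrates what tokenization produces, and is not a replacement for NLTK's tokenizers:

```python
import re

text = "Tokenizing splits text into units. Each word is a token!"

# Sentence tokens: split after terminal punctuation followed by whitespace.
sentences = re.split(r'(?<=[.!?])\s+', text)
# Word tokens: runs of word characters (NLTK also keeps punctuation tokens).
words = re.findall(r"\w+", text)

print(sentences)  # ['Tokenizing splits text into units.', 'Each word is a token!']
print(words[:4])  # ['Tokenizing', 'splits', 'text', 'into']
```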

10.REMOVING STOPWORDS USING NLTK

The process of converting data to something a computer can understand is referred to as pre-processing. One
of the major forms of pre-processing is to filter out useless data. In natural language processing, useless words
(data), are referred to as stop words.

What are Stop words?
Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has
been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of
a search query.
We would not want these words to take up space in our database or to take up valuable processing time. For
this, we can remove them easily by storing a list of words that we consider to be stop words. NLTK (Natural
Language Toolkit) in Python has lists of stopwords stored in 16 different languages. You can find them in the
nltk_data directory, e.g. home/pratima/nltk_data/corpora/stopwords (do not forget to
change the home directory name to your own).

Sample Text with Stop Words              Without Stop Words

GeeksforGeeks – A Computer               GeeksforGeeks, Computer Science,
Science Portal for Geeks                 Portal, Geeks
Can Listening be exhausting?             Listening, Exhausting
I like reading, so I read                Like, Reading, Read
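A minimal sketch of stop-word removal using a small hard-coded list; with NLTK you would use stopwords.words('english') from nltk.corpus after downloading the stopwords corpus:

```python
# A tiny stand-in for NLTK's English stopword list.
stop_words = {"a", "an", "the", "in", "so", "i", "for", "be", "can"}

sentence = "I like reading so I read"
tokens = sentence.split()

# Keep only tokens that are not stop words (case-insensitive).
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # ['like', 'reading', 'read']
```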

11.LEMMATIZATION

Lemmatization is the process of grouping together the different inflected forms of a word so they can be
analyzed as a single item. Lemmatization is similar to stemming but it brings context to the words. So it links
words with similar meanings to one word.

Text preprocessing includes both stemming as well as Lemmatization. Many times people find these two terms
confusing. Some treat these two as the same. Actually, lemmatization is preferred over Stemming because
lemmatization does morphological analysis of the words.
Applications of lemmatization are:

Used in comprehensive retrieval systems like search engines.


Used in compact indexing

One major difference from stemming is that lemmatization takes a part-of-speech parameter, "pos". If not
supplied, the default is "noun".
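A toy lookup-table lemmatizer, only to illustrate why the pos parameter matters; NLTK's WordNetLemmatizer().lemmatize(word, pos) consults WordNet rather than a hand-written table like this:

```python
# A hypothetical, hand-written lemma table keyed by (word, part of speech).
LEMMAS = {
    ("better", "a"): "good",     # adjective
    ("running", "v"): "run",     # verb
    ("feet", "n"): "foot",       # noun
}

def lemmatize(word, pos="n"):    # like NLTK, the default pos is noun
    return LEMMAS.get((word, pos), word)

print(lemmatize("better", pos="a"))  # good
print(lemmatize("better"))           # better (noun lookup finds nothing)
```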

12. STEMMING WORDS

Stemming is the process of reducing the morphological variants of a word to its root/base form. Stemming
programs are commonly referred to as stemming algorithms or stemmers. A stemming algorithm reduces the
words "chocolates", "chocolatey", "choco" to the root word "chocolate", and "retrieval", "retrieved", "retrieves"
reduce to the stem "retrieve".
Errors in Stemming:
There are mainly two errors in stemming: over-stemming and under-stemming. Over-stemming occurs when
two words of different stems are reduced to the same root. Under-stemming occurs when two words that
should be reduced to the same root are not.
Applications of stemming are:
Stemming is used in information retrieval systems like search engines.
It is used to determine domain vocabularies in domain analysis.
Stemming is desirable as it may reduce redundancy as most of the time the word stem and their
inflected/derived words mean the same.
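A toy suffix-stripping stemmer in the spirit of (but far simpler than) NLTK's PorterStemmer, to illustrate the idea:

```python
# Suffixes tried in order; longest first so "ing" wins over "s".
SUFFIXES = ["ing", "ed", "es", "s", "ly"]

def stem(word):
    for suf in SUFFIXES:
        # Strip the suffix only if at least 3 characters of stem remain.
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([stem(w) for w in ["retrieved", "retrieves", "retrieving", "cats"]])
# ['retriev', 'retriev', 'retriev', 'cat']
```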

13. CONCLUSION

In this project we have implemented and learned the following:
• Building a Movie Recommendation System

• To find the Similarity Scores and Indexes.
• Compute Distance Between Two Vectors
• Cosine Similarity
• Many more ML related concepts and techniques.
Research paper recommender systems help library users in finding or getting most relevant research papers over a
large volume of research papers in a digital library. This paper adopted content-based filtering technique to
provide recommendations to the intended users. Based on the results of the system, integrating recommendation
features in digital libraries would be useful to library users. The solution to this problem came as a result of the
availability of the contents describing the items and users' profiles of interest. Content-based techniques are
independent of the users' ratings but depend on these contents. This paper also presents an algorithm to provide or
suggest recommendations based on the users' query. The algorithm employs cosine similarity measure.
The next step of our future work is to adopt a hybrid algorithm to see how the combination of collaborative and
content-based filtering techniques can give us better recommendations compared to the technique adopted in this
paper. The content-based technique is adopted or considered here for the design of the recommender system for
digital libraries. Content-based technique is suitable in situations or domains where items are more than users.
Library users do experience difficulties in getting or finding favourite digital objects (e.g. research papers) from a
large collection of digital objects in digital libraries.

14. REFERENCES

1. https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata

2. Peng, Xiao, Shao Liangshan, and Li Xiuran. "Improved Collaborative Filtering Algorithm in the
Research and Application of Personalized Movie Recommendations",
2013 Fourth International Conference on Intelligent Systems Design and Engineering Applications, 2013.

3. Munoz-Organero, Mario, Gustavo A. Ramírez-González, Pedro J. Munoz-Merino, and Carlos
Delgado Kloos. "A Collaborative Recommender System Based on Space-Time Similarities",
IEEE Pervasive Computing, 2010.

4. Al-Shamri, M.Y.H. "Fuzzy-genetic approach to recommender systems based on a novel hybrid
user model", Expert Systems With Applications, 2008.

5. Hu Jinming. "Application and research of collaborative filtering in e-commerce recommendation


system", 2010 3rd International Conference on Computer Science and Information Technology,
July 2010.

6. Suvir Bhargav. Efficient features for movie recommendation systems. 2014.

7. Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A
survey of the state-of-the-art and possible extensions. Knowledge and Data Engineering, IEEE
Transactions on, 17(6):734–749, 2005.

8. Zhi-Dan Zhao and Ming-Sheng Shang. User-based collaborative-filtering recommendation algorithms
on hadoop. In Knowledge Discovery and Data Mining, 2010. WKDD’10. Third International
Conference on, pages 478–481. IEEE, 2010.

15. Appendices

Importing required libraries

The dataset consists of two files, movies and credits. We are going to merge them
into a single dataframe.
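The merge shown in the (lost) screenshot can be sketched as below; in practice the two frames would come from pd.read_csv on the TMDB movies and credits files, and the tiny frames here are stand-ins with assumed column names:

```python
import pandas as pd

# Stand-ins for the TMDB 5000 movies and credits files.
movies = pd.DataFrame({"title": ["Avatar", "Spectre"],
                       "genres": ["Action", "Action"]})
credits = pd.DataFrame({"title": ["Avatar", "Spectre"],
                        "cast": ["Sam W.", "Daniel C."]})

# Merge on the shared title column: one row per movie with all columns.
merged = movies.merge(credits, on="title")
print(merged.columns.tolist())  # ['title', 'genres', 'cast']
```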

The dataset contains overview, genres, keywords, cast, and crew, which together will form the tags.
In the implementation below we are going to convert these columns into lists.

Here we are going to keep the genre names (e.g. Action, Adventure, Fantasy, Science Fiction) and
similarly convert the keywords column into a list.
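The genres and keywords columns in the TMDB files are stringified lists of dicts; a sketch of the conversion using the standard library's ast.literal_eval:

```python
import ast

# A genres cell as it appears in the TMDB csv: a string, not a Python list.
genres_raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def to_name_list(raw):
    # Safely parse the string, then keep only the genre names as tags.
    return [d["name"] for d in ast.literal_eval(raw)]

print(to_name_list(genres_raw))  # ['Action', 'Adventure']
```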

Vectorization
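The vectorization step shown in the screenshots can be sketched without scikit-learn: a bag-of-words count vector per movie's tags, compared with cosine similarity (the tag strings here are made-up examples):

```python
import numpy as np

# One tag string per movie (stand-ins for the real combined tags).
tags = ["action adventure space", "action space war", "romance comedy"]

# Bag-of-words: one count per vocabulary word, per movie.
vocab = sorted({w for t in tags for w in t.split()})
vectors = np.array([[t.split().count(w) for w in vocab] for t in tags],
                   dtype=float)

def cosine(a, b):
    # Cosine similarity: dot product over the product of the norms.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine(vectors[0], vectors[1]), 3))  # high: shared tags
print(round(cosine(vectors[0], vectors[2]), 3))  # 0.0: no shared tags
```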

Stemming

Creating the website using PyCharm

