Project 2 - Movielens Case Study
Project 2 - Movielens Case Study
DESCRIPTION
The GroupLens Research Project is a research group in the Department of Computer Science and
Engineering at the University of Minnesota. Members of the GroupLens Research Project are
involved in many research projects related to the fields of information filtering, collaborative filtering,
and recommender systems. The project is led by professors John Riedl and Joseph Konstan. The
project began to explore automated collaborative filtering in 1992 but is most well known for its
worldwide trial of an automated collaborative filtering system for Usenet news in 1996. Since then,
the project has expanded its scope to research overall information by filtering solutions, integrating
into content-based methods, as well as, improving current collaborative filtering technology.
Problem Objective:
Here, we ask you to perform the analysis using the Exploratory Data Analysis technique. You need to
find features affecting the ratings of any particular movie and build a model to predict the movie
ratings.
Domain: Entertainment
Create a new dataset [Master_Data] with the following columns MovieID Title UserID Age Gender
Occupation Rating. (Hint: (i) Merge two tables at a time. (ii) Merge the tables using two primary keys
MovieID & UserId)
Explore the datasets using visual representations (graphs or tables), also include your comments on
the following:
Feature Engineering:
Find out all the unique genres (Hint: split the data in column genre making a list and then process the
data to find out only the unique categories of genres)
Create a separate column for each genre category with a one-hot encoding ( 1 and 0) whether or not
the movie belongs to that genre.
Dataset Description:
These files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040
MovieLens users who joined MovieLens in 2000.
Ratings.dat
Format - UserID::MovieID::Rating::Timestamp
Field Description
Users.dat
Format - UserID::Gender::Age::Occupation::Zip-code
Field Description
All demographic information is provided voluntarily by the users and is not checked for accuracy.
Only users who have provided demographic information are included in this data set.
Value Description
1 "Under 18"
18 "18-24"
25 "25-34"
35 "35-44"
45 "45-49"
50 "50-55"
56 "56+"
Value
Description
1 "academic/educator"
2 "artist”
3 "clerical/admin"
4 "college/grad student"
5 "customer service"
6 "doctor/health care"
7 "executive/managerial"
8 "farmer"
9 "homemaker"
10 "K-12 student"
11 "lawyer"
12 "programmer"
13 "retired"
14 "sales/marketing"
15 "scientist"
16 "self-employed"
17 "technician/engineer"
18 "tradesman/craftsman"
19 "unemployed"
20 "writer”
Movies.dat
Format - MovieID::Title::Genres
Field Description
Titles are identical to titles provided by the IMDB (including year of release)
Genres are pipe-separated and are selected from the following genres:
Action
Adventure
Animation
Children's
Comedy
Crime
Documentary
Drama
Fantasy
Film-Noir
Horror
Musical
Mystery
Romance
Sci-Fi
Thriller
War
Western
Some MovieIDs do not correspond to a movie due to accidental duplicate entries and/or test entries
Movies are mostly entered by hand, so errors and inconsistencies may exist
Dataset: https://drive.google.com/file/d/1kMySbxVf7kmXU07AVi7MZKwDfcRLfh7_/view?
usp=drive_link