Movie Recommendation System in R Jupyter Notebook

The document discusses preprocessing data for a movie recommendation system built in R. It loads movie and rating data, then cleans the movie genre data by splitting genres on pipes (|) and converting to a binary matrix indicating the presence or absence of each genre. The rating data is also loaded and summarized. Finally, the movie IDs, titles, and cleaned genre matrix are bound together into a search matrix to complete the preprocessing steps.

11/18/23, 2:37 PM movie-recommendation-system-in-r - Jupyter Notebook

In [1]: # importing library



library(recommenderlab)
library(ggplot2)
library(data.table)
library(reshape2)

Loading required package: Matrix

Loading required package: arules

Attaching package: ‘arules’

The following objects are masked from ‘package:base’:

abbreviate, write

Loading required package: proxy

Attaching package: ‘proxy’

The following object is masked from ‘package:Matrix’:

as.matrix

The following objects are masked from ‘package:stats’:

as.dist, dist

The following object is masked from ‘package:base’:

as.matrix

Loading required package: registry

Registered S3 methods overwritten by 'registry':


method from
print.registry_field proxy
print.registry_entry proxy

Attaching package: ‘reshape2’

The following objects are masked from ‘package:data.table’:

dcast, melt

localhost:8888/notebooks/movie-recommendation-system-in-r.ipynb 1/18

Retrieving the Data


In [2]: movie_data <- read.csv("../input/top-movies/movies.csv", stringsAsFactors = FALSE)
rating_data <- read.csv("../input/movie-rating/ratings.csv")
str(movie_data)

'data.frame': 10329 obs. of 3 variables:
 $ movieId: int 1 2 3 4 5 6 7 8 9 10 ...
 $ title  : chr "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
 $ genres : chr "Adventure|Animation|Children|Comedy|Fantasy" "Adventure|Children|Fantasy" "Comedy|Romance" "Comedy|Drama|Romance" ...

In [3]: summary(movie_data)

    movieId          title              genres
 Min.   :     1   Length:10329       Length:10329
 1st Qu.:  3240   Class :character   Class :character
 Median :  7088   Mode  :character   Mode  :character
 Mean   : 31924
 3rd Qu.: 59900
 Max.   :149532

In [4]: head(movie_data)

A data.frame: 6 × 3

   movieId  title                               genres
   <int>    <chr>                               <chr>
1  1        Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
2  2        Jumanji (1995)                      Adventure|Children|Fantasy
3  3        Grumpier Old Men (1995)             Comedy|Romance
4  4        Waiting to Exhale (1995)            Comedy|Drama|Romance
5  5        Father of the Bride Part II (1995)  Comedy
6  6        Heat (1995)                         Action|Crime|Thriller

In [5]: summary(rating_data)

     userId          movieId           rating         timestamp
 Min.   :  1.0   Min.   :     1   Min.   :0.500   Min.   :8.286e+08
 1st Qu.:192.0   1st Qu.:  1073   1st Qu.:3.000   1st Qu.:9.711e+08
 Median :383.0   Median :  2497   Median :3.500   Median :1.115e+09
 Mean   :364.9   Mean   : 13381   Mean   :3.517   Mean   :1.130e+09
 3rd Qu.:557.0   3rd Qu.:  5991   3rd Qu.:4.000   3rd Qu.:1.275e+09
 Max.   :668.0   Max.   :149532   Max.   :5.000   Max.   :1.452e+09


In [6]: head(rating_data)

A data.frame: 6 × 4

   userId  movieId  rating  timestamp
   <int>   <int>    <dbl>   <int>
1  1       16       4.0     1217897793
2  1       24       1.5     1217895807
3  1       32       4.0     1217896246
4  1       47       4.0     1217896556
5  1       50       4.0     1217896523
6  1       110      4.0     1217896150

Data Pre-processing


In [7]: movie_genre <- as.data.frame(movie_data$genres, stringsAsFactors = FALSE)

library(data.table) # provides tstrsplit()
# split each pipe-separated genre string into one column per genre slot
movie_genre2 <- as.data.frame(tstrsplit(movie_genre[, 1], '[|]',
                                        type.convert = TRUE),
                              stringsAsFactors = FALSE) #DataFlair
colnames(movie_genre2) <- c(1:10)

list_genre <- c("Action", "Adventure", "Animation", "Children",
                "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
                "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
                "Sci-Fi", "Thriller", "War", "Western")
# the first row temporarily holds the genre names, hence one extra row
genre_mat1 <- matrix(0, nrow(movie_data) + 1, length(list_genre))
genre_mat1[1, ] <- list_genre
colnames(genre_mat1) <- list_genre

# flag every genre that appears in any of a movie's genre slots
for (index in 1:nrow(movie_genre2)) {
  for (col in 1:ncol(movie_genre2)) {
    gen_col <- which(genre_mat1[1, ] == movie_genre2[index, col])
    genre_mat1[index + 1, gen_col] <- 1
  }
}
genre_mat2 <- as.data.frame(genre_mat1[-1, ], stringsAsFactors = FALSE) # remove the first row (the genre names)
for (col in 1:ncol(genre_mat2)) {
  genre_mat2[, col] <- as.integer(genre_mat2[, col]) # convert from characters to integers
}
str(genre_mat2)

'data.frame': 10329 obs. of 18 variables:


$ Action : int 0 0 0 0 0 1 0 0 1 1 ...
$ Adventure : int 1 1 0 0 0 0 0 1 0 1 ...
$ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
$ Children : int 1 1 0 0 0 0 0 1 0 0 ...
$ Comedy : int 1 0 1 1 1 0 1 0 0 0 ...
$ Crime : int 0 0 0 0 0 1 0 0 0 0 ...
$ Documentary: int 0 0 0 0 0 0 0 0 0 0 ...
$ Drama : int 0 0 0 1 0 0 0 0 0 0 ...
$ Fantasy : int 1 1 0 0 0 0 0 0 0 0 ...
$ Film-Noir : int 0 0 0 0 0 0 0 0 0 0 ...
$ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
$ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
$ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
$ Romance : int 0 0 1 1 0 0 1 0 0 0 ...
$ Sci-Fi : int 0 0 0 0 0 0 0 0 0 0 ...
$ Thriller : int 0 0 0 0 0 1 0 0 0 1 ...
$ War : int 0 0 0 0 0 0 0 0 0 0 ...
$ Western : int 0 0 0 0 0 0 0 0 0 0 ...
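
The nested loop in cell 7 visits every movie and every genre slot one at a time. As a minimal sketch of the same idea, assuming only base R, the membership test can be vectorized with strsplit() and %in%. The three genre strings are copied from head(movie_data); the names genres, split_genres and genre_mat are illustrative and not part of the notebook.

```r
# Toy stand-in for movie_data$genres (values taken from head(movie_data))
genres <- c("Adventure|Animation|Children|Comedy|Fantasy",
            "Comedy|Romance",
            "Action|Crime|Thriller")

list_genre <- c("Action", "Adventure", "Animation", "Children",
                "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
                "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
                "Sci-Fi", "Thriller", "War", "Western")

# Split on the literal pipe character; fixed = TRUE avoids regex escaping
split_genres <- strsplit(genres, "|", fixed = TRUE)

# For each movie, mark which known genres occur in its genre list.
# vapply returns one column per movie, so transpose to one row per movie.
genre_mat <- t(vapply(split_genres,
                      function(g) as.integer(list_genre %in% g),
                      integer(length(list_genre))))
colnames(genre_mat) <- list_genre

genre_mat[, c("Action", "Adventure", "Comedy", "Romance")]
```

This avoids the placeholder first row and the character-to-integer conversion, at the cost of departing from the notebook's own code.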


In [8]: SearchMatrix <- cbind(movie_data[,1:2], genre_mat2[])


head(SearchMatrix) #DataFlair

(output clipped in the export: the movieId and title columns on the left and the genre columns after Film-Noir are cut off)

Action  Adventure  Animation  Children  Comedy  Crime  Documentary  Drama  Fantasy  Film-Noir
<int>   <int>      <int>      <int>     <int>   <int>  <int>        <int>  <int>    <int>
0       1          1          1         1       0      0            0      1        0
0       1          0          1         0       0      0            0      1        0
0       0          0          0         1       0      0            0      0        0
0       0          0          0         1       0      0            1      0        0
0       0          0          0         1       0      0            0      0        0
1       0          0          0         0       1      0            0      0        0

In [9]: ratingMatrix <- dcast(rating_data, userId~movieId, value.var = "rating", na


ratingMatrix <- as.matrix(ratingMatrix[,-1]) #remove userIds
#Convert rating matrix into a recommenderlab sparse matrix
ratingMatrix <- as(ratingMatrix, "realRatingMatrix")
ratingMatrix

668 x 10325 rating matrix of class ‘realRatingMatrix’ with 105339 ratings.
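
The dcast() call in cell 9 pivots the long rating_data table into one row per user and one column per movie. The following is a rough base-R sketch of that reshaping, using tapply() instead of dcast(). The rows for user 1 come from head(rating_data); the rows for users 2 and 3 are made up for illustration.

```r
# Long format: one row per (user, movie, rating) triple.
# User 1 rows are from head(rating_data); users 2 and 3 are hypothetical.
ratings <- data.frame(
  userId  = c(1, 1, 1, 2, 3),
  movieId = c(16, 24, 32, 16, 32),
  rating  = c(4.0, 1.5, 4.0, 3.0, 5.0)
)

# Wide format: users in rows, movies in columns.
# Pairs without a rating become NA.
wide <- tapply(ratings$rating,
               list(userId = ratings$userId, movieId = ratings$movieId),
               FUN = mean)

wide
```

With recommenderlab loaded, a matrix of this shape is what as(..., "realRatingMatrix") expects, with NA marking missing ratings.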

In [11]: recommendation_model <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")


names(recommendation_model)

'HYBRID_realRatingMatrix' · 'ALS_realRatingMatrix' · 'ALS_implicit_realRatingMatrix' ·
'IBCF_realRatingMatrix' · 'LIBMF_realRatingMatrix' · 'POPULAR_realRatingMatrix' ·
'RANDOM_realRatingMatrix' · 'RERECOMMEND_realRatingMatrix' ·
'SVD_realRatingMatrix' · 'SVDF_realRatingMatrix' · 'UBCF_realRatingMatrix'


In [12]: lapply(recommendation_model, "[[", "description")

$HYBRID_realRatingMatrix
'Hybrid recommender that aggegates several recommendation strategies using weighted averages.'
$ALS_realRatingMatrix
'Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm.'
$ALS_implicit_realRatingMatrix
'Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm.'
$IBCF_realRatingMatrix
'Recommender based on item-based collaborative filtering.'
$LIBMF_realRatingMatrix
'Matrix factorization with LIBMF via package recosystem (https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html).'
$POPULAR_realRatingMatrix
'Recommender based on item popularity.'
$RANDOM_realRatingMatrix
'Produce random recommendations (real ratings).'
$RERECOMMEND_realRatingMatrix
'Re-recommends highly rated items (real ratings).'
$SVD_realRatingMatrix
'Recommender based on SVD approximation with column-mean imputation.'
$SVDF_realRatingMatrix
'Recommender based on Funk SVD with gradient descend (https://sifter.org/~simon/journal/20061211.html).'
$UBCF_realRatingMatrix
'Recommender based on user-based collaborative filtering.'

We will implement a single model in our R project: item-based collaborative filtering (IBCF).

In [13]: recommendation_model$IBCF_realRatingMatrix$parameters

$k
30
$method
'Cosine'
$normalize
'center'
$normalize_sim_matrix
FALSE
$alpha
0.5
$na_as_zero
FALSE


Exploring Similar Data


In [14]: similarity_mat <- similarity(ratingMatrix[1:4, ],
method = "cosine",
which = "users")
as.matrix(similarity_mat)

A matrix: 4 × 4 of type dbl

1 2 3 4

1 0.0000000 0.9760860 0.9641723 0.9914398

2 0.9760860 0.0000000 0.9925732 0.9374253

3 0.9641723 0.9925732 0.0000000 0.9888968

4 0.9914398 0.9374253 0.9888968 0.0000000

In [15]: image(as.matrix(similarity_mat), main = "User's Similarities")

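In the matrix above, each row and each column corresponds to one of the four selected users, and each cell holds the cosine similarity between the corresponding pair of users. Values close to 1 indicate users with similar rating patterns; the diagonal is displayed as 0 in this output.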
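
To make the similarity values concrete, here is a minimal base-R sketch of cosine similarity between two users. It assumes the similarity is computed only over movies that both users rated; recommenderlab's own handling of missing values may differ. The function cosine_sim and the vectors u1 and u2 are hypothetical.

```r
# Cosine similarity restricted to co-rated items (an assumption of this sketch)
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)        # items rated by both users
  if (!any(both)) return(NA_real_)     # no overlap: similarity undefined
  sum(a[both] * b[both]) /
    (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

# Two hypothetical users rating the same four movies (NA = not rated)
u1 <- c(4, 1.5, 4, NA)
u2 <- c(4, NA, 3, 5)

cosine_sim(u1, u2)
```

Because ratings are all positive, the angle between any two raw rating vectors is small, which is one reason the similarities in the output above are all close to 1.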

Next, we compute the same similarity measure between films:


In [16]: movie_similarity <- similarity(ratingMatrix[, 1:4],
                               method = "cosine",
                               which = "items")
as.matrix(movie_similarity)

image(as.matrix(movie_similarity), main = "Movies similarity")

A matrix: 4 × 4 of type dbl

1 2 3 4

1 0.0000000 0.9669732 0.9559341 0.9101276

2 0.9669732 0.0000000 0.9658757 0.9412416

3 0.9559341 0.9658757 0.0000000 0.9864877

4 0.9101276 0.9412416 0.9864877 0.0000000

Let us now extract the unique rating values:

In [17]: rating_values <- as.vector(ratingMatrix@data)


unique(rating_values)

0 · 5 · 4 · 3 · 4.5 · 1.5 · 2 · 3.5 · 1 · 2.5 · 0.5

Now, we will create a table that counts how often each rating value occurs.


In [18]: Table_of_Ratings <- table(rating_values) # count how often each rating value occurs


Table_of_Ratings

rating_values
      0     0.5       1     1.5       2     2.5       3     3.5       4     4.5       5
6791761    1198    3258    1567    7943    5484   21729   12237   28880    8187   14856
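
The count for 0 is far larger than the others because the matrix has 668 × 10325 = 6897100 cells, of which only 105339 hold a rating; the remaining 6791761 zeros are user-movie pairs without a rating. A small sketch, using a made-up vector in place of rating_values, shows how the zeros could be dropped before tabulating actual ratings:

```r
# Hypothetical stand-in for rating_values; 0 marks "no rating"
rating_values <- c(0, 0, 4, 1.5, 0, 4, 5, 0)

# Keep only real ratings, then count each distinct value
rated <- rating_values[rating_values != 0]
table(rated)
```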

Most Viewed Movies Visualization


In this section of the machine learning project, we explore the most viewed movies
in our dataset. We first count the number of views (that is, ratings) for each film and
then organize the counts in a table sorted in descending order.

In [19]: library(ggplot2)
movie_views <- colCounts(ratingMatrix) # count views for each movie
table_views <- data.frame(movie = names(movie_views),
views = movie_views) # create dataframe of views
table_views <- table_views[order(table_views$views,
decreasing = TRUE), ] # sort by number of
table_views$title <- NA
for (index in 1:10325){
table_views[index,3] <- as.character(subset(movie_data,
movie_data$movieId == table_views[
}
table_views[1:6,]

A data.frame: 6 × 3

     movie  views  title
     <chr>  <int>  <chr>
296  296    325    Pulp Fiction (1994)
356  356    311    Forrest Gump (1994)
318  318    308    Shawshank Redemption, The (1994)
480  480    294    Jurassic Park (1993)
593  593    290    Silence of the Lambs, The (1991)
260  260    273    Star Wars: Episode IV - A New Hope (1977)
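
The title lookup in cell 19 runs subset() once per movie inside a loop with a hard-coded length. As an alternative sketch, assuming base R only, match() can look up all titles in one step. The small data frames below reuse three rows from the output above and are stand-ins for the notebook's movie_data and table_views.

```r
# Stand-ins for the notebook objects (values taken from the output above)
movie_data <- data.frame(
  movieId = c(296L, 318L, 356L),
  title   = c("Pulp Fiction (1994)",
              "Shawshank Redemption, The (1994)",
              "Forrest Gump (1994)"),
  stringsAsFactors = FALSE
)
table_views <- data.frame(
  movie = c("318", "356", "296"),   # column names of the rating matrix are strings
  views = c(308L, 311L, 325L),
  stringsAsFactors = FALSE
)

# Sort by number of views, most viewed first
table_views <- table_views[order(table_views$views, decreasing = TRUE), ]

# match() returns, for each movie id, its row position in movie_data
idx <- match(as.integer(table_views$movie), movie_data$movieId)
table_views$title <- movie_data$title[idx]

table_views
```

A movie id that is missing from movie_data yields NA rather than an error, which makes gaps easy to spot.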

Now, we will visualize a bar plot for the total number of views of the top films. We will
carry this out using ggplot2.


In [20]: ggplot(table_views[1:6, ], aes(x = title, y = views)) +
  geom_bar(stat="identity", fill = 'steelblue') +
  geom_text(aes(label=views), vjust=-0.3, size=3.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Total Views of the Top Films")

From the above bar-plot, we observe that Pulp Fiction is the most-watched film
followed by Forrest Gump.

Heatmap of Movie Ratings


In [21]: image(ratingMatrix[1:20, 1:25], axes = FALSE, main = "Heatmap of the first

Performing Data Preparation


We will conduct data preparation in the following three steps:

Selecting useful data.
Normalizing data.
Binarizing the data.

In [23]: movie_ratings <- ratingMatrix[rowCounts(ratingMatrix) > 50,
                              colCounts(ratingMatrix) > 50]
movie_ratings

420 x 447 rating matrix of class ‘realRatingMatrix’ with 38341 ratings.

The output of ‘movie_ratings’ shows that keeping only users who rated more than 50
films and films rated by more than 50 users leaves 420 users and 447 films, down from
the previous 668 users and 10325 films. We can now visualize the most active users and
the most rated films (the top 2% of each) as follows:


In [24]: minimum_movies<- quantile(rowCounts(movie_ratings), 0.98)


minimum_users <- quantile(colCounts(movie_ratings), 0.98)
image(movie_ratings[rowCounts(movie_ratings) > minimum_movies,
colCounts(movie_ratings) > minimum_users],
main = "Heatmap of the top users and movies")

Now, we will visualize the distribution of the average ratings per user.


In [25]: average_ratings <- rowMeans(movie_ratings)


qplot(average_ratings, fill=I("steelblue"), col=I("red")) +
ggtitle("Distribution of the average rating per user")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Data Normalization

In [26]: normalized_ratings <- normalize(movie_ratings)
sum(rowMeans(normalized_ratings) > 0.00001)
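
By default, normalize() centers each user's ratings by subtracting that user's mean rating, so every row mean should end up at (numerically) zero, which is what the check in cell 26 tests. A minimal base-R sketch of row centering on a made-up 2 × 3 matrix, with NA standing for missing ratings:

```r
# Hypothetical ratings: 2 users x 3 movies, NA = not rated
ratings <- matrix(c(4, NA, 5,
                    2, 3, NA),
                  nrow = 2, byrow = TRUE)

# Mean rating per user, ignoring missing entries
user_means <- rowMeans(ratings, na.rm = TRUE)

# Subtracting a vector of length nrow() recycles down the columns,
# so row i is shifted by user_means[i]
centered <- ratings - user_means

centered
rowMeans(centered, na.rm = TRUE)   # each value should be (close to) 0
```

Centering removes the bias of users who rate everything high or everything low, so that similarities reflect relative preferences.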


In [27]: image(normalized_ratings[rowCounts(normalized_ratings) > minimum_movies,


colCounts(normalized_ratings) > minimum_users],
main = "Normalized Ratings of the Top Users")

Performing Data Binarization


In [28]: binary_minimum_movies <- quantile(rowCounts(movie_ratings), 0.95)


binary_minimum_users <- quantile(colCounts(movie_ratings), 0.95)
#movies_watched <- binarize(movie_ratings, minRating = 1)

good_rated_films <- binarize(movie_ratings, minRating = 3)
image(good_rated_films[rowCounts(movie_ratings) > binary_minimum_movies,
colCounts(movie_ratings) > binary_minimum_users],
main = "Heatmap of the top users and movies")
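
Binarization with minRating = 3 turns each rating into 1 if it is at least 3 and into 0 otherwise. The sketch below reproduces that thresholding in base R on a made-up matrix; treating missing ratings as 0 is an assumption of this sketch, and the names ratings, min_rating and good_rated are illustrative.

```r
# Hypothetical ratings: 2 users x 3 movies, NA = not rated
ratings <- matrix(c(4, NA, 2.5,
                    3, 5,  NA),
                  nrow = 2, byrow = TRUE)
min_rating <- 3

# TRUE where a rating exists and reaches the threshold; NA counts as FALSE
good_rated <- !is.na(ratings) & ratings >= min_rating

# Store as 0/1 integers instead of logicals
storage.mode(good_rated) <- "integer"

good_rated
```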

Collaborative Filtering System


In [30]: sampled_data<- sample(x = c(TRUE, FALSE),
size = nrow(movie_ratings),
replace = TRUE,
prob = c(0.8, 0.2))
training_data <- movie_ratings[sampled_data, ]
testing_data <- movie_ratings[!sampled_data, ]
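
Because sample() draws at random, the sizes of the training and testing sets change on every run; later in this notebook the split happens to be 330 training users and 90 testing users. A minimal sketch of making such a split reproducible with set.seed(), where the seed value 123 and the name n_users are arbitrary choices for illustration:

```r
set.seed(123)                      # arbitrary seed, fixes the random draw
n_users <- 420                     # number of rows in movie_ratings per the output above

# One TRUE/FALSE flag per user: TRUE (about 80%) goes to training
sampled_data <- sample(x = c(TRUE, FALSE),
                       size = n_users,
                       replace = TRUE,
                       prob = c(0.8, 0.2))

train_idx <- which(sampled_data)   # row indices for the training set
test_idx  <- which(!sampled_data)  # row indices for the testing set

c(train = length(train_idx), test = length(test_idx))
```

Every user lands in exactly one of the two sets, but the 80/20 ratio is only approximate since each user is assigned independently.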

Building the Recommendation System using R


In [31]: recommendation_system <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")


recommendation_system$IBCF_realRatingMatrix$parameters

$k
30
$method
'Cosine'
$normalize
'center'
$normalize_sim_matrix
FALSE
$alpha
0.5
$na_as_zero
FALSE

In [32]: recommen_model <- Recommender(data = training_data,


method = "IBCF",
parameter = list(k = 30))
recommen_model

Recommender of type ‘IBCF’ for ‘realRatingMatrix’


learned using 330 users.

In [33]: class(recommen_model)

'Recommender'

In [34]: model_info <- getModel(recommen_model)



class(model_info$sim)

'dgCMatrix'

In [35]: dim(model_info$sim)

447 · 447


In [36]: top_items <- 20


image(model_info$sim[1:top_items, 1:top_items],
main = "Heatmap of the first rows and columns")

In [37]: sum_rows <- rowSums(model_info$sim > 0)


table(sum_rows)

sum_rows
30
447
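
The table shows that all 447 rows of the similarity matrix contain exactly 30 positive entries, which matches k = 30: for each film, the model keeps only its k most similar films. The following base-R sketch illustrates that pruning step on a small random matrix. It is not recommenderlab's implementation; keep_top_k and the toy matrix are hypothetical.

```r
# Keep the k largest similarities in each row and zero out the rest
keep_top_k <- function(sim, k) {
  diag(sim) <- 0                                   # ignore self-similarity
  for (i in seq_len(nrow(sim))) {
    top <- order(sim[i, ], decreasing = TRUE)[seq_len(k)]
    sim[i, -top] <- 0                              # drop everything outside the top k
  }
  sim
}

# Toy symmetric similarity matrix for 6 items with positive entries
set.seed(1)
s <- matrix(runif(36), nrow = 6)
s <- (s + t(s)) / 2

pruned <- keep_top_k(s, k = 2)
table(rowSums(pruned > 0))
```

Pruning is done row by row, so the result is generally no longer symmetric: item A can be among the top k neighbours of B without B being among those of A.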

How to build Recommender System on dataset using R?

In [39]: top_recommendations <- 10 # the number of items to recommend to each user
predicted_recommendations <- predict(object = recommen_model,
newdata = testing_data,
n = top_recommendations)
predicted_recommendations

Recommendations as ‘topNList’ with n = 10 for 90 users.


In [40]: user1 <- predicted_recommendations@items[[1]] # recommendation for the first user


movies_user1 <- predicted_recommendations@itemLabels[user1]
movies_user2 <- movies_user1
for (index in 1:10){
movies_user2[index] <- as.character(subset(movie_data,
movie_data$movieId == movies_user1
}
movies_user2

'Get Shorty (1995)' · 'Casper (1995)' · 'Ed Wood (1994)' · 'Quiz Show (1994)' ·
'Santa Clause, The (1994)' · 'What\'s Eating Gilbert Grape (1993)' · 'Dave (1993)' ·
'In the Line of Fire (1993)' · 'Beauty and the Beast (1991)' · 'Kingpin (1996)'

In [41]: recommendation_matrix <- sapply(predicted_recommendations@items,
                                function(x){ as.integer(colnames(movie_ratings)[x]) })
#dim(recc_matrix)
recommendation_matrix[,1:4]

A matrix: 10 × 4 of type int

21 48516 1 913

158 474 16 3147

235 3147 17 55820

300 50 21 68157

317 1094 25 2997

337 60069 36 1285

440 3703 62 2395

474 150 110 32587

595 594 111 3578

785 2542 112 1266
