SDM - Task B - Group 1G - Movies

The document discusses a project analyzing factors that may affect the IMDb ratings of movies. These factors include duration, Facebook likes of the cast, directors, and movies, number of reviews, and others. Statistical tools such as boxplots, histograms, correlation, and multiple regression were used to determine whether and how strongly the factors affect IMDb scores. A linear regression model was fitted with duration, gross, cast Facebook likes, number of user reviews, budget, movie Facebook likes, and age as predictors of IMDb score.

SDM_Task B_Group 1G_Movies

Akash Gupta, B S S Pramod, Raksha Shetty, Rishabh Agrawal, Sanya Sharma, Vipul Bhatia

14/09/2019

The success of any movie depends on many factors. The cast and crew of some movies aim for
commercial success, while the cast and crew of others aim for critical success (i.e., higher
ratings from movie critics and on websites such as IMDb and Rotten Tomatoes).
In this project, we check whether some of the following factors affect the
IMDb ratings of movies: duration, Facebook likes of the cast, directors, and movies, number
of reviews on IMDb, etc.
We use some statistical tools (boxplots, histograms, correlation, multiple
regression, etc.) to determine not only whether the factors affect the IMDb score, but also
how strongly they do so.
Setting the working directory
setwd("D:/SDM/R")

First, we read the CSV data file into R.
M <- read.csv("movie.csv")

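Then, we converted a few values that were stored as factors into integers. Note that
as.integer() applied to a factor returns its internal level codes rather than the numbers
printed in the column, which appears to be why Gross ranges only from 1 to 2928 and Budget
only from 1 to 297 in the output below.
M$Gross<-as.integer(M$Gross)
M$Budget<-as.integer(M$Budget)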
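
The behaviour of as.integer() on a factor can be checked on a small example. The following is a minimal sketch on a made-up factor; the helper name factor_to_int and the toy values are assumptions for illustration and do not come from movie.csv.

```r
# Hypothetical helper: convert a factor whose labels are digits into integers
# by going through the character labels first.
factor_to_int <- function(f) {
  as.integer(as.character(f))
}

gross <- factor(c("2500", "130", "98000"))  # toy values, not from movie.csv

codes  <- as.integer(gross)    # internal level codes (position of each label among the sorted levels)
values <- factor_to_int(gross) # the numbers that the labels spell out

print(codes)
print(values)
```

If the labels in the real columns contain separators or currency symbols, they would need to be cleaned before this conversion; that is not shown here because the raw file is not available.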

We view the data.


View(M)

We inspect the structure of the data to check that all variables are in the required
formats.
str(M)

## 'data.frame':    3000 obs. of  14 variables:
##  $ Movie                    : Factor w/ 2927 levels "[Rec] 2 ","10 Cloverfield Lane ",..: 224 1618 1935 2189 1141 1939 225 933 258 2012 ...
##  $ Duration.In.Min.         : int  178 100 148 100 132 156 141 153 183 169 ...
##  $ Gross                    : int  2684 1719 1215 2040 2573 1752 2048 1700 1744 1214 ...
##  $ Genre                    : Factor w/ 20 levels "Action","Adventure",..: 13 8 20 15 14 17 18 17 19 16 ...
##  $ Cast_Total_Facebook_Likes: int  4834 48350 11700 106759 1873 46055 92000 58753 24450 29991 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 738 1902 1117 973 3018 2367 ...
##  $ Language                 : Factor w/ 8 levels "Chinese","English",..: 1 2 4 2 3 8 3 4 7 7 ...
##  $ Country                  : Factor w/ 10 levels "Australia","China",..: 2 9 4 9 3 8 3 4 7 7 ...
##  $ content_rating           : Factor w/ 8 levels "A","G","NC-17",..: 6 6 6 6 6 6 6 5 6 6 ...
##  $ Budget                   : int  137 168 139 141 145 143 141 141 141 127 ...
##  $ Year                     : int  2009 2007 2015 2012 2012 2007 2015 2009 2016 2006 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 6.6 6.2 7.5 7.5 6.9 6.1 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 24000 0 118000 10000 197000 0 ...
##  $ Age                      : int  10 12 4 7 7 12 4 10 3 13 ...

We check the number of rows and columns of the data.


dim(M)

## [1] 3000 14

We summarise the entire data to get a rough idea of how variables are spread over a range
of values.
summary(M)

##                           Movie      Duration.In.Min.     Gross
##  Pan                         :   3   Min.   : 45.0    Min.   :   1.0
##  The Fast and the Furious    :   3   1st Qu.: 94.0    1st Qu.: 736.8
##  Victor Frankenstein         :   3   Median :104.0    Median :1464.5
##  Alice in Wonderland         :   2   Mean   :108.7    Mean   :1462.5
##  Aloha                       :   2   3rd Qu.:118.0    3rd Qu.:2191.2
##  Around the World in 80 Days :   2   Max.   :300.0    Max.   :2928.0
##  (Other)                     :2985
##        Genre      Cast_Total_Facebook_Likes num_user_for_reviews
##  Biography: 174   Min.   :     0            Min.   :   1.0
##  War      : 170   1st Qu.:  2113            1st Qu.: 114.0
##  Musical  : 167   Median :  4614            Median : 215.0
##  Animation: 164   Mean   : 12260            Mean   : 351.8
##  Crime    : 162   3rd Qu.: 17152            3rd Qu.: 420.0
##  Thriller : 159   Max.   :656730            Max.   :5060.0
##  (Other)  :2004
##      Language        Country      content_rating      Budget
##  English :907   China    : 308   R        :1310   Min.   :  1.00
##  Chinese :308   USA      : 308   PG-13    :1178   1st Qu.: 80.75
##  Japanese:307   Australia: 307   PG       : 406   Median :149.00
##  Hindi   :305   Japan    : 307   G        :  66   Mean   :142.43
##  German  :304   India    : 305   Not Rated:  22   3rd Qu.:213.00
##  Russian :294   Germany  : 304   Unrated  :  15   Max.   :297.00
##  (Other) :575   (Other)  :1161   (Other)  :   3
##       Year        imdb_score    movie_facebook_likes      Age
##  Min.   :1996   Min.   :1.600   Min.   :     0       Min.   : 3.00
##  1st Qu.:2002   1st Qu.:5.800   1st Qu.:     0       1st Qu.: 8.00
##  Median :2006   Median :6.500   Median :   317       Median :13.00
##  Mean   :2006   Mean   :6.389   Mean   : 10797       Mean   :12.56
##  3rd Qu.:2011   3rd Qu.:7.100   3rd Qu.: 13000       3rd Qu.:17.00
##  Max.   :2016   Max.   :9.000   Max.   :349000       Max.   :23.00
##

We represent the same data visually with boxplots, to check the skewness of the variables.
par(mfrow=c(1,1))
boxplot(M$Duration.In.Min., xlab="Duration in Mins", ylab="",
        horizontal=TRUE, col=c("yellow"))
boxplot(M$Gross, xlab="Gross", ylab="",
        horizontal=TRUE, col=c("Green"))
boxplot(M$Cast_Total_Facebook_Likes, xlab="Cast facebook likes", ylab="",
        horizontal=TRUE, col=c("yellow"))
boxplot(M$num_user_for_reviews, xlab="Reviews", ylab="",
        horizontal=TRUE, col=c("red"))
boxplot(M$Budget, xlab="Budget", ylab="",
        horizontal=TRUE, col=c("brown"))
boxplot(M$movie_facebook_likes, xlab="Movie FB Like", ylab="",
        horizontal=TRUE, col=c("magenta"))
boxplot(M$Age, xlab="Movie Age", ylab="",
        horizontal=TRUE, col=c("orange"))
boxplot(M$imdb_score, xlab="IMDB Rating", ylab="",
        horizontal=TRUE, col=c("blue"))
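
Skewness read from a boxplot can also be put into a number. The sketch below uses the moment-based sample skewness on simulated data; the function name sample_skewness and the simulated vectors are assumptions for illustration and are not part of the original analysis.

```r
# Moment-based sample skewness: mean cubed deviation divided by the
# 1.5th power of the mean squared deviation. Values near 0 suggest symmetry,
# positive values a longer right tail, negative values a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / (mean(d^2))^1.5
}

set.seed(1)
symmetric    <- rnorm(3000)  # simulated, roughly symmetric
right_skewed <- rexp(3000)   # simulated, long right tail

skew_symmetric <- sample_skewness(symmetric)
skew_right     <- sample_skewness(right_skewed)

print(skew_symmetric)
print(skew_right)
```

Applied to the report's data, the same function could be called on each numeric column, for example sample_skewness(M$movie_facebook_likes).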
Coming to the model, we first check whether the dependent variable (IMDb rating) is
approximately normally distributed by plotting a histogram.
hist(M$imdb_score)
The dependent variable looks approximately normal in the histogram, so we go ahead
with multiple linear regression.
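
A histogram gives only a visual impression. As a hedged complement, a normal Q-Q plot and the Shapiro-Wilk test from base R can be used; in linear regression the normality assumption formally concerns the residuals, so the same checks can later be applied to them. The sketch below runs on simulated scores, because the movie data is not available here; the vector scores and its parameters are assumptions.

```r
set.seed(42)
# Simulated stand-in for M$imdb_score (assumed mean and spread, illustration only)
scores <- rnorm(3000, mean = 6.4, sd = 1.1)

# Points close to the reference line suggest approximate normality
qqnorm(scores)
qqline(scores, col = "red")

# Shapiro-Wilk test; shapiro.test() accepts between 3 and 5000 observations
sw <- shapiro.test(scores)
print(sw$p.value)
```

With thousands of observations the test tends to flag even small departures from normality, so the Q-Q plot is usually the more informative of the two.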
Next, we check how strong or weak the associations between the numeric variables are by
computing a correlation matrix.
round(cor(M[, c(2, 3, 5, 6, 10, 12, 13, 14)]), digits = 4)

##                           Duration.In.Min.   Gross
## Duration.In.Min.                    1.0000  0.0103
## Gross                               0.0103  1.0000
## Cast_Total_Facebook_Likes           0.0830  0.0448
## num_user_for_reviews                0.2105  0.0274
## Budget                             -0.0241  0.0403
## imdb_score                          0.2412  0.0293
## movie_facebook_likes                0.1831  0.0144
## Age                                -0.0577 -0.0354
##                           Cast_Total_Facebook_Likes num_user_for_reviews
## Duration.In.Min.                             0.0830               0.2105
## Gross                                        0.0448               0.0274
## Cast_Total_Facebook_Likes                    1.0000               0.1842
## num_user_for_reviews                         0.1842               1.0000
## Budget                                       0.0147               0.0038
## imdb_score                                   0.1174               0.3339
## movie_facebook_likes                         0.2033               0.3551
## Age                                         -0.1017               0.0592
##                            Budget imdb_score movie_facebook_likes     Age
## Duration.In.Min.          -0.0241     0.2412               0.1831 -0.0577
## Gross                      0.0403     0.0293               0.0144 -0.0354
## Cast_Total_Facebook_Likes  0.0147     0.1174               0.2033 -0.1017
## num_user_for_reviews       0.0038     0.3339               0.3551  0.0592
## Budget                     1.0000    -0.0007              -0.0432  0.0550
## imdb_score                -0.0007     1.0000               0.3091 -0.0420
## movie_facebook_likes      -0.0432     0.3091               1.0000 -0.4472
## Age                        0.0550    -0.0420              -0.4472  1.0000
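
The matrix reports the size of each correlation but not whether it differs reliably from zero. A minimal sketch of testing a single pair with cor.test() from base R follows; the simulated reviews and score vectors and their parameters are assumptions chosen only to mimic a moderate positive correlation.

```r
set.seed(7)
n <- 3000
reviews <- rexp(n, rate = 1 / 350)          # simulated review counts
score   <- 6 + 0.001 * reviews + rnorm(n)   # simulated ratings with a weak dependence

# Pearson correlation test (the default method of cor.test)
ct <- cor.test(reviews, score)

print(ct$estimate)  # sample correlation
print(ct$p.value)   # p-value for the null hypothesis of zero correlation
print(ct$conf.int)  # 95% confidence interval for the correlation
```

On the report's data the call would be, for example, cor.test(M$num_user_for_reviews, M$imdb_score).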

Also, to visually represent correlation values, we plot a corrgram.


library(corrgram)

## Registered S3 method overwritten by 'seriation':
##   method         from
##   reorder.hclust gclus

corrgram(M[,c(2,3,5,6,10,12,13,14)],
         order=FALSE,
         lower.panel=panel.pie,
         upper.panel=panel.cor,
         text.panel=panel.txt,
         main="Corrgram of all Test variables")

Now, we fit a multiple linear regression model with all seven numeric predictors.
M_imdb <- lm(imdb_score ~ Duration.In.Min. + Gross + Cast_Total_Facebook_Likes +
               num_user_for_reviews + Budget + movie_facebook_likes + Age,
             data = M)
summary(M_imdb)

##
## Call:
## lm(formula = imdb_score ~ Duration.In.Min. + Gross + Cast_Total_Facebook_Likes +
##     num_user_for_reviews + Budget + movie_facebook_likes + Age,
##     data = M)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.2287 -0.5334  0.1142  0.6686  2.2141
##
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)               5.194e+00  9.964e-02  52.133  < 2e-16 ***
## Duration.In.Min.          6.410e-03  6.993e-04   9.166  < 2e-16 ***
## Gross                     2.374e-05  2.051e-05   1.158  0.24711
## Cast_Total_Facebook_Likes 1.270e-06  8.917e-07   1.424  0.15459
## num_user_for_reviews      5.194e-04  4.588e-05  11.322  < 2e-16 ***
## Budget                    9.739e-05  2.091e-04   0.466  0.64138
## movie_facebook_likes      1.003e-05  9.167e-07  10.942  < 2e-16 ***
## Age                       1.132e-02  3.723e-03   3.042  0.00237 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9446 on 2992 degrees of freedom
## Multiple R-squared:  0.1795, Adjusted R-squared:  0.1776
## F-statistic: 93.52 on 7 and 2992 DF,  p-value: < 2.2e-16

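Looking at the regression model, we can infer a few points.

1) Adjusted R-squared = 0.1776, so the model explains only about 18% of the variation in
IMDb score, which indicates that the model is weak.

2) At the 5% significance level, we can reject the null hypothesis (coefficient equal to
zero) for Duration, number of user reviews, movie Facebook likes, and Age, whose p-values
are below 0.05. We cannot reject it for Gross (p = 0.247), Facebook likes of the cast
(p = 0.155), or Budget (p = 0.641).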
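
Reading significance from the coefficient table can also be done programmatically. The following is a minimal sketch on a simulated data set; the data frame toy, its columns, and the 0.05 threshold are assumptions for illustration, and the same steps would apply to a fitted model such as M_imdb.

```r
set.seed(123)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 0.5 * toy$x1 + rnorm(n)   # by construction, x2 has no true effect on y

fit <- lm(y ~ x1 + x2, data = toy)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

# Names of the terms whose p-value is below the chosen threshold
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
print(significant)

# 95% confidence intervals; an interval that excludes 0 matches p < 0.05
ci <- confint(fit)
print(ci)
```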
