SDM - Task B - Group 1G - Movies
Akash Gupta, B S S Pramod, Raksha Shetty, Rishabh Agrawal, Sanya Sharma, Vipul Bhatia
14/09/2019
The success of any movie depends on many factors. The cast and crew of some movies aim for
commercial success, while those of other movies aim for critical success (i.e., higher
ratings from movie critics and on websites such as IMDb and Rotten Tomatoes).
In this project, we check whether the following factors affect the IMDb ratings of
movies: duration, Facebook likes of the cast, directors and movies, number of reviews on
IMDb, etc.
We use statistical tools (boxplots, histograms, correlation, multiple regression, etc.)
to determine not only whether these factors affect the IMDb score, but also how strongly
they affect it.
Setting the working directory
setwd("D:/SDM/R")
Then, we converted a few variables that were stored as factors into integers.
# Caution: as.integer() on a factor returns the level codes, not the labels.
M$Gross <- as.integer(M$Gross)
M$Budget <- as.integer(M$Budget)
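The behaviour of this conversion can be checked on a toy factor. The sketch below is only an illustration: `gross_f` and its three values are hypothetical and are not taken from the movie data. It compares the level codes returned by `as.integer()` with the values recovered by going through `as.character()` first.

```r
# Hypothetical factor whose labels are numbers (not the real Gross column).
gross_f <- factor(c("300", "10", "25"))

# Direct conversion gives the position of each label among the sorted levels.
codes <- as.integer(gross_f)

# Converting to character first recovers the numbers the labels spell out.
values <- as.numeric(as.character(gross_f))

print(levels(gross_f))
print(codes)
print(values)
```

If the factor labels in `M` are numeric strings, the second form may be the one that matches the intended values; which one applies depends on how the data were read in.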
We inspect the structure of the data to confirm that all variables are in the required
formats.
str(M)
dim(M)
## [1] 3000 14
We summarise the entire data to get a rough idea of how variables are spread over a range
of values.
summary(M)
We represent the same data visually with boxplots, to check the skewness of the variables.
par(mfrow = c(1, 1))
boxplot(M$Duration.In.Min., xlab = "Duration in Mins", ylab = "",
        horizontal = TRUE, col = "yellow")
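Skewness read off a boxplot can also be checked numerically. The following is a minimal sketch on a made-up sample: the vector `x` and the helper `skewness()` are assumptions introduced for illustration, not part of the original analysis. It uses the moment-based skewness coefficient and `boxplot.stats()`, which returns the values a boxplot draws.

```r
# Hypothetical right-skewed sample (not taken from the movie data).
x <- c(90, 95, 100, 100, 105, 110, 120, 180, 240)

# Moment-based skewness: mean cubed deviation divided by the
# 1.5th power of the mean squared deviation.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

stats <- boxplot.stats(x)    # hinges, whisker ends and outliers of the boxplot

print(skewness(x))           # positive for a right-skewed sample
print(mean(x) > median(x))   # the mean is pulled towards the long tail
print(stats$out)             # points beyond the whiskers
```

A long right whisker with outliers on the right, a positive coefficient, and a mean above the median all point to the same right skew.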
##                           Duration.In.Min.   Gross
## Duration.In.Min.                    1.0000  0.0103
## Gross                               0.0103  1.0000
## Cast_Total_Facebook_Likes           0.0830  0.0448
## num_user_for_reviews                0.2105  0.0274
## Budget                             -0.0241  0.0403
## imdb_score                          0.2412  0.0293
## movie_facebook_likes                0.1831  0.0144
## Age                                -0.0577 -0.0354
##                           Cast_Total_Facebook_Likes num_user_for_reviews
## Duration.In.Min.                             0.0830               0.2105
## Gross                                        0.0448               0.0274
## Cast_Total_Facebook_Likes                    1.0000               0.1842
## num_user_for_reviews                         0.1842               1.0000
## Budget                                       0.0147               0.0038
## imdb_score                                   0.1174               0.3339
## movie_facebook_likes                         0.2033               0.3551
## Age                                         -0.1017               0.0592
##                            Budget imdb_score movie_facebook_likes     Age
## Duration.In.Min.          -0.0241     0.2412               0.1831 -0.0577
## Gross                      0.0403     0.0293               0.0144 -0.0354
## Cast_Total_Facebook_Likes  0.0147     0.1174               0.2033 -0.1017
## num_user_for_reviews       0.0038     0.3339               0.3551  0.0592
## Budget                     1.0000    -0.0007              -0.0432  0.0550
## imdb_score                -0.0007     1.0000               0.3091 -0.0420
## movie_facebook_likes      -0.0432     0.3091               1.0000 -0.4472
## Age                        0.0550    -0.0420              -0.4472  1.0000
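A correlation table of this shape can be produced with `cor()` on the numeric columns of a data frame. The sketch below uses a small simulated data frame: `toy`, its column names and the seed are assumptions for illustration, and this is not necessarily the call that generated the table above.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for a few numeric columns of the movie data.
toy <- data.frame(duration = rnorm(50, mean = 110, sd = 20))
toy$reviews <- 2 * toy$duration + rnorm(50, mean = 0, sd = 30)
toy$score <- 0.01 * toy$duration + rnorm(50, mean = 6, sd = 1)

# Pairwise Pearson correlations, rounded to four decimals.
cm <- round(cor(toy), 4)
print(cm)
```

The result is symmetric with ones on the diagonal, so each pair needs to be read only once. With missing values in the data, `cor()` accepts `use = "complete.obs"` to drop incomplete rows.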
library(corrgram)  # provides corrgram() and the panel.* functions
corrgram(M[, c(2, 3, 5, 6, 10, 12, 13, 14)],
         order = FALSE,
         lower.panel = panel.pie,
         upper.panel = panel.cor,
         text.panel = panel.txt,
         main = "Corrgram of all Test variables")
##
## Call:
## lm(formula = imdb_score ~ Duration.In.Min. + Gross + Cast_Total_Facebook_Likes +
##     num_user_for_reviews + Budget + movie_facebook_likes + Age,
##     data = M)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.2287 -0.5334  0.1142  0.6686  2.2141
##
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)               5.194e+00  9.964e-02  52.133  < 2e-16 ***
## Duration.In.Min.          6.410e-03  6.993e-04   9.166  < 2e-16 ***
## Gross                     2.374e-05  2.051e-05   1.158  0.24711
## Cast_Total_Facebook_Likes 1.270e-06  8.917e-07   1.424  0.15459
## num_user_for_reviews      5.194e-04  4.588e-05  11.322  < 2e-16 ***
## Budget                    9.739e-05  2.091e-04   0.466  0.64138
## movie_facebook_likes      1.003e-05  9.167e-07  10.942  < 2e-16 ***
## Age                       1.132e-02  3.723e-03   3.042  0.00237 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9446 on 2992 degrees of freedom
## Multiple R-squared:  0.1795, Adjusted R-squared:  0.1776
## F-statistic: 93.52 on 7 and 2992 DF,  p-value: < 2.2e-16
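The figures in this summary are linked by standard formulas, which gives a quick way to read the table. The sketch below recomputes a few of them from the printed values only; the variable names are introduced here for illustration, and small differences from the printed output are expected because the inputs are rounded.

```r
# Values copied from the regression summary above.
n  <- 3000     # observations: 2992 residual df + 8 estimated coefficients
p  <- 7        # number of predictors
r2 <- 0.1795   # Multiple R-squared

# t value = Estimate / Std. Error (Duration.In.Min. row).
t_duration <- 6.410e-03 / 6.993e-04               # about 9.166

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)    # about 0.1776

# Overall F-statistic expressed through R-squared.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))     # about 93.5

# Two-sided p-value for the Age coefficient from its t value.
p_age <- 2 * pt(-abs(3.042), df = n - p - 1)      # about 0.0024

print(c(t_duration = t_duration, adj_r2 = adj_r2,
        f_stat = f_stat, p_age = p_age))
```

An R-squared of 0.1795 means the seven predictors together account for roughly 18% of the variance in the IMDb score.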
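2) Based on the p-values in the summary above, we can reject the null hypothesis (a
coefficient of zero) at the 5% level for Duration, number of user reviews, Facebook likes
of the movie and Age. We cannot reject it for Gross, Facebook likes of the cast or Budget.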