3 An Illustrative Analysis: 3.1 Gathering Data

The document describes performing a movie cluster analysis on actor Diego Luna's filmography. It involves gathering data on Luna's movies from Rotten Tomatoes and budget data from another site. The data is combined and visualized in a plot of rating vs domestic gross. A k-means clustering algorithm is used to group the movies into clusters, which largely separates out the Star Wars film as its own cluster and divides the rest into low and high rated movies. The document proposes abstracting these steps into a reusable function that could analyze any actor's filmography.

Uploaded by

Muzamil Ali

3 An Illustrative Analysis
FiveThirtyEight (http://fivethirtyeight.com) has a clever series of articles on the
types of movies different actors make in their careers:
https://fivethirtyeight.com/tag/hollywood-taxonomy/

I’d like to do a similar analysis. Let’s do this in order:

1. Let’s do this analysis for Diego Luna
2. Let’s use a clustering algorithm to determine the different types of movies he
makes
3. Then, let’s write an application that performs this analysis for any actor and
test it with Gael García Bernal
4. Let’s make the application interactive so that a user can change the actor and
the number of movie clusters the method learns.

For now, we will go step by step through this analysis without showing how we
perform this analysis using R. As the course progresses, we will learn how to carry
out these steps.

3.1 Gathering data
3.1.1 Movie ratings
For this analysis we need to get the movies Diego Luna was in, along with their
Rotten Tomatoes ratings. For that we scrape this webpage:
https://www.rottentomatoes.com/celebrity/diego_luna

Once we scrape the data from the Rotten Tomatoes website and clean it up, this is
part of what we have so far:

Rating  Title                         Credit                 BoxOffice  Year
11      Berlin, I Love You            Drag Queen             —          2019
95      If Beale Street Could Talk    Pedrocito              —          2019
60      A Rainy Day in New York       Actor                  —          2019
4       Flatliners                    Ray                    $16.9M     2017
83      Rogue One: A Star Wars Story  Captain Cassian Andor  $532.2M    2016
88      Blood Father                  Jonah                  —          2016
82      The Book of Life              Manolo                 —          2014

This data includes, for each of the movies Diego Luna has acted in, the Rotten
Tomatoes rating, the movie title, Diego Luna’s role in the movie, the U.S. domestic
gross and the year of release.

3.1.2 Movie budgets and revenue

For the movie budgets and revenue data we scrape this webpage:
http://www.the-numbers.com/movie/budgets/all

(Note 01.2018: after the initial version of this analysis, this website added pagination
to this URL. We will be using the CSV file scraped originally in Summer 2017 for this
analysis and leave the issue of dealing with pagination as an exercise.)

## Parsed with column specification:
## cols(
## release_date = col_date(format = ""),
## movie = col_character(),
## production_budget = col_double(),
## domestic_gross = col_double(),
## worldwide_gross = col_double()
## )
This is part of what we have for that table after loading and cleaning up:

release_date  movie                                     production_budget  domestic_gross  worldwide_gross
2009-12-18    Avatar                                    425                760.50762       2783.9190
2015-12-18    Star Wars Ep. VII: The Force Awakens      306                936.66223       2058.6622
2007-05-24    Pirates of the Caribbean: At World’s End  300                309.42043       963.4204
2015-11-06    Spectre                                   300                200.07417       879.6209
2012-07-20    The Dark Knight Rises                     275                448.13910       1084.4391
2013-07-02    The Lone Ranger                           275                89.30212        260.0021
2012-03-09    John Carter                               275                73.05868        282.7781
2010-11-24    Tangled                                   260                200.82194       586.5819
2007-05-04    Spider-Man 3                              258                336.53030       890.8753
2015-05-01    Avengers: Age of Ultron                   250                459.00587       1404.7059

This data covers 5358 movies and includes each movie’s release date, title,
production budget and gross revenue (domestic and worldwide). The budget and gross
columns are in millions of U.S. dollars.
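
As a rough sketch of the loading step, the snippet below reads three rows of the budget table from inline text using base R. The original analysis reads a CSV file whose name is not given here, so the inline `csv_text` stands in for it; the column types mirror the column specification printed above.

```r
# Three rows of the budget table, inlined so the example is self-contained.
# (Assumption: the real analysis reads a CSV file with the same columns.)
csv_text <- "release_date,movie,production_budget,domestic_gross,worldwide_gross
2009-12-18,Avatar,425,760.50762,2783.9190
2015-11-06,Spectre,300,200.07417,879.6209
2010-11-24,Tangled,260,200.82194,586.5819"

# Declare the column types up front: one date, one string, three numbers.
budgets <- read.csv(
  text = csv_text,
  colClasses = c("Date", "character", "numeric", "numeric", "numeric")
)

str(budgets)
```

Declaring the types explicitly makes a malformed date or number fail loudly at load time instead of silently becoming text.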

One thing we might want to check is whether the budget and gross entries in this
table are inflation adjusted. To do this, we can plot domestic gross, which we use
in the subsequent analyses, against release date.

Although we don’t know for sure, since the source of our data does not state this
specifically, it looks like the domestic gross measurement is not inflation adjusted,
since gross increases over time.
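
One way such a check might be carried out is sketched below in base R. It uses only five rows of the budget table, so it illustrates the mechanics rather than the trend itself; the per-year median is an added summary, not something the original analysis is known to compute.

```r
# Five rows of the budget table (gross in millions of U.S. dollars).
budgets <- data.frame(
  release_date = as.Date(c("2009-12-18", "2015-12-18", "2007-05-24",
                           "2015-11-06", "2012-07-20")),
  domestic_gross = c(760.50762, 936.66223, 309.42043, 200.07417, 448.13910)
)

# Scatter plot of domestic gross against release date.
plot(budgets$release_date, budgets$domestic_gross,
     xlab = "Release date", ylab = "Domestic gross (millions USD)")

# Median domestic gross per release year, as a numeric companion to the plot.
budgets$year <- as.integer(format(budgets$release_date, "%Y"))
yearly <- aggregate(domestic_gross ~ year, data = budgets, FUN = median)
yearly
```

On the full table, a steady rise in the yearly median would be consistent with values that are not inflation adjusted, although it would not prove it.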
3.2 Manipulating the data
Next, we combine the datasets we obtained to get closer to the data we need to make
the plot we want.

We combine the two datasets using the movie title, so that the end result has the
information in both tables for each movie.

Rating  Title                         Credit                 BoxOffice  Year  release_date  productio…
4       Flatliners                    Ray                    $16.9M     2017  1990-08-10    …
83      Rogue One: A Star Wars Story  Captain Cassian Andor  $532.2M    2016  2016-12-16    …
82      The Book of Life              Manolo                 —          2014  2014-10-17    …
65      Elysium                       Julio                  $90.9M     2013  2013-08-09    …
52      Contraband                    Gonzalo                $66.5M     2012  2012-01-13    …
93      Milk                          Jack Lira              $31.8M     2008  2008-11-26    …
69      Criminal                      Rodrigo                $0.8M      2004  2016-04-15    …
61      The Terminal                  Enrique Cruz           $77.1M     2004  2004-06-18    …
79      Open Range                    Button                 $58.3M     2003  2003-08-15    …
75      Frida                         Alejandro Gomez        $25.7M     2002  2002-10-25    …
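
A join on movie title can be sketched with base R's `merge()`, as below. The two small data frames reuse values from the tables in this chapter; the `year_mismatch` column is an illustrative addition that flags rows where the Rotten Tomatoes year and the release date disagree, which can happen when two different films share a title.

```r
# A few rows from each source, using values shown in this chapter.
ratings <- data.frame(
  Title = c("Flatliners", "Rogue One: A Star Wars Story", "Blood Father"),
  Rating = c(4, 83, 88),
  Year = c(2017, 2016, 2016),
  stringsAsFactors = FALSE
)
budgets <- data.frame(
  movie = c("Flatliners", "Rogue One: A Star Wars Story", "Avatar"),
  release_date = as.Date(c("1990-08-10", "2016-12-16", "2009-12-18")),
  domestic_gross = c(61.30815, 532.17732, 760.50762),
  stringsAsFactors = FALSE
)

# Inner join on title: only movies present in both tables are kept.
joined <- merge(ratings, budgets, by.x = "Title", by.y = "movie")

# Hypothetical sanity check: does the release year agree with the ratings year?
joined$year_mismatch <-
  joined$Year != as.integer(format(joined$release_date, "%Y"))
joined
```

Movies without a budget entry, such as Blood Father here, drop out of an inner join. The Flatliners row is flagged because the 2017 rating is paired with a 1990 release date.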

3.3 Visualizing the data

Now that we have the data we need, we can make a plot:
Figure 3.1: Ratings and U.S. Domestic Gross of Diego Luna’s movies.

We see that there is one clear outlier in Diego Luna’s movies, which is probably the
one Star Wars movie he acted in. The remaining movies could potentially be grouped
into two types: those with higher ratings and those with lower ratings.
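
A plot of this kind could be drawn with base R roughly as follows. The ten movies and their values are taken from the clustering table in this chapter; placing rating on the horizontal axis is an assumption, since the axes of Figure 3.1 are not described.

```r
# Rating and domestic gross (millions USD) for ten of Diego Luna's movies.
movies <- data.frame(
  title = c("Flatliners", "Elysium", "Contraband", "The Terminal",
            "Rogue One: A Star Wars Story", "The Book of Life", "Milk",
            "Criminal", "Open Range", "Frida"),
  rating = c(4, 65, 52, 61, 83, 82, 93, 69, 79, 75),
  domestic_gross = c(61.30815, 93.05012, 66.52800, 77.07396, 532.17732,
                     50.15154, 31.84130, 14.70870, 58.33125, 25.88500),
  stringsAsFactors = FALSE
)

# One point per movie; the outlier stands apart on the vertical axis.
plot(movies$rating, movies$domestic_gross, pch = 19,
     xlab = "Rotten Tomatoes rating",
     ylab = "Domestic gross (millions USD)")

# Name the movie with the largest domestic gross.
movies$title[which.max(movies$domestic_gross)]
```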

3.4 Modeling data
We can use a clustering algorithm to partition Diego Luna’s movies. We can use the
data we obtained so far and see if the k-means clustering algorithm partitions these
movies into three sensible groups using the movie’s rating and domestic gross.
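
A minimal sketch of this step with base R's `kmeans()` is shown below, again on the ten movies listed in this chapter. Using the raw, unscaled rating and gross as features, and the `set.seed()` and `nstart` settings, are assumptions; the original analysis does not state its settings.

```r
movies <- data.frame(
  title = c("Flatliners", "Elysium", "Contraband", "The Terminal",
            "Rogue One: A Star Wars Story", "The Book of Life", "Milk",
            "Criminal", "Open Range", "Frida"),
  rating = c(4, 65, 52, 61, 83, 82, 93, 69, 79, 75),
  domestic_gross = c(61.30815, 93.05012, 66.52800, 77.07396, 532.17732,
                     50.15154, 31.84130, 14.70870, 58.33125, 25.88500),
  stringsAsFactors = FALSE
)

# k-means starts from random centers, so fix the seed for repeatability
# and try several random starts, keeping the best one.
set.seed(1)
features <- as.matrix(movies[, c("rating", "domestic_gross")])
fit <- kmeans(features, centers = 3, nstart = 25)

# Attach the cluster label to each movie; label numbers are arbitrary.
movies$cluster <- fit$cluster
movies[order(movies$cluster), c("title", "rating", "domestic_gross", "cluster")]

# Each cluster is summarized by its average rating and average gross.
fit$centers
```

Because the features are not rescaled, differences in gross (tens to hundreds of millions) weigh far more than differences in rating (0 to 100), which is one reason an outlier in gross tends to end up alone.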

Let’s see how the movies are grouped:

Title                         Rating  domestic_gross  cluster
Flatliners                    4       61.30815        1
Elysium                       65      93.05012        1
Contraband                    52      66.52800        1
The Terminal                  61      77.07396        1
Rogue One: A Star Wars Story  83      532.17732       2
The Book of Life              82      50.15154        3
Milk                          93      31.84130        3
Criminal                      69      14.70870        3
Open Range                    79      58.33125        3
Frida                         75      25.88500        3

3.5 Visualizing model result

Let’s remake the same plot as before, but use color to indicate each movie’s cluster
assignment given by the k-means algorithm.
The algorithm did make the Star Wars movie its own group, since it is so different
from the other movies. The grouping of the remaining movies is not as clean.

To make the plot and clustering more interpretable, let’s annotate the graph with
some movie titles. In the k-means algorithm, each group of movies is represented by
an average rating and an average domestic gross. What we can do is find the movie
in each group that is closest to the average and use that movie title to annotate each
group in the plot.
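
The idea of picking a representative movie per group can be sketched as follows. The cluster labels are copied from the table of cluster assignments in this chapter rather than recomputed, and plain Euclidean distance on the unscaled values is an assumption.

```r
movies <- data.frame(
  title = c("Flatliners", "Elysium", "Contraband", "The Terminal",
            "Rogue One: A Star Wars Story", "The Book of Life", "Milk",
            "Criminal", "Open Range", "Frida"),
  rating = c(4, 65, 52, 61, 83, 82, 93, 69, 79, 75),
  domestic_gross = c(61.30815, 93.05012, 66.52800, 77.07396, 532.17732,
                     50.15154, 31.84130, 14.70870, 58.33125, 25.88500),
  cluster = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 3),
  stringsAsFactors = FALSE
)

# Cluster centers: average rating and average gross per cluster.
centers <- aggregate(cbind(rating, domestic_gross) ~ cluster,
                     data = movies, FUN = mean)

# For each cluster, find the member closest to its center.
closest <- sapply(centers$cluster, function(k) {
  members <- movies[movies$cluster == k, ]
  ctr <- centers[centers$cluster == k, ]
  d <- sqrt((members$rating - ctr$rating)^2 +
            (members$domestic_gross - ctr$domestic_gross)^2)
  members$title[which.min(d)]
})
closest
```

Since only ten of the movies are used here, the representatives found on this excerpt need not match the titles used to annotate the plot in the full analysis.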
Roughly, movies are clustered into Star Wars and low vs. high rated movies. The
latter two groups seem to differ somewhat in domestic gross. For example, movies like
“The Terminal” have lower ratings but make slightly more money than movies like
“Frida”. We could use statistical modeling to see if that’s the case, but will skip
that for now. Note also that the clustering algorithm we used seems to be assigning
one of the movies incorrectly, which warrants further investigation.

3.6 Abstracting the analysis

While not a tremendous success, we decide we want to carry on with this analysis.
We would like to do this for other actors’ movies. One of the big advantages of using
R is that we can write a piece of code that takes an actor’s name as input and
reproduces the steps of this analysis for that actor. We call these pieces of code
functions; we’ll see them and use them a lot in this course.

For our analysis, this function must do the following:

1. Scrape movie ratings from Rotten Tomatoes

2. Clean up the scraped data
3. Join with the budget data we downloaded previously
4. Perform the clustering algorithm
5. Make the final plot

With this in mind, we can write functions for each of these steps, and then make one
final function that puts all of these together.

For instance, let’s write the scraping function. It will take an actor’s name and output
the scraped data.
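
A scraping function along these lines might look like the sketch below. The names `actor_url()` and `scrape_actor_table()` are hypothetical, the URL pattern is inferred from the Diego Luna page given earlier, and taking the first HTML table on the page is an assumption that may not hold if the site layout has changed. The `rvest` package is assumed to be installed for the scraping step.

```r
# Build the celebrity page URL from a name, e.g. "Diego Luna" -> ".../diego_luna".
# Pass names without accents ("Gael Garcia Bernal"): any character outside
# a-z is replaced by an underscore.
actor_url <- function(actor) {
  slug <- gsub("[^a-z]+", "_", tolower(actor))
  paste0("https://www.rottentomatoes.com/celebrity/", slug)
}

# Download the page and return its first HTML table as a data frame.
scrape_actor_table <- function(actor) {
  if (!requireNamespace("rvest", quietly = TRUE)) {
    stop("This sketch needs the rvest package.")
  }
  page <- rvest::read_html(actor_url(actor))
  tables <- rvest::html_table(page)  # one data frame per <table> element
  if (length(tables) == 0) {
    stop("No HTML table found; the page layout may have changed.")
  }
  tables[[1]]
}

actor_url("Diego Luna")
```

Calling `scrape_actor_table("Gael Garcia Bernal")` would then attempt the download; it is not run here because it needs network access.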

Let’s test it with Gael García Bernal:

Rating        Title                                                    Credit              BoxOffice
No Score Yet  It Must Be Heaven                                        Actor               —
No Score Yet  Lorena, Light-Footed Woman (Lorena, la de pies ligeros)  Executive Producer  —
85%           Ema                                                      Gastón              —

Good start. We can then write functions for each of the steps we did with Diego Luna
before.

Then put all of these steps into one function that calls our new functions to put all of
our analysis together:
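
The way such pieces fit together can be sketched as below. All function names here are hypothetical, and `analyze_movies()` starts from data that has already been scraped and cleaned, so it is a simplified stand-in for a function that takes an actor's name, not a reproduction of it.

```r
# Step: join ratings with budget data on the movie title.
join_budgets <- function(ratings, budgets) {
  merge(ratings, budgets, by.x = "Title", by.y = "movie")
}

# Step: cluster on rating and domestic gross, adding a cluster column.
cluster_movies <- function(movies, k) {
  features <- as.matrix(movies[, c("Rating", "domestic_gross")])
  movies$cluster <- kmeans(features, centers = k, nstart = 25)$cluster
  movies
}

# Step: plot the movies, colored by cluster.
plot_clusters <- function(movies) {
  plot(movies$Rating, movies$domestic_gross, col = movies$cluster, pch = 19,
       xlab = "Rotten Tomatoes rating",
       ylab = "Domestic gross (millions USD)")
  invisible(movies)
}

# One function that chains the steps together.
analyze_movies <- function(ratings, budgets, k = 3) {
  plot_clusters(cluster_movies(join_budgets(ratings, budgets), k))
}

# Small demonstration with values from this chapter.
ratings <- data.frame(
  Title = c("Elysium", "Milk", "Frida", "The Terminal",
            "Rogue One: A Star Wars Story"),
  Rating = c(65, 93, 75, 61, 83),
  stringsAsFactors = FALSE
)
budgets <- data.frame(
  movie = ratings$Title,
  domestic_gross = c(93.05012, 31.84130, 25.88500, 77.07396, 532.17732),
  stringsAsFactors = FALSE
)
set.seed(1)
result <- analyze_movies(ratings, budgets, k = 2)
```

Keeping each step in its own small function makes it possible to test the steps separately and to swap one out, for example a different clustering method, without touching the rest.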

We can test this with Gael García Bernal:

analyze_actor("Gael Garcia Bernal")
3.7 Making analyses accessible
Now that we have written a function to analyze an actor’s movies, we can make these
analyses easier to produce by creating an interactive application that wraps our new
function. The shiny R package makes creating this type of application easy.
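
A minimal sketch of such an application is shown below, assuming the `shiny` package is installed. The `analyze_actor()` defined here is only a placeholder that draws an empty titled plot, and its second argument `k` is an assumption made so that the number of clusters can be changed from the interface.

```r
library(shiny)

# Placeholder for the real analysis function (hypothetical signature):
# it only draws an empty plot with a title.
analyze_actor <- function(actor, k = 3) {
  plot.new()
  title(main = paste("Clusters for", actor, "with k =", k))
}

# User interface: an actor name, a cluster count, and the resulting plot.
ui <- fluidPage(
  titlePanel("Movie clusters by actor"),
  textInput("actor", "Actor", value = "Diego Luna"),
  sliderInput("k", "Number of clusters", min = 2, max = 5, value = 3, step = 1),
  plotOutput("cluster_plot")
)

# Server: redraw the plot whenever either input changes.
server <- function(input, output, session) {
  output$cluster_plot <- renderPlot({
    analyze_actor(input$actor, input$k)
  })
}

# Build the app object; runApp(app) would launch it in an interactive session.
app <- shinyApp(ui = ui, server = server)
```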

3.8 Summary
In this analysis we saw examples of the common steps and operations in a data
analysis:

1. Data ingestion: we scraped and cleaned data from publicly accessible sites
2. Data manipulation: we integrated data from multiple sources to prepare our
analysis
3. Data visualization: we made plots to explore patterns in our data
4. Data modeling: we made a model to capture the grouping patterns in data
automatically, using visualization to explore the results of this modeling
5. Publishing: we abstracted our analysis into an application that allows us and
others to perform this analysis over more datasets and explore the results of
modeling using a variety of parameters
