3 An Illustrative Analysis: 3.1 Gathering Data
An Illustrative Analysis
http://fivethirtyeight.com has a clever series of articles on the types of movies
different actors make in their careers:
https://fivethirtyeight.com/tag/hollywood-taxonomy/
For now, we will go step by step through this analysis without showing how we
perform it in R. As the course progresses, we will learn how to carry out each of
these steps.
3.1 Gathering data
3.1.1 Movie ratings
For this analysis we need to get the movies Diego Luna was in, along with their
Rotten Tomatoes ratings. For that we scrape this
webpage: https://www.rottentomatoes.com/celebrity/diego_luna .
Once we scrape the data from the Rotten Tomatoes website and clean it up, this is
part of what we have so far:

Rating  Title                         Credit                 BoxOffice  Year
83      Rogue One: A Star Wars Story  Captain Cassian Andor  $532.2M    2016

This data includes, for each of the movies Diego Luna has acted in, the Rotten
Tomatoes rating, the movie title, Diego Luna’s role in the movie, the U.S. domestic
gross and the year of release.
(Note 01.2018: after the initial version of this analysis, this website added pagination
to this URL. We will be using the CSV file scraped originally in Summer 2017 for this
analysis and leave the issue of dealing with pagination as an exercise.)
## Parsed with column specification:
## cols(
## release_date = col_date(format = ""),
## movie = col_character(),
## production_budget = col_double(),
## domestic_gross = col_double(),
## worldwide_gross = col_double()
## )
This is part of what we have for that table after loading and cleaning up:

release_date  movie                                     production_budget  domestic_gross  worldwide_gross
2015-12-18    Star Wars Ep. VII: The Force Awakens      306                936.66223       2058.6622
2007-05-24    Pirates of the Caribbean: At World’s End  300                309.42043       963.4204
2012-07-20    The Dark Knight Rises                     275                448.13910       1084.4391
2013-07-02    The Lone Ranger                           275                89.30212        260.0021
2012-03-09    John Carter                               275                73.05868        282.7781
2015-05-01    Avengers: Age of Ultron                   250                459.00587       1404.7059
This data covers 5358 movies and includes, for each one, its release date, title,
production budget, and domestic and worldwide gross. The budget and gross figures
are in millions of U.S. dollars.
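As a rough sketch of the loading step, the snippet below reads three rows copied from the table above with base R's `read.csv`, using `colClasses` to request the same column types the parsing message lists. The inline `csv_text` stands in for the real CSV file, whose name is not given here, and the original analysis may well have used a different reader.

```r
# Three rows copied from the table above; a real run would read the CSV file instead.
csv_text <- "release_date,movie,production_budget,domestic_gross,worldwide_gross
2015-12-18,Star Wars Ep. VII: The Force Awakens,306,936.66223,2058.6622
2012-07-20,The Dark Knight Rises,275,448.13910,1084.4391
2012-03-09,John Carter,275,73.05868,282.7781"

# Ask for a date column, a character column and three numeric columns,
# mirroring the column specification printed above.
budgets <- read.csv(
  text = csv_text,
  colClasses = c("Date", "character", "numeric", "numeric", "numeric")
)

# Inspect the resulting column types.
str(budgets)
```

Declaring the types up front means a malformed date or a non-numeric budget is flagged at load time rather than surfacing later in the analysis.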
One thing we might want to check is whether the budget and gross entries in this
table are inflation adjusted. To do this, we can plot domestic gross, the measure we
use in the subsequent analyses, against release date.
Although we don’t know for sure, since the source of our data does not state this
specifically, it looks like the domestic gross measurement is not inflation adjusted
since gross increases over time.
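One way to prepare such a check, sketched here under assumptions, is to summarise domestic gross by release year before plotting it. The snippet uses only the six rows shown earlier, so it illustrates the mechanics and says nothing about the actual trend in the full table of 5358 movies; the choice of the median as the yearly summary is ours.

```r
# Six rows from the budget table shown earlier (gross in millions of U.S. dollars).
budgets <- data.frame(
  release_date = as.Date(c("2015-12-18", "2007-05-24", "2012-07-20",
                           "2013-07-02", "2012-03-09", "2015-05-01")),
  domestic_gross = c(936.66223, 309.42043, 448.13910,
                     89.30212, 73.05868, 459.00587)
)

# Extract the release year and compute the median domestic gross per year.
budgets$year <- as.integer(format(budgets$release_date, "%Y"))
gross_by_year <- aggregate(domestic_gross ~ year, data = budgets, FUN = median)

print(gross_by_year)
```

On the full table, plotting `domestic_gross` against `year` from this summary would show whether typical gross drifts upward over time, which is what we would expect from values that are not inflation adjusted.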
3.2 Manipulating the data
Next, we combine the datasets we obtained to get closer to the data we need to make
the plot we want.
We combine the two datasets using the movie title, so that the end result has the
information in both tables for each movie.
Rating  Title  Credit           BoxOffice  Year  release_date
75      Frida  Alejandro Gomez  $25.7M     2002  2002-10-25
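A minimal sketch of this join in base R uses `merge`, matching the `Title` column of the ratings table to the `movie` column of the budget table. The two small data frames below contain only values that appear in this chapter; treating the join as an inner join (keeping only titles found in both tables) is an assumption about how the original analysis was done.

```r
# Two rows from the ratings data.
ratings <- data.frame(
  Rating = c(83, 75),
  Title = c("Rogue One: A Star Wars Story", "Frida"),
  BoxOffice = c("$532.2M", "$25.7M"),
  Year = c(2016, 2002),
  stringsAsFactors = FALSE
)

# Two rows from the budget data.
budgets <- data.frame(
  release_date = as.Date(c("2002-10-25", "2012-07-20")),
  movie = c("Frida", "The Dark Knight Rises"),
  domestic_gross = c(25.88500, 448.13910),
  stringsAsFactors = FALSE
)

# Inner join on the movie title: only titles present in both tables survive.
joined <- merge(ratings, budgets, by.x = "Title", by.y = "movie")
print(joined)
```

In this small sample only Frida appears in both tables, so the result has a single row. Joining on titles is fragile in general, since small spelling differences between the two sources silently drop movies.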
We see that there is one clear outlier in Diego Luna’s movies, which is probably the
one Star Wars movie he acted in. The remaining movies could potentially be grouped
into two types: those with higher ratings and those with lower ratings.
3.4 Modeling data
We can use a clustering algorithm to partition Diego Luna’s movies. Using the data
we have obtained so far, we can see whether the k-means clustering algorithm
partitions these movies into three sensible groups based on each movie’s rating and
domestic gross.
Title       Rating  domestic_gross  cluster
Flatliners  4       61.30815        1
Elysium     65      93.05012        1
Contraband  52      66.52800        1
Milk        93      31.84130        3
Criminal    69      14.70870        3
Frida       75      25.88500        3
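The clustering step can be sketched with `kmeans` from base R's `stats` package. The example uses the six rows shown above plus Rogue One, whose gross is taken from the `BoxOffice` value in the ratings table; that substitution, the seed, the `nstart` value and the use of unscaled features are all assumptions of this sketch, not details confirmed by the original analysis.

```r
# Six movies from the table above, plus Rogue One (gross assumed from BoxOffice).
movies <- data.frame(
  movie = c("Flatliners", "Elysium", "Contraband", "Milk", "Criminal",
            "Frida", "Rogue One: A Star Wars Story"),
  rating = c(4, 65, 52, 93, 69, 75, 83),
  domestic_gross = c(61.30815, 93.05012, 66.52800, 31.84130, 14.70870,
                     25.88500, 532.2),
  stringsAsFactors = FALSE
)

# k-means starts from random centers, so fix the seed and try several starts.
set.seed(1)
fit <- kmeans(movies[, c("rating", "domestic_gross")], centers = 3, nstart = 25)

# Attach the cluster label to each movie; the label numbers themselves are arbitrary.
movies$cluster <- fit$cluster
print(movies)
```

Because the features are not rescaled, domestic gross (which spans hundreds of millions) carries more weight than rating (0 to 100) in the distance calculation. Standardising both columns first is a common alternative and may give a different partition.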
To make the plot and clustering more interpretable, let’s annotate the graph with
some movie titles. In the k-means algorithm, each group of movies is represented by
an average rating and an average domestic gross. What we can do is find the movie
in each group that is closest to the average and use that movie title to annotate each
group in the plot.
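The idea of picking the movie closest to each group's average can be sketched as follows. The six movies and their cluster labels are taken from the table above; since that table is only part of the data, the group averages computed here are illustrative and will differ from those of the full clusters.

```r
# Movies and cluster labels as listed in the table above (a partial sample).
movies <- data.frame(
  movie = c("Flatliners", "Elysium", "Contraband", "Milk", "Criminal", "Frida"),
  rating = c(4, 65, 52, 93, 69, 75),
  domestic_gross = c(61.30815, 93.05012, 66.52800, 31.84130, 14.70870, 25.88500),
  cluster = c(1, 1, 1, 3, 3, 3),
  stringsAsFactors = FALSE
)

# Average rating and gross within each cluster, repeated on every row of that cluster.
movies$center_rating <- ave(movies$rating, movies$cluster)
movies$center_gross <- ave(movies$domestic_gross, movies$cluster)

# Euclidean distance from each movie to its own cluster average.
movies$dist <- sqrt((movies$rating - movies$center_rating)^2 +
                    (movies$domestic_gross - movies$center_gross)^2)

# Keep, for each cluster, the movie with the smallest distance.
closest <- do.call(rbind, lapply(split(movies, movies$cluster),
                                 function(g) g[which.min(g$dist), ]))
print(closest[, c("cluster", "movie", "dist")])
```

The titles in `closest` can then be used as text labels placed at the corresponding points of the plot.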
Roughly, movies are clustered into Star Wars and low vs. high rated movies. The
low and high rated groups also seem to differ somewhat in domestic gross. For
example, movies like “The Terminal” have lower ratings but make slightly more
money than movies like “Frida”. We could use statistical modeling to see if that’s
the case, but will skip that for now. Note also that the clustering algorithm we
used seems to assign one of the movies incorrectly, which warrants further
investigation.
With this in mind, we can write functions for each of these steps, and then make one
final function that puts all of these together.
For instance, let’s write the scraping function. It will take an actor’s name and output
the scraped data.
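Two small helpers that such a scraping function would likely need are sketched below: one builds the page URL from an actor's name, the other converts `BoxOffice` strings to numbers. Both function names are hypothetical, and the slug rule (lowercase, non-alphanumeric runs replaced by underscores) is inferred from the single Diego Luna URL used earlier, so it may not hold for every actor. The download and HTML parsing themselves are left out.

```r
# Hypothetical helper: build the celebrity page URL from an actor's name.
# The slug rule is an assumption based on the diego_luna URL.
actor_url <- function(actor) {
  slug <- gsub("[^a-z0-9]+", "_", tolower(trimws(actor)))
  paste0("https://www.rottentomatoes.com/celebrity/", slug)
}

# Hypothetical helper: turn strings like "$25.7M" into millions of dollars.
# Anything that does not match that pattern becomes NA.
parse_box_office <- function(x) {
  is_millions <- grepl("^\\$[0-9.]+M$", x)
  out <- rep(NA_real_, length(x))
  out[is_millions] <- as.numeric(gsub("[$M]", "", x[is_millions]))
  out
}

print(actor_url("Diego Luna"))
print(parse_box_office(c("$532.2M", "$25.7M", "not available")))
```

Keeping these pieces separate from the download step makes them easy to check on their own before pointing the function at a live page.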
Rating        Title              Credit  BoxOffice
No Score Yet  It Must Be Heaven  Actor   —
Good start. We can then write functions for each of the steps we did with Diego Luna
before.
We then put all of these steps into one function that calls our new functions and
runs the whole analysis:
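As a hedged illustration of how such a wrapper might look, the sketch below chains a scraping step, the join and the clustering. The names `analyze_actor` and `fake_scrape` are hypothetical; `fake_scrape` simply returns ratings listed earlier in this chapter instead of contacting the website, and the call uses two clusters because this six-movie sample contains no Star Wars outlier.

```r
# Hypothetical wrapper: scrape, join on title, then cluster on rating and gross.
analyze_actor <- function(actor, budgets, scrape_fn, k = 3) {
  ratings <- scrape_fn(actor)
  joined <- merge(ratings, budgets, by.x = "Title", by.y = "movie")
  fit <- kmeans(joined[, c("Rating", "domestic_gross")], centers = k, nstart = 25)
  joined$cluster <- fit$cluster
  joined
}

# Stand-in for the real scraper: ignores its argument and returns known ratings.
fake_scrape <- function(actor) {
  data.frame(
    Title = c("Flatliners", "Elysium", "Contraband", "Milk", "Criminal", "Frida"),
    Rating = c(4, 65, 52, 93, 69, 75),
    stringsAsFactors = FALSE
  )
}

# Domestic gross values from the clustering table shown earlier.
budgets <- data.frame(
  movie = c("Flatliners", "Elysium", "Contraband", "Milk", "Criminal", "Frida"),
  domestic_gross = c(61.30815, 93.05012, 66.52800, 31.84130, 14.70870, 25.88500),
  stringsAsFactors = FALSE
)

set.seed(1)
result <- analyze_actor("Diego Luna", budgets, scrape_fn = fake_scrape, k = 2)
print(result)
```

Passing the scraper in as an argument keeps the wrapper testable offline; the real analysis would supply the function that actually downloads and cleans the page.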
3.8 Summary
In this analysis we saw examples of the common steps and operations in a data
analysis:
1. Data ingestion: we scraped and cleaned data from publicly accessible sites