Lab 5
Lab 5
1b. Remove the Region column from state.df . You can do this in (at least) two ways: by using negative
indexing, or by directly setting state.df$Region to be NULL . Display the first 3 rows and all 10 columns of
state.df .
1c. Add two columns to state.df , containing the x and y coordinates (longitude and latitude, respectively) of the
center of the states, that are stored in the (existing) list state.center . Hint: take a look at this list in the console,
to see what its elements are named. Name these two columns Center.x and Center.y . Display the first 3 rows
and all 12 columns of state.df .
1d. Make a new data frame which contains only those states whose longitude is less than -100. Do this in two
different ways: using manual indexing, and subset (). Check that they are equal to each other, using an
appropriate function call.
1e. Make a new data frame which contains only the states whose longitude is less than -100, and whose murder
rate is above 9%. Print this new data frame to the console. Among the states in this new data frame, which has
the highest average life expectancy?
2d. Using lapply() again, extract the p-values from the tests object you created in the last question, with just a
single line of code. Hint: first, take a look at the first element of tests , what kind of object is it, and how is the
p-value stored? Second, run the command
"[["(pros.dat, "lcavol") in your console—what does this do? Now use what you’ve learned to extract p-values
from the tests object.
Rio Olympics data set
Now we’re going to examine data from the 2016 Summer Olympics in Rio de Janeiro, taken from
https://github.com/flother/rio2016 (https://github.com/flother/rio2016) (itself put together by scraping the
official Summer Olympics website for information about the athletes). Below we read in the data and store it
as rio .
rio = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp/data/rio.csv")
3b. Use rio to answer the following questions. How many athletes competed in the 2016 Summer Olympics?
How many countries were represented? What were these countries, and how many athletes competed for each
one? Which country brought the most athletes, and how many was this?
3c. How many medals of each type—gold, silver, bronze—were awarded at this Olympics? Are they equal? Is
this result surprising, and can you explain what you are seeing?
3d. Create a column called total which adds the number of gold, silver, and bronze medals for each athlete, and
add this column to rio . Which athlete had the most number of medals and how many was this? Gold medals?
Silver medals? In the case of ties, here, display all the relevant athletes.
3e. Using tapply() , calculate the total medal count for each country. Save the result as total.by.nat , and print it
to the console. Which country had the most number of medals, and how many was this? How many countries
had zero medals?
3f. Among the countries that had zero medals, which had the most athletes, and how many athletes was this?
Young and old folks
4a. The variable date_of_birth contains strings of the date of birth of each athlete. Use text processing
commands to extract the year of birth, and create a new numeric variable called age , equal to 2016 - (the year
of birth). (Here we’re ignoring days and months for simplicity.) Add the age variable to the rio data frame.
variable Who is the oldest athlete, and how old is he/she? Youngest athlete, and how old is he/she? In the case
of ties, here, display all the relevant athletes.
4b. Answer the same questions as in the last part, but now only among athletes who won a medal.
4c. Using a single call to tapply() , answer: how old are the youngest and oldest athletes, for each sport?
4d. You should see that your output from tapply() in the last part is a list, which is not particularly convenient.
Convert this list into a matrix that has one row for each sport, and two columns that display the ages of the
youngest and oldest athletes in that sport. The first 3 rows should look like this:
Youngest Oldest
aquatics 14 41
archery 17 44
athletics 16 47
You’ll notice that we set the row names according to the sports, and we also set appropriate column names.
Hint: unlist() will unravel all the values in a list; and matrix() , as you’ve seen before, can be used to create a
matrix from a vector of values. After you’ve converted the results to a matrix, print it to the console (and make
sure its first 3 rows match those displayed above).
Challenge. Was that conversion in the last part annoying? Replace the original call to tapply() in Q4c by a call
to one of the d*ply() functions from the plyr() package, in order to directly create a matrix as you sought in Q4d.
Challenge. Determine the names of the youngest and oldest athletes in each sport, without using any explicit
iteration. In the case of ties, just return one relevant athlete name. Again, the d*ply() functions from the plyr
package provide a clean solution.
Sport by sport
5a. Create a new data frame called sports , which we’ll populate with information about each sporting event at
the Summer Olympics. Initially, define sports to contain a single variable called sport which contains the names
of the sporting events in alphabetical order. Then, add a column called n_participants which contains the
number of participants in each sport. Use one of the apply functions to determine the number of gold medals
given out for each sport, and add this as a column called n_gold . Using your newly created sports data frame,
calculate the ratio of the number of gold medals to participants for each sport. Which sport has the highest
ratio? Which has the lowest?
5b. Use one of the apply functions to compute the average weight of the participants in each sport, and add this
as a column to sports called ave_weight . Important: there are missing weights in the data set coded as NA, but
your column ave_weight should ignore these, i.e., it should be itself free of NA values. You will have to pass an
additional argument to your apply call in order to achieve this.
Hint: look at the help file for the mean() function; what argument can you set to ignore NA values?
Once computed, display the average weights along with corresponding sport names, in decreasing order of
average weight.
5c. As in the last part, compute the average weight of atheletes in each sport, but now separately for men and
women. You should therefore add two new columns, called ave_weight_men and ave_weight_women , to sports .
Once computed, display the average weights along with corresponding sports, for men and women, each list
sorted in decreasing order of average weight.
Are the orderings roughly similar?
Challenge. Repeat the calculation as in the last part, but with BMI (body mass index) replacing weight.
Challenge. Use one of the apply functions to compute the proportion of women among participating atheletes
in each sport. Use these proportions to recompute the average weight (over all athletes in each sport) from the
ave_weight_men and average_weight_women columns, and define a new column ave_weight2 accordingly. Does
ave_weight2 differ from ave_weight ? It should. Explain why. Then show how to recompute the average weight
from ave_weight_men and average_weight_women in a way that exactly recreates average_weight .