Lab 5: Data Frames and Apply

This lab is to be done in class (completed outside of class if need be).

States data set

Below we construct a data frame of 50 states x 10 variables. The first 8 variables are numeric and the last 2
are factors. The numeric variables come from the built-in state.x77 matrix, which records various
demographic factors on the 50 US states, measured in the 1970s. You can learn more about this state data set by
typing ?state.x77 into your R console.

state.df = data.frame(state.x77, Region=state.region, Division=state.division)

Basic data frame manipulations

1a. Add a column to state.df containing the state abbreviations that are stored in the built-in vector state.abb.
Name this column Abbr. You can do this in (at least) two ways: by using a call to data.frame(), or by directly
defining state.df$Abbr. Display the first 3 rows and all 11 columns of the new state.df.
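
The two routes mentioned above can be sketched on a small stand-in data frame. The names toy and abbr and all values below are made up for illustration; they are not part of the lab data.

```r
# Toy stand-in for state.df (made-up values)
toy <- data.frame(pop = c(10, 20, 30), income = c(1.5, 2.5, 3.5))
abbr <- c("AA", "BB", "CC")   # hypothetical abbreviations

# Way 1: build a new data frame with data.frame()
toy1 <- data.frame(toy, Abbr = abbr, stringsAsFactors = FALSE)

# Way 2: assign the new column directly
toy2 <- toy
toy2$Abbr <- abbr

# Check that both routes agree, then show the first rows
all.equal(toy1, toy2)
head(toy1, 3)
```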

1b. Remove the Region column from state.df. You can do this in (at least) two ways: by using negative
indexing, or by directly setting state.df$Region to be NULL. Display the first 3 rows and all 10 columns of
state.df.
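
As a rough illustration of the two removal routes, here is a sketch on a made-up data frame (toy and its column names are assumptions, not the lab data).

```r
# Toy data frame with a column "b" that we want to drop
toy <- data.frame(a = 1:3, b = c("x", "y", "z"), c = c(2.5, 3.5, 4.5))

# Way 1: negative indexing by column position
drop1 <- toy[, -which(names(toy) == "b")]

# Way 2: set the column to NULL
drop2 <- toy
drop2$b <- NULL

# Both should leave only columns a and c
names(drop1)
all.equal(drop1, drop2)
```

Note that negative indexing with a single remaining column would return a plain vector unless drop = FALSE is supplied.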

1c. Add two columns to state.df containing the x and y coordinates (longitude and latitude, respectively) of the
center of the states, which are stored in the (existing) list state.center. Hint: take a look at this list in the console,
to see what its elements are named. Name these two columns Center.x and Center.y. Display the first 3 rows
and all 12 columns of state.df.
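
A minimal sketch of pulling named list elements into data frame columns, using a hypothetical list centers in place of state.center (the values are made up):

```r
# Toy data frame and a list with two named numeric elements
toy <- data.frame(a = 1:3)
centers <- list(x = c(-86.8, -127.3, -111.6), y = c(32.6, 49.3, 34.2))

# Inspect the element names first
names(centers)

# Each list element becomes one new column
toy$Center.x <- centers$x
toy$Center.y <- centers[["y"]]   # [[ ]] by name works the same way as $
toy
```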

1d. Make a new data frame which contains only those states whose longitude is less than -100. Do this in two
different ways: using manual indexing, and subset(). Check that they are equal to each other, using an
appropriate function call.
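
For reference, the two subsetting styles look roughly like this on a made-up data frame (toy, lon, and rate are hypothetical names):

```r
# Toy data with a longitude-like column
toy <- data.frame(name = c("A", "B", "C", "D"),
                  lon  = c(-120, -90, -105, -75),
                  rate = c(5, 11, 10, 3))

# Manual indexing: a logical vector selects the rows
west1 <- toy[toy$lon < -100, ]

# subset(): the condition is evaluated inside the data frame
west2 <- subset(toy, lon < -100)

# Compare the two results
all.equal(west1, west2)
```
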
1e. Make a new data frame which contains only the states whose longitude is less than -100, and whose murder
rate is above 9 (per 100,000 population, the units of the Murder column in state.x77). Print this new data frame
to the console. Among the states in this new data frame, which has the highest average life expectancy?
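
Combining two row conditions and then picking the row with the largest value of another column can be sketched as follows. The data frame toy and its columns are made up for illustration.

```r
# Toy data: a longitude, a rate, and a life-expectancy-like column
toy <- data.frame(name = c("A", "B", "C", "D"),
                  lon  = c(-120, -90, -105, -110),
                  rate = c(5, 11, 10, 12),
                  life = c(71, 69, 70.5, 72),
                  stringsAsFactors = FALSE)

# & combines the two logical conditions element by element
both <- toy[toy$lon < -100 & toy$rate > 9, ]
both

# which.max() gives the position of the largest value
best <- both$name[which.max(both$life)]
best
```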

Prostate cancer data set

Let’s consider the prostate cancer data set (taken from the book The Elements of Statistical Learning
(http://statweb.stanford.edu/~hastie/ElemStatLearn/)). Below we read in a data frame of 97 men x 9 variables.
pros.dat = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat")

Practice with the apply family

2a. Using sapply(), calculate the mean of each variable. Also, calculate the standard deviation of each variable.
Each should require just one line of code. Display your results.
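
Since a data frame is a list of columns, sapply() applies a function to each column and simplifies the result to a named vector. A small sketch on made-up data (toy is an assumption):

```r
# Toy data frame with two numeric columns
toy <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))

# One value per column, returned as a named numeric vector
means <- sapply(toy, mean)
sds   <- sapply(toy, sd)
means
sds
```
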
2b. Let’s plot each variable against SVI. Using lapply(), plot each column, excluding SVI, on the y-axis with
SVI on the x-axis. This should require just one line of code. Challenge: label the y-axes in your plots
appropriately. Your solution should still consist of just one line of code and use an apply function. Hint: for
this part, consider using mapply().
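
The idea behind the mapply() hint is that it walks over several arguments in parallel, for example the columns and their names. A hedged sketch on made-up data (toy and its column names are assumptions; the function also returns the label so there is something to inspect besides the plots):

```r
# Toy data: an indicator column and two other numeric columns
toy <- data.frame(svi    = c(0, 1, 0, 1),
                  lcavol = c(1.2, 2.5, 0.8, 3.1),
                  age    = c(50, 64, 58, 70))
others <- setdiff(names(toy), "svi")

# mapply() passes column i and name i to the function together
res <- mapply(function(y, label) {
  plot(toy$svi, y, xlab = "svi", ylab = label)
  label
}, toy[others], others)
res
```
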
2c. Now, use lapply() to perform t-tests for each variable in the data set, between SVI and non-SVI groups. To
be precise, you will perform a t-test for each variable excluding the SVI variable itself.
For convenience, we’ve defined a function t.test.by.ind() below, which takes a numeric variable x, and then an
indicator variable ind (of 0s and 1s) that defines the groups. Run this function on the columns of pros.dat,
excluding the SVI column itself, and save the result as tests. What kind of data structure is tests? Print it to the
console.
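t.test.by.ind = function(x, ind) {
  # ind must contain only 0s and 1s
  stopifnot(all(ind %in% c(0, 1)))
  # Two-sample t-test comparing x between the ind == 0 and ind == 1 groups
  return(t.test(x[ind == 0], x[ind == 1]))
}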
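
Extra arguments given to lapply() after the function are passed on to every call, which is how a fixed indicator can be supplied. A sketch on made-up data, with the helper repeated so the block runs on its own (toy and its columns are assumptions):

```r
# Same helper as in the lab text
t.test.by.ind <- function(x, ind) {
  stopifnot(all(ind %in% c(0, 1)))
  t.test(x[ind == 0], x[ind == 1])
}

# Toy data: indicator column plus two numeric columns
toy <- data.frame(svi = c(0, 0, 0, 1, 1, 1),
                  a   = c(1.0, 1.5, 2.0, 3.0, 3.5, 4.1),
                  b   = c(5, 7, 6, 6, 5, 7))

# ind = toy$svi is forwarded unchanged to each call
toy.tests <- lapply(toy[, names(toy) != "svi"], t.test.by.ind, ind = toy$svi)
class(toy.tests)
class(toy.tests[[1]])
```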

2d. Using lapply() again, extract the p-values from the tests object you created in the last question, with just a
single line of code. Hint: first, take a look at the first element of tests: what kind of object is it, and how is the
p-value stored? Second, run the command
"[["(pros.dat, "lcavol") in your console; what does this do? Now use what you’ve learned to extract p-values
from the tests object.
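
The hint relies on the fact that [[ is itself a function in R, so it can be handed to lapply() like any other. A sketch on a hypothetical nested list recs (made-up names and values, not real test output):

```r
# A list of records, each storing a value under the same name
recs <- list(first  = list(p.value = 0.03, statistic = 2.4),
             second = list(p.value = 0.40, statistic = 0.9))

# Calling [[ as a function: same as recs$first[["p.value"]]
`[[`(recs$first, "p.value")

# lapply() passes "p.value" as the second argument of [[ for each record
pvals <- lapply(recs, `[[`, "p.value")
pvals
```
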

Rio Olympics data set

Now we’re going to examine data from the 2016 Summer Olympics in Rio de Janeiro, taken from
https://github.com/flother/rio2016 (itself put together by scraping the
official Summer Olympics website for information about the athletes). Below we read in the data and store it
as rio.

rio = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp/data/rio.csv")

More practice with data frames and apply

3a. What kind of object is rio? What are its dimensions and column names? What does each row
represent? Is there any missing data?

3b. Use rio to answer the following questions. How many athletes competed in the 2016 Summer Olympics?
How many countries were represented? What were these countries, and how many athletes competed for each
one? Which country brought the most athletes, and how many was this?
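
Counting rows per group is a typical job for table(). A sketch on made-up data; the column name nationality is an assumption about how countries are stored.

```r
# Toy athlete table: one row per athlete
toy <- data.frame(name = c("Ann", "Bo", "Cy", "Di", "Ed"),
                  nationality = c("USA", "FRA", "USA", "KEN", "USA"))

nrow(toy)                          # number of athletes
counts <- table(toy$nationality)   # athletes per country
length(counts)                     # number of distinct countries
counts[which.max(counts)]          # largest delegation and its size
```
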
3c. How many medals of each type—gold, silver, bronze—were awarded at this Olympics? Are they equal? Is
this result surprising, and can you explain what you are seeing?

3d. Create a column called total which adds the number of gold, silver, and bronze medals for each athlete, and
add this column to rio. Which athlete had the most medals, and how many was this? The most gold medals?
The most silver medals? In the case of ties, display all the relevant athletes.
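
Note that which.max() returns only the first maximum, so ties need a comparison against max(). A sketch on made-up data (the column names gold, silver, and bronze are assumptions):

```r
# Toy medal counts per athlete
toy <- data.frame(name   = c("Ann", "Bo", "Cy"),
                  gold   = c(2, 0, 1),
                  silver = c(0, 1, 1),
                  bronze = c(0, 0, 0),
                  stringsAsFactors = FALSE)

# Column arithmetic is vectorized, one total per row
toy$total <- toy$gold + toy$silver + toy$bronze

# Keep every row that attains the maximum, so ties are all shown
top <- toy[toy$total == max(toy$total), c("name", "total")]
top
```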

3e. Using tapply(), calculate the total medal count for each country. Save the result as total.by.nat, and print it
to the console. Which country had the most medals, and how many was this? How many countries
had zero medals?
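
tapply() splits a vector by a grouping variable and applies a function to each piece. A sketch on made-up data (toy and the column name nationality are assumptions; by.nat is a hypothetical name):

```r
# Toy data: one row per athlete with a medal total
toy <- data.frame(nationality = c("USA", "FRA", "USA", "KEN"),
                  total = c(2, 0, 1, 0))

# Sum the totals within each country
by.nat <- tapply(toy$total, toy$nationality, sum)
by.nat

names(by.nat)[which.max(by.nat)]   # group with the largest sum
sum(by.nat == 0)                   # number of groups with a zero sum
```
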
3f. Among the countries that had zero medals, which had the most athletes, and how many athletes was this?

Young and old folks

4a. The variable date_of_birth contains strings of the date of birth of each athlete. Use text processing
commands to extract the year of birth, and create a new numeric variable called age, equal to 2016 - (the year
of birth). (Here we’re ignoring days and months for simplicity.) Add the age variable to the rio data frame.
Who is the oldest athlete, and how old is he/she? Who is the youngest athlete, and how old is he/she? In the case
of ties, display all the relevant athletes.
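
One way to pull a year out of a date string is substr(), sketched below. The "YYYY-MM-DD" layout and the example dates are assumptions; check the actual strings in date_of_birth before relying on fixed positions.

```r
# Made-up date strings, assumed to start with a 4-digit year
dob <- c("1969-10-17", "2002-07-15", "1986-09-23")

# Characters 1 to 4 hold the year; convert the text to a number
year <- as.numeric(substr(dob, 1, 4))
age <- 2016 - year
age

# Positions of all oldest entries (handles ties)
which(age == max(age))
```
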
4b. Answer the same questions as in the last part, but now only among athletes who won a medal.

4c. Using a single call to tapply(), answer: how old are the youngest and oldest athletes, for each sport?
4d. You should see that your output from tapply() in the last part is a list, which is not particularly convenient.
Convert this list into a matrix that has one row for each sport, and two columns that display the ages of the
youngest and oldest athletes in that sport. The first 3 rows should look like this:

          Youngest Oldest
aquatics        14     41
archery         17     44
athletics       16     47

You’ll notice that we set the row names according to the sports, and we also set appropriate column names.
Hint: unlist() will unravel all the values in a list; and matrix(), as you’ve seen before, can be used to create a
matrix from a vector of values. After you’ve converted the results to a matrix, print it to the console (and make
sure its first 3 rows match those displayed above).
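
When the function given to tapply() returns more than one value per group, the result is a list, which unlist() and matrix() can reshape. A sketch on made-up data (toy and its values are assumptions, not the lab's numbers):

```r
# Toy data: ages by sport
toy <- data.frame(sport = c("archery", "aquatics", "archery", "aquatics", "athletics"),
                  age   = c(17, 14, 44, 41, 30))

# range() returns c(min, max), so each group yields two values
ages <- tapply(toy$age, toy$sport, range)

# unlist() gives min, max, min, max, ...; byrow = TRUE puts one group per row
mat <- matrix(unlist(ages), ncol = 2, byrow = TRUE,
              dimnames = list(names(ages), c("Youngest", "Oldest")))
mat
```
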
Challenge. Was that conversion in the last part annoying? Replace the original call to tapply() in Q4c by a call
to one of the d*ply() functions from the plyr package, in order to directly create a matrix as you sought in Q4d.
Challenge. Determine the names of the youngest and oldest athletes in each sport, without using any explicit
iteration. In the case of ties, just return one relevant athlete name. Again, the d*ply() functions from the plyr
package provide a clean solution.

Sport by sport

5a. Create a new data frame called sports, which we’ll populate with information about each sporting event at
the Summer Olympics. Initially, define sports to contain a single variable called sport which contains the names
of the sporting events in alphabetical order. Then, add a column called n_participants which contains the
number of participants in each sport. Use one of the apply functions to determine the number of gold medals
given out for each sport, and add this as a column called n_gold. Using your newly created sports data frame,
calculate the ratio of the number of gold medals to participants for each sport. Which sport has the highest
ratio? Which has the lowest?
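
Building up a per-group summary data frame column by column might look like the sketch below. The data are made up, and the column name gold is an assumption; the key point is that table() and tapply() both order groups the same way as sort(unique()).

```r
# Toy athlete-level data
toy <- data.frame(sport = c("golf", "boxing", "golf", "boxing", "boxing"),
                  gold  = c(1, 0, 0, 1, 0))

# One row per sport, in alphabetical order
sports <- data.frame(sport = sort(unique(toy$sport)))

# Rows per sport, and gold medals per sport
sports$n_participants <- as.vector(table(toy$sport))
sports$n_gold <- as.vector(tapply(toy$gold, toy$sport, sum))

# Ratio of gold medals to participants
sports$ratio <- sports$n_gold / sports$n_participants
sports
```
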
5b. Use one of the apply functions to compute the average weight of the participants in each sport, and add this
as a column to sports called ave_weight. Important: there are missing weights in the data set coded as NA, but
your column ave_weight should ignore these, i.e., it should be itself free of NA values. You will have to pass an
additional argument to your apply call in order to achieve this.
Hint: look at the help file for the mean() function; what argument can you set to ignore NA values?
Once computed, display the average weights along with corresponding sport names, in decreasing order of
average weight.
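
Additional arguments placed after the function in tapply() are forwarded to that function. A sketch on made-up data showing the effect of forwarding na.rm = TRUE to mean() (toy, with.na, and no.na are hypothetical names):

```r
# Toy data with one missing weight
toy <- data.frame(sport  = c("golf", "golf", "boxing", "boxing"),
                  weight = c(80, NA, 60, 70))

# Without na.rm, any group containing an NA has an NA mean
with.na <- tapply(toy$weight, toy$sport, mean)

# na.rm = TRUE is passed through to mean()
no.na <- tapply(toy$weight, toy$sport, mean, na.rm = TRUE)

# Largest average first
sort(no.na, decreasing = TRUE)
```
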
5c. As in the last part, compute the average weight of athletes in each sport, but now separately for men and
women. You should therefore add two new columns, called ave_weight_men and ave_weight_women, to sports.
Once computed, display the average weights along with corresponding sports, for men and women, each list
sorted in decreasing order of average weight.
Are the orderings roughly similar?
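
One possible route is to give tapply() a list of two grouping variables, which produces a matrix with one row per sport and one column per sex. A sketch on made-up data; the column name sex and its coding as "male"/"female" are assumptions.

```r
# Toy data: weight by sport and sex
toy <- data.frame(sport  = c("golf", "golf", "boxing", "boxing"),
                  sex    = c("male", "female", "male", "female"),
                  weight = c(85, 62, 70, 55))

# Two grouping factors give a sport-by-sex matrix of means
by.sex <- tapply(toy$weight, list(toy$sport, toy$sex), mean, na.rm = TRUE)
by.sex

# Each column can then be sorted on its own
sort(by.sex[, "male"], decreasing = TRUE)
sort(by.sex[, "female"], decreasing = TRUE)
```
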
Challenge. Repeat the calculation as in the last part, but with BMI (body mass index) replacing weight.
Challenge. Use one of the apply functions to compute the proportion of women among participating athletes
in each sport. Use these proportions to recompute the average weight (over all athletes in each sport) from the
ave_weight_men and ave_weight_women columns, and define a new column ave_weight2 accordingly. Does
ave_weight2 differ from ave_weight? It should. Explain why. Then show how to recompute the average weight
from ave_weight_men and ave_weight_women in a way that exactly recreates ave_weight.
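
A small numerical sketch of how a group-weighted average depends on the weights used, for a single sport with made-up values (all names and numbers below are assumptions). The two proportions differ whenever missing weights are not spread evenly across the groups.

```r
# Toy data for one sport; one female weight is missing
toy <- data.frame(sex    = c("female", "female", "male", "male"),
                  weight = c(60, NA, 80, 90))

# Overall and per-group means, ignoring NAs
ave.all <- mean(toy$weight, na.rm = TRUE)
ave.w   <- mean(toy$weight[toy$sex == "female"], na.rm = TRUE)
ave.m   <- mean(toy$weight[toy$sex == "male"], na.rm = TRUE)

# Weighting by the share of women among all participants
p.w  <- mean(toy$sex == "female")
ave2 <- p.w * ave.w + (1 - p.w) * ave.m

# Weighting by the share of women among athletes with a recorded weight
has.w   <- !is.na(toy$weight)
p.w.obs <- mean(toy$sex[has.w] == "female")
ave3    <- p.w.obs * ave.w + (1 - p.w.obs) * ave.m

c(ave.all = ave.all, ave2 = ave2, ave3 = ave3)
```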
