R Assignment
The number of songs released in each year can be read from:
table(songs$year)
How many songs does the dataset include for which the artist name is
"Michael Jackson"?
18
If you look at the structure of the dataset by typing str(songs), you can
see that there are 1032 different values of the variable "artistname".
So if we create a table of artistname, it will be challenging to find
Michael Jackson. Instead, we can use subset:
MichaelJackson = subset(songs, artistname == "Michael Jackson")
Then, by typing str(MichaelJackson) or nrow(MichaelJackson), we
can see that there are 18 observations.
Which of these songs by Michael Jackson made it to the Top 10? Select
all that apply.
You Rock My World, You Are Not Alone - correct
Beat It
You Rock My World
Billie Jean
You Are Not Alone
We can answer this question by using our subset MichaelJackson from
the previous question. If you output the vector
MichaelJackson$songtitle, you can see the row number of each of the
songs. Then, you can see whether or not that song made it to the top
10 by outputting the value of Top10 for that row. For example, "Beat
It" is the 13th song in our subset. So then if we type:
MichaelJackson$Top10[13]
we get 0, which means that this song did not make it to the Top 10.
The song "You Rock My World" is first on the list, so if we type:
MichaelJackson$Top10[1]
we get 1, which means that this song did make it to the Top 10.
As a shortcut, you could just output:
MichaelJackson[c("songtitle", "Top10")]
The variable corresponding to the estimated time signature
(timesignature) is discrete, meaning that it only takes integer values (0,
1, 2, 3, . . . ). What are the values of this variable that occur in our
dataset? Select all that apply.
0, 1, 3, 4, 5, 7 - correct
0
1
2
3
4
5
6
7
8
Which timesignature value is the most frequent among songs in our
dataset?
4 - correct
0  1  2  3  4  5  6  7  8
You can answer these questions by using the table command:
table(songs$timesignature)
The only values that appear in the table for timesignature are 0, 1, 3,
4, 5, and 7. We can also read from the table that 6787 songs have a
value of 4 for the timesignature, which is the highest count out of all
of the possible timesignature values.
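If you prefer to extract the most frequent value programmatically instead of reading it off the table, a minimal sketch (assuming the songs data frame is loaded) is:
tsCounts = table(songs$timesignature)   # counts for each timesignature value
names(tsCounts)[which.max(tsCounts)]    # "4", the value with the highest count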
Out of all of the songs in our dataset, the song with the highest tempo is
one of the following songs. Which one is it?
Wanna Be Startin' Somethin' - correct
Until The Day I Die
Wanna Be Startin' Somethin'
My Happy Ending
You Make Me Wanna...
You can answer this question by using the which.max function. The
output of which.max(songs$tempo) is 6206, meaning that the song
with the highest tempo is in row 6206. We can output the song title
by typing:
songs$songtitle[6206]
The song title is: Wanna Be Startin' Somethin'.
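The lookup can also be done in one line, a small sketch assuming songs is loaded:
songs$songtitle[which.max(songs$tempo)]   # title of the row with the maximum tempo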
We wish to predict whether or not a song will make it to the Top 10. To
do this, first use the subset function to split the data into a training set
"SongsTrain" consisting of all the observations up to and including 2009
song releases, and a testing set "SongsTest", consisting of the 2010 song
releases.
How many observations (songs) are in the training set?
7201
You can split the data into the training set and the test set by using the
following commands:
SongsTrain = subset(songs, year <= 2009)
SongsTest = subset(songs, year == 2010)
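To confirm the split sizes, a quick check using the data frames defined above:
nrow(SongsTrain)   # 7201 songs released up to and including 2009
nrow(SongsTest)    # the 2010 releases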
To answer this question, you first need to run the three given
commands to remove the variables that we won't use in the model
from the datasets:
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]
Then, you can create the logistic regression model with the following command:
SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)
Looking at the bottom of the summary(SongsLog1) output, we can
see that the AIC value is 4827.2.
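The AIC can also be pulled out directly rather than read from the summary, a sketch assuming SongsLog1 was fit as above:
AIC(SongsLog1)   # about 4827.2, matching the bottom of summary(SongsLog1)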
Let's now think about the variables in our dataset related to the
confidence of the time signature, key and tempo
(timesignature_confidence, key_confidence, and tempo_confidence).
Our model seems to indicate that these confidence variables are
significant (rather than the variables timesignature, key and tempo
themselves). What does the model suggest?
The higher our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10 - correct
The lower our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10
If you look at the output summary(model), where model is the name
of your logistic regression model, you can see that the coefficient
estimates for the confidence variables (timesignature_confidence,
key_confidence, and tempo_confidence) are positive. This means that
higher confidence leads to a higher predicted probability of a Top 10
hit.
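To look at just those estimates, one possible sketch (assuming SongsLog1 is the model fit above and these variables were kept in SongsTrain):
summary(SongsLog1)$coefficients[c("timesignature_confidence", "key_confidence", "tempo_confidence"), ]   # all three estimates are positive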
In general, if the confidence is low for the time signature, tempo, and
key, then the song is more likely to be complex. What does Model 1
suggest in terms of complexity?
Mainstream listeners tend to prefer less complex songs - correct
Mainstream listeners tend to prefer more complex songs
Since the coefficient values for timesignature_confidence,
tempo_confidence, and key_confidence are all positive, lower
confidence leads to a lower predicted probability of a song being a hit.
So mainstream listeners tend to prefer less complex songs.
Songs with heavier instrumentation tend to be louder (have higher values in the variable "loudness") and more energetic (have higher values in the variable "energy"). What is the correlation between these two variables in the training set?
0
The correlation can be computed with the following command:
cor(SongsTrain$loudness, SongsTrain$energy)
Given that these two variables are highly correlated, Model 1 suffers from multicollinearity. To avoid this issue, we will omit one of these two variables and rerun the logistic regression. In the rest of this problem, we'll build two variations of our original model: Model 2, in which we keep "energy" and omit "loudness", and Model 3, in which we keep "loudness" and omit "energy".
Yes
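The commands used to build the two variations are not shown here; a hedged sketch consistent with the SongsLog3 object used below would be:
SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain, family=binomial)   # Model 2: keep energy, omit loudness
SongsLog3 = glm(Top10 ~ . - energy, data=SongsTrain, family=binomial)     # Model 3: keep loudness, omit energy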
You can make predictions on the test set by using the command:
testPredict = predict(SongsLog3, newdata=SongsTest,
type="response")
Then, you can create a confusion matrix with a threshold of 0.45 by
using the command:
table(SongsTest$Top10, testPredict >= 0.45)
The accuracy of the model is (309+19)/(309+5+40+19) = 0.87936
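The same accuracy can be computed from the stored confusion matrix, a sketch using the objects defined above:
confMat = table(SongsTest$Top10, testPredict >= 0.45)   # rows: actual Top10, columns: predicted hit?
sum(diag(confMat)) / sum(confMat)                       # overall accuracy, about 0.87936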
How many non-hit songs does Model 3 predict will be Top 10 hits (again, looking at the test set), using a threshold of 0.45?
5
Reading the confusion matrix above, 5 songs that did not make the Top 10 are predicted to be hits. In other words, Model 3 favors specificity: it makes very few Top 10 predictions, so it rarely labels a non-hit as a hit, at the cost of missing many actual hits.
INTERNET PRIVACY POLL
The number of people who took the poll is equal to the number of rows of the data frame, and can be obtained with nrow(poll) or from the output of str(poll).
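A minimal sketch of this step (the file name AnonymityPoll.csv is an assumption; substitute the name of your data file):
poll = read.csv("AnonymityPoll.csv")   # load the poll data
nrow(poll)                             # number of people who took the poll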
Let's look at the breakdown of the number of people with smartphones
using the table() and summary() commands on the Smartphone variable.
(HINT: These three numbers should sum to 1002.)
How many interviewees did not respond to the question, resulting in a
missing value, or NA, in the summary() output?
43
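A sketch of the two commands referred to in the question, assuming poll is loaded:
table(poll$Smartphone)     # counts of 0 (no smartphone) and 1 (smartphone), NAs excluded
summary(poll$Smartphone)   # also reports the 43 NA values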
When we pass two variables to the table() function, each entry of the table counts the number of observations in the data set that have the value of the first variable in that row, and the value of the second variable in that column. For example, suppose we want to create a table of the variables "Sex" and "Region". We would type
of the variables "Sex" and "Region". We would type
table(poll$Sex, poll$Region)
in our R Console, and we would get as output
Midwest Northeast South West
Female 123 90 176 116
Male 116 76 183 122
This table tells us that we have 123 people in our dataset who are female
and from the Midwest, 116 people in our dataset who are male and from
the Midwest, 90 people in our dataset who are female and from the
Northeast, etc.
You might find it helpful to use the table() function to answer the
following questions:
Which of the following are states in the Midwest census region? (Select
all that apply.)
Kansas, Missouri, Ohio - correct
Colorado
Kansas
Kentucky
Missouri
Ohio
Pennsylvania
Which was the state in the South census region with the largest number
of interviewees?
Texas - correct
From table(poll$State, poll$Region), we can identify the census
region of a particular state by looking at the region associated with all
its interviewees. We can read that Colorado is in the West region,
Kentucky is in the South region, Pennsylvania is in the Northeast
region, but the other three states are all in the Midwest region. From
the same chart we can read that Texas is the state in the South region
with the largest number of interviewees, 72.
Another way to approach these problems would have been to subset
the data frame and then use table on the limited data frame. For
instance, to find which states are in the Midwest region we could have
used:
MidwestInterviewees = subset(poll, Region=="Midwest")
table(MidwestInterviewees$State)
and to find the number of interviewees from each South region state
we could have used:
SouthInterviewees = subset(poll, Region=="South")
table(SouthInterviewees$State)
How many interviewees reported having used the Internet and having
used a smartphone?
470
How many interviewees reported having used the Internet but not having
used a smartphone?
285
How many interviewees reported having used a smartphone but not
having used the Internet?
17
How many interviewees have a missing value for their Internet use?
1
How many interviewees have a missing value for their smartphone use?
43
The number of missing values can be read from summary(poll)
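One hedged way to answer this whole group of questions at once (a sketch; Internet.Use and Smartphone are coded 0/1, as in the subset command of Problem 2.3 below):
table(poll$Internet.Use, poll$Smartphone)   # rows: Internet.Use, columns: Smartphone
summary(poll$Internet.Use)                  # reports the missing value for Internet use
summary(poll$Smartphone)                    # reports the missing values for smartphone use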
PROBLEM 2.3 - INTERNET AND SMARTPHONE USERS
Use the subset function to obtain a data frame called "limited", which is
limited to interviewees who reported Internet use or who reported
smartphone use. In lecture, we used the & symbol to use two criteria to
make a subset of the data. To only take observations that have a certain
value in one variable or the other, the | character can be used in place of
the & symbol. This is also called a logical "or" operation.
How many interviewees are in the new data frame?
792
The new data frame can be constructed with:
limited = subset(poll, Internet.Use == 1 | Smartphone == 1)
The number of rows can be computed with nrow(limited).
Important: For all remaining questions in this assignment please use the
limited data frame you created in Problem 2.3.
PROBLEM 3.1 - SUMMARIZING OPINIONS ABOUT INTERNET PRIVACY
Which variables have missing values in the limited data frame? (Select
all that apply.)
Smartphone, Age, Conservativeness, Worry.About.Info, Privacy.Importance, Anonymity.Possible, Tried.Masking.Identity, Privacy.Laws.Effective - correct
Internet.Use
Smartphone
Sex
Age
State
Region
Conservativeness
Info.On.Internet
Worry.About.Info
Privacy.Importance
Anonymity.Possible
Tried.Masking.Identity
Privacy.Laws.Effective
EXPLANATION
You can read the number of missing values for each variable from
summary(limited)
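An equivalent sketch that counts the missing values per variable directly:
colSums(is.na(limited))   # number of NA values in each column of the limited data frame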
How many interviewees reported a value of 0 for Info.On.Internet?
105
How many interviewees reported the maximum value of 11 for
Info.On.Internet?
8
EXPLANATION
These can be read from table(limited$Info.On.Internet)
Note that we did not divide by 792 (the total number of people in the
data frame) to compute this proportion.
An easier way to compute this value is from the summary(limited)
output. The mean value of a variable that has values 1 and 0 will be
the proportion of the values that are a 1.
What proportion of interviewees who answered the Anonymity.Possible
question think it is possible to be completely anonymous on the Internet?
0.3691899
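The worked command for this answer is not shown above; a hedged sketch, assuming Anonymity.Possible is coded 0/1 with NA for non-respondents (consistent with the mean trick described earlier):
table(limited$Anonymity.Possible)                # counts of 0 and 1 among those who answered
mean(limited$Anonymity.Possible, na.rm = TRUE)   # proportion of respondents answering 1, about 0.369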
Build a histogram of the age of interviewees with hist(limited$Age). Which age group is best represented in the dataset?
People aged about 60 years old - correct
People aged about 80 years old
From hist(limited$Age), we see the histogram peaks at around 60
years old.
Both Age and Info.On.Internet are variables that take on many values, so
a good way to observe their relationship is through a graph. We learned
in lecture that we can plot Age against Info.On.Internet with the
command plot(limited$Age, limited$Info.On.Internet). However,
because Info.On.Internet takes on a small number of values, multiple
points can be plotted in exactly the same location on this graph.
What is the largest number of interviewees that have exactly the same
value in their Age variable AND the same value in their Info.On.Internet
variable? In other words, what is the largest number of overlapping
points in the plot plot(limited$Age, limited$Info.On.Internet)? (HINT:
Use the table function to compare the number of observations with
different values of Age and Info.On.Internet.)
6
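A sketch of the hinted table approach, using the limited data frame from Problem 2.3:
ageInfoCounts = table(limited$Age, limited$Info.On.Internet)   # counts for each (Age, Info.On.Internet) pair
max(ageInfoCounts)                                             # largest number of overlapping points, 6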
To avoid the overlapping points, we can use the jitter() function. What does jitter add to the values passed to it?
jitter adds or subtracts a small amount of random noise to the values passed to it, and two runs will yield different results - correct
jitter adds or subtracts a small amount of random noise to the values passed to it, and two runs will yield the same result
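A hedged sketch of how jitter could be applied to the earlier plot (the exact plotting call is an assumption):
plot(jitter(limited$Age), jitter(limited$Info.On.Internet))   # random noise on both axes separates overlapping points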
What is the average Info.On.Internet value for non-smartphone users?
2.922807
0.1925466
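For questions like the one above that break a variable down by smartphone use, one hedged sketch (tapply applies the function to each group defined by the second argument):
tapply(limited$Info.On.Internet, limited$Smartphone, mean, na.rm = TRUE)   # group 0 (non-users) should match 2.922807 above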