Homework 1
Problem 1
Part A: We throw two dice. Each one is a normal die with six equally-weighted sides.
What is the probability that the sum of the two numbers is less than 7?
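Sums of the two dice (rows: first die; columns: second die)
    1  2  3  4  5  6
1   2  3  4  5  6  7
2   3  4  5  6  7  8
3   4  5  6  7  8  9
4   5  6  7  8  9 10
5   6  7  8  9 10 11
6   7  8  9 10 11 12
Of the 36 equally likely outcomes, 15 have a sum less than 7, so the probability is
15/36 = 5/12 ≈ 0.417.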
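The table of sums can be double-checked by brute-force enumeration. The snippet below is a minimal sketch in Python (an assumption on my part, since the course itself uses R); the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely (first die, second die) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

# Keep only the outcomes whose sum is strictly less than 7.
favorable = [pair for pair in outcomes if sum(pair) < 7]

# With equally likely outcomes, probability = favorable count / total count.
p_sum_below_7 = Fraction(len(favorable), len(outcomes))

print(len(favorable), "of", len(outcomes), "outcomes")
print(p_sum_below_7, "=", round(float(p_sum_below_7), 3))
```

Counting outcomes directly avoids having to read the table by eye, and the same loop works for any other threshold on the sum.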
Part B: Suppose that the overall probability of living to age 70 is 0.62 and that the
overall probability of living to age 80 is 0.23. If a person reaches their 70th birthday,
what is the probability that they will live to age 80?
P(70) = 0.62
P(80 ∩ 70) = P(80) = 0.23, because an individual who lives to age 80 has already lived to age 70.
The probability of living to age 80 if a person reaches their 70th birthday is equal to
P(80 ∩ 70)/P(70) = 0.23/0.62 ≈ 0.371.
Part C: At the Iron Bank, 62% of customers have checking accounts, 24% have savings
accounts, and 17% have both checking and savings accounts. Of the Iron Bank customers
who hold checking accounts, what percentage also have a savings account?
P(C) = 0.62, P(S) = 0.24, P(C,S) = 0.17
P(S|C) = P(C,S)/P(C)   (definition of conditional probability)
       = 0.17/0.62 ≈ 0.274
Of those who hold checking accounts, 0.274 (0.17/0.62) or 27.4% also have a savings account.
The diagram shows the percentage of customers holding each type of account, the overlap of
customers who hold both, and, outside the circles, those who hold neither account.
P(C,S) is that overlap. Dividing it by the total percentage of checking-account holders
gives the share who have a savings account given that they have a checking account.
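The arithmetic behind the diagram can be sketched in a few lines. This is a minimal illustration in Python (an assumption, since the course uses R); the helper name `conditional` is hypothetical and not part of the assignment.

```python
# Values given in Part C.
p_checking = 0.62  # P(C)
p_savings = 0.24   # P(S)
p_both = 0.17      # P(C,S), the overlap of the two circles


def conditional(p_joint, p_given):
    """Return P(A|B) = P(A,B) / P(B)."""
    if p_given <= 0:
        raise ValueError("the conditioning event must have positive probability")
    return p_joint / p_given


# Share of checking-account holders who also hold a savings account.
p_savings_given_checking = conditional(p_both, p_checking)

# Region outside both circles: 1 - P(C or S), using inclusion-exclusion.
p_neither = 1 - (p_checking + p_savings - p_both)

print(round(p_savings_given_checking, 3))  # 0.274
print(round(p_neither, 2))                 # 0.31
```

Note that the denominator is P(C), not P(S): swapping the conditioning event gives a different quantity, P(C|S) = 0.17/0.24.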
Part D: With the same numbers from Part C, what is the probability that an Iron Bank
Part E: Again with the same numbers from Part C, what proportion of Iron Bank
Problem 2
Return to the data on playlists from a music streaming service contained in plays_top50.csv.
Using the walkthrough on plays_top50.R as a starting point, please answer the following
questions.
Part A: Using the R functions xtabs and prop.table, make a 2x2 table of conditional
probabilities, where the bottom right entry in your table shows the P(plays Bob Dylan |
plays the Beatles). These variables are called bob.dylan and the.beatles in this data set.
Part B: Are the events “plays Coldplay” and “plays Radiohead” independent? Why or
Problem 3
Visitors to your website are asked to answer a single “yes or no” survey question before
they get access to the content on the page. Among all of the visitors to this page, there are
two categories: Random Clicker (RC) and Truthful Clicker (TC). There are two possible
answers to the survey: yes and no. Random clickers would click either one with equal
probability. You are also given the information that the overall fraction of random
clickers is 0.3.
After a trial period, you get the following survey results: 65% said Yes and 35% said No.
What is your estimate for P(yes | TC), the probability that someone answers “yes” to the
click probability for truthful clickers, we can still multiply down using n as our variable
representing click rates for TC and solve using algebra as shown below:
P(no) = 0.3(0.5) + 0.7(1-n) = 0.15+0.7-0.7n = 0.35, so 0.7n = 0.5 and n = 0.5/0.7 ≈ 0.714 ← using the probability of “no” in the tree and the 35% survey result
P(yes|TC) ≈ 0.714
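The algebra can be checked numerically with the law of total probability, P(yes) = P(RC)·P(yes|RC) + P(TC)·P(yes|TC). The sketch below is in Python (an assumption, since the course uses R), and the variable names are illustrative.

```python
# Quantities given in the problem.
p_rc = 0.3            # fraction of random clickers
p_tc = 1 - p_rc       # fraction of truthful clickers
p_yes_given_rc = 0.5  # random clickers pick yes or no with equal probability
p_yes = 0.65          # observed overall share of "yes" answers

# Rearranged law of total probability:
#   P(yes|TC) = (P(yes) - P(RC) * P(yes|RC)) / P(TC)
p_yes_given_tc = (p_yes - p_rc * p_yes_given_rc) / p_tc

# Consistency check using the "no" branch of the tree instead.
p_no = 1 - p_yes
p_no_given_tc = (p_no - p_rc * (1 - p_yes_given_rc)) / p_tc

print(round(p_yes_given_tc, 3))                  # 0.714
print(round(p_yes_given_tc + p_no_given_tc, 3))  # 1.0
```

Solving through the "yes" branch or the "no" branch gives the same answer, which is a quick way to catch an arithmetic slip.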
Problem 4
Suppose that you work in the analytics office for one of the state’s largest health-insurance
companies. Your first assignment is to study the cost-effectiveness of instituting a universal
test for a disease called SOS. Your firm is thinking about making the test free and universal for
all 10 million of its clients in Texas. The tests themselves are a significant expense. Yet if
caught early, before the onset of its worst symptoms, SOS can be treated much more
cost-effectively. This could potentially save your firm a large amount of money down the road.
You are charged with understanding the properties of the test.
We know that SOS afflicts roughly 1 Texan out of every 1000, and let’s further assume that
your 10 million clients are a representative sample of all Texans (so you can expect this base
rate for Texas to hold for your clients as well). No medical test is perfect, but this one is
reasonably accurate: it gives a positive result for 95% of people who have SOS, and a negative
result for 99% of people who do not have SOS.
In light of these numbers, what is the posterior probability that a patient has SOS, given that
they test positive for the disease?
[Figure: probability tree of disease status (SOS or no SOS) and test result (positive or negative test)]
This tree is filled in with all of the given information in the problem, and I multiplied down each
branch to get the joint probabilities.
Using the formula for conditional probability, we can plug in the values from our tree to get our
posterior probability:
P(V|+) = P(V,+)/P(+)
= 0.00095/(0.00095+0.00999) = 0.00095/0.01094 ≈ 0.087
Therefore, the posterior probability that a patient has SOS given that they test positive for the
disease is about 0.087, or 8.7%.
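The calculation is an application of Bayes' rule, and it can be reproduced from the three numbers given in the problem. The sketch below is in Python (an assumption, since the course uses R); the names `prevalence`, `sensitivity`, and `specificity` are standard terms but do not appear in the assignment itself.

```python
# Quantities given in the problem.
prevalence = 0.001   # P(SOS): 1 Texan out of every 1000
sensitivity = 0.95   # P(+ | SOS)
specificity = 0.99   # P(- | no SOS)
n_clients = 10_000_000

# Multiply down each branch of the tree to get the joint probabilities.
p_sick_and_positive = prevalence * sensitivity                  # about 0.00095
p_healthy_and_positive = (1 - prevalence) * (1 - specificity)   # about 0.00999

# Total probability of a positive test, then Bayes' rule.
p_positive = p_sick_and_positive + p_healthy_and_positive
posterior = p_sick_and_positive / p_positive

# The same result as expected head counts among all clients.
true_positives = n_clients * p_sick_and_positive
false_positives = n_clients * p_healthy_and_positive

print(round(posterior, 3))                             # 0.087
print(round(true_positives), round(false_positives))   # 9500 99900
```

The head counts make the result intuitive: roughly 9,500 true positives are swamped by roughly 99,900 false positives, so most positive tests come from people without SOS.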
Problem 5
Download the s550.csv data set from the course Canvas site. This file contains a sample
of 1501 Mercedes S-Class S550 cars offered for sale in the United States via cars.com.
There are several variables in the data set, but for this problem the three relevant
variables are mileage, price, and year.
Using the ggplot function within the ggplot2 R library, make a scatterplot of price (y)
versus mileage (x), faceted by year. (That is, there should be a separate panel for each
year.) Format your figure so that it has 2 rows of 4 panels each, so that it looks wider than
it is tall.
The scatter plot above shows price (y) versus mileage (x) faceted by year using the function
Problem 6
Bike-sharing systems are a new generation of traditional bike rentals where the whole
process from rental to return is automatic. There are thousands of municipal bike-sharing
systems around the world (e.g. Citi bikes in NYC or “Boris bikes” in London), and they
have attracted a great deal of interest because of their important role in traffic,
environmental, and health issues—especially in the wake of the COVID-19 pandemic,
when ridership levels on public-transit systems have plummeted.
These bike-sharing systems also generate a tremendous amount of data, with time of
travel, departure, and arrival position recorded for every trip. This feature turns the bike
sharing system into a virtual sensor network that can be used for sensing mobility
patterns across a city.
Plot A: a line graph showing average bike rentals (total) by hour of the day (hr).
The graph above shows the average bike rentals according to the hour of the day. The x
axis represents the hour of the day while the y axis represents the average bike rentals
corresponding to those hours.
The peaks at 8am and 5pm are expected: these hours fall just before and after a typical
school or work day, when people are out riding bikes. The valleys at hours such as 4am and
midnight are also expected, because people are typically asleep or at home.
Take-home Lesson: From the plot, we can observe that riders are typically more active
during the day and are especially active during regular commuting times such as 8am and
5pm. However, we should expect activity to be low during the especially early/late hours of
the day when individuals are typically at home or sleeping.
Plot B: a faceted line graph showing average bike rentals by hour of the day, faceted
according to whether it is a working day (workingday).
The graph above shows the average bike rentals according to the hour of the day,
split by whether or not the day is a working day. The x axis represents the hour of the day
while the y axis represents the average bike rentals corresponding to those hours. Panel 0
represents the non-working days and Panel 1 represents the working days.
Typically, the hours of 8am and 5pm are most popular for commuting or being out on
working days as they correspond to the typical hours before and after school or work. On
weekends, it is expected that mid-afternoon would be the most popular time to be out
which is demonstrated in Panel 0. The hours between 12pm and 4pm are when most bike
rentals occur and we can assume it is because people do not work/go to school on those
days and have that free time in the afternoon.
Take-home Lesson: From the plot, we can observe that riders are typically active in the
early morning and evenings on working days and are especially active during most of the
afternoon on non-working days.
Plot C: a faceted bar plot showing average ridership during the 8 AM hour by weather
situation code (weathersit), faceted according to whether it is a working day or not.
The graph above shows the average bike rentals (y axis) during the 8am hour for each
weather situation code on the x axis (each number corresponds to a different weather
condition). The data is also split by working day: Panel 0 represents the non-working
days and Panel 1 represents the working days.
At 8am on both working days and non-working days, average bike rentals are higher
when the weather is clearer and ‘ideal’ for being outdoors, and lower when the weather
is snowy or stormy. Although there are fewer rides in the morning on non-working days,
likely because individuals do not have to be up as early to go to work, the trend with
weather is similar in both panels and is predictable.
Take-home Lesson: From the plot, we can observe that riders are typically more active
in the mornings when the weather is clearer or ‘ideal’ (not stormy or snowy), and that more
severe weather results in fewer rentals. Additionally, we can expect fewer rentals during
mornings on non-working days, as fewer individuals are up and commuting to school or work.