Business Statistics Final Exam Example
Spring 2018
• The standard error for the difference in the averages between groups a and b is
defined as:
$$ s_{(\bar{X}_a - \bar{X}_b)} = \sqrt{\frac{s_a^2}{n_a} + \frac{s_b^2}{n_b}} $$
where $s_a^2$ denotes the sample variance of group a and $n_a$ the number of observations
in group a.
Good Luck!
Honor Code Pledge: “I pledge my honor that I have not violated the Honor Code
during this examination.”
Signed:
Name:
Problem 1: Who’s to blame? (10 points)
In manufacturing its iPhone, Apple buys a particular kind of microchip from 3 suppliers:
30% from Freescale, 20% from Texas Instruments and 50% from Samsung.
Apple has extensive histories on the reliability of the chips and knows that 3% of the
chips from Freescale are defective; 5% from Texas Instruments are defective and 4%
from Samsung are defective.
In testing a newly assembled iPhone, Apple found the microchip to be defective. What
provider is the likely culprit?
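One way to work this out is Bayes' rule: weight each supplier's share of purchases by its defect rate, then normalize by the overall defect probability. The sketch below uses only the percentages stated in the problem; the variable names are illustrative.

```python
# Supplier shares and defect rates, as stated in the problem.
share = {"Freescale": 0.30, "Texas Instruments": 0.20, "Samsung": 0.50}
defect_rate = {"Freescale": 0.03, "Texas Instruments": 0.05, "Samsung": 0.04}

# Joint probability P(supplier and defective) for each supplier.
joint = {s: share[s] * defect_rate[s] for s in share}

# Overall probability that a chip is defective (law of total probability).
p_defective = sum(joint.values())

# Posterior P(supplier | defective) by Bayes' rule.
posterior = {s: joint[s] / p_defective for s in share}

most_likely = max(posterior, key=posterior.get)
for s, p in posterior.items():
    print(f"P({s} | defective) = {p:.3f}")
print("Most likely culprit:", most_likely)
```

With these numbers Samsung comes out as the most likely source (posterior probability of about 0.51), because its large share of purchases outweighs its middling defect rate.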
Problem 2: Breaking Bad... (10 points each)
Two chemists working for a chicken fast food company have been
producing a very popular sauce. Let’s call them Jesse and Mr. White. Gus, their
boss, is tired of Mr. White’s negative attitude and is thinking about “firing” him and
keeping only Jesse on payroll. The problem, however, is that Mr. White seems to
produce a higher quality sauce whenever he is in charge of production if compared to
Jesse. Before making a final decision, Gus collected some data measuring the quality
of different batches of sauce produced by Mr. White and Jesse. The results, measured
on a quality scale, are listed below:
Two questions:
1. Based on this data, can we tell for sure which one is the better chemist?
(b) Hypothesis testing
Null hypothesis: the mean quality scores of the two chemists are the same, H0 : µ1 = µ2, which
is equivalent to H0 : µ1 − µ2 = 0.
The difference of the sample means is 97 − 94 = 3. The standard error of the difference
of the means is
$$ \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{1}{7} + \frac{3^2}{10}} = 1.02. $$
The t statistic is 3/1.02 = 2.94 > 2, so we reject the null hypothesis at the 5% significance level. Mr.
White is better.
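The arithmetic in this test can be checked with a short sketch. The summary statistics below are read off the solution (means 97 and 94, standard deviations 1 and 3, sample sizes 7 and 10); assigning the first set to Mr. White and the second to Jesse is an assumption, since the data table is not reproduced here.

```python
import math

# Summary statistics taken from the solution text (assumed group assignment).
mean_white, s_white, n_white = 97.0, 1.0, 7
mean_jesse, s_jesse, n_jesse = 94.0, 3.0, 10

# Standard error of the difference in sample means.
se_diff = math.sqrt(s_white**2 / n_white + s_jesse**2 / n_jesse)

# t statistic for H0: mu_white - mu_jesse = 0.
t_stat = (mean_white - mean_jesse) / se_diff

# Rule-of-thumb cutoff of 2, as used in the solution.
reject_null = abs(t_stat) > 2

print(f"standard error = {se_diff:.2f}")
print(f"t statistic    = {t_stat:.2f}")
print("reject H0:", reject_null)
```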
2. Gus wants to keep the mean quality score for the sauce above 90. In this case, can
he get rid of Mr. White, i.e., is Jesse good enough to run the sauce production?
Yes. The 95% confidence interval for Jesse’s mean quality score is [92.102, 95.898],
which lies entirely above 90.
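The interval can be reproduced as the sample mean plus or minus two standard errors. The inputs (mean 94, standard deviation 3, n = 10) are inferred from the solution rather than from the data table, and the multiplier of 2 is the rule of thumb that matches the interval quoted above; an exact t quantile would give a slightly different interval.

```python
import math

# Jesse's summary statistics, inferred from the solution text.
mean_jesse, s_jesse, n_jesse = 94.0, 3.0, 10

# Standard error of the sample mean.
se_mean = s_jesse / math.sqrt(n_jesse)

# Approximate 95% confidence interval: mean +/- 2 standard errors.
half_width = 2 * se_mean
lower, upper = mean_jesse - half_width, mean_jesse + half_width

print(f"95% CI for Jesse's mean: [{lower:.3f}, {upper:.3f}]")
print("entire interval above 90:", lower > 90)
```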
Problem 3: Portfolios (5 points each)
We’re considering building a portfolio from three investments: a fund tracking the
SP500, a bond fund, and a fund of large cap stocks. The portfolios under consideration
are:
Returns on the large cap fund and the bond fund have the same expected value and
standard deviation. Historically, there is a small negative correlation between the bond
and SP500 funds, and a small positive correlation between the large cap and SP500
funds. The returns on each investment have normal distributions.
Using only the information given above, choose the single correct response to each
question below:
(a) (4 points) What is the relationship between the expected returns for each portfolio?
Portfolio A has higher expected returns
Portfolio B has higher expected returns
Both portfolios have the same expected returns - correct
Impossible to say without more information
(b) (4 points) If we want the portfolio with the largest Sharpe ratio, which portfolio
should we choose?
Portfolio A - correct
Portfolio B
Either one; their Sharpe ratios are the same
Impossible to say without more information
(c) (4 points) If we want the portfolio with the most potential for growth (say, the
portfolio that is most likely to generate returns greater than its average plus 2%),
which portfolio should we choose?
Portfolio A
Portfolio B - correct
Either one; they are equally likely to generate returns greater than their
average plus 2%
Impossible to say without more information
Problem 4 (2 points each)
(a) 5
(b) 9
(c) 7 - correct
(d) 8
(a) 9
(b) 81 - correct
(c) 3
(d) 6
(a) 15%
(b) 68%
(c) 98%
(d) 87% - correct
(a) 5%
(b) 23% - correct
(c) 2.5%
(d) 34%
Problem 5 (5 points each)
In trying to validate their claim and make sure that SDS is a good fund that appropriately
tracks its target, I decided to collect data on monthly returns (in percentage
terms) of SDS and the S&P500 Index since 2009 and run the following regression:
Regression Statistics
Multiple R 0.994
R Square 0.989
Adjusted R Square 0.988
Standard Error 0.760
Observations 62.000
ANOVA
df SS MS F Significance F
Regression 1.000 3024.488 3024.488 5242.184 0.000
Residual 60.000 34.617 0.577
Total 61.000 3059.106
1. In trying to evaluate the claim made by ProShares, test the appropriate hypotheses
about β0 . What is your conclusion?
2. In trying to evaluate the claim made by ProShares, test the appropriate hypotheses
about β1 . What is your conclusion?
3. What is your final evaluation? Is SDS a good ETF? Justify your answer (and
don’t forget to address the estimate of σ 2 ).
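It’s not a good ETF. The regression standard error is 0.760, so the estimate of σ 2 is
0.577 (the residual mean square). Monthly SDS returns therefore deviate from the fitted
line by roughly ±1.5 percentage points (two standard errors), which means that the ETF
does not track the S&P 500 index well.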
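The summary figures quoted in this answer follow directly from the ANOVA table. The sketch below recomputes them from the sums of squares printed in the regression output; the variable names are illustrative.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table.
ss_regression = 3024.488
ss_residual = 34.617
ss_total = 3059.106
df_residual = 60

# R-squared: share of total variation explained by the regression.
r_squared = ss_regression / ss_total

# Estimate of sigma^2 (residual mean square) and the regression standard error.
sigma2_hat = ss_residual / df_residual
s = math.sqrt(sigma2_hat)

print(f"R-squared       = {r_squared:.3f}")
print(f"sigma^2 estimate = {sigma2_hat:.3f}")
print(f"standard error   = {s:.3f}")
print(f"two standard errors = {2 * s:.2f} percentage points")
```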
Problem 6: Crime data from our homework
(5 points each)
Let’s recall the “Crime” vs. “Police” example from our homework. There, we were
trying to understand the effect of more police on crime and we couldn’t just get data
from a few different cities and run the regression of “Crime” on “Police”. The problem
here is that data on police and crime cannot tell the difference between more police
leading to crime or more crime leading to more police... in fact I would expect to see a
potential positive correlation between police and crime if looking across different cities
as mayors probably react to increases in crime by hiring more cops. Again, it would be
nice to run an experiment and randomly place cops in the streets of a city on different
days and see what happens to crime. Obviously we can’t do that!
The researchers from UPENN mentioned in the homework were able to estimate this
effect by using what we call a natural experiment. They were able to collect data on
crime in DC and also relate that to days in which there was a higher alert for potential
terrorist attacks. Why is this a natural experiment? Well, by law the DC mayor has
to put more cops in the streets during the days in which there is a high alert. That
decision has nothing to do with crime so it works essentially as an experiment.
Here is the main table displaying the results from the analysis:
TABLE 2
Total Daily Crime Decreases on High-Alert Days
                            (1)         (2)
High Alert                -7.316*     -6.046*
                          (2.877)     (2.537)
Log(midday ridership)                 17.341**
                                      (5.309)
R2                          .14         .17
IV. Results
The results from our most basic regression are presented in Table 2, where
we regress daily D.C. crime totals against the terror alert level (1 = high,
Answer the following questions:
1. Why was it not enough to present the results from column (1) in the table? Why
did they have to include the METRO ridership variable?
There are confounding effects that may induce omitted variable bias. For example,
METRO ridership may be correlated with crime: on high-alert days people go out
less, so there are fewer targets for crime and daily crime goes down. This effect
is not caused by the increase in police.
2. Can you explain why the estimates of the impact of police on crime from the
columns are different?
The direction of the change is as expected. Part of the effect in column (1) is
explained by the fact that a high alert causes people to go out less, so the
coefficient for High Alert in column (2) is smaller in absolute value, and the
coefficient for Log(midday ridership) has a significant effect on crime.
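The comparison between the two columns can be made concrete with a short sketch. It takes the High Alert coefficients as negative, as the table title indicates, and assumes the parenthesized figures are standard errors, which is the usual convention but is not stated in the excerpt.

```python
# Coefficients and (assumed) standard errors from Table 2.
estimates = {
    "(1) High Alert": (-7.316, 2.877),
    "(2) High Alert": (-6.046, 2.537),
    "(2) Log(midday ridership)": (17.341, 5.309),
}

# t ratio = coefficient / standard error; |t| > 2 is the rule-of-thumb cutoff.
t_ratios = {name: coef / se for name, (coef, se) in estimates.items()}

# How much the High Alert effect shrinks once ridership is controlled for.
coef_1 = estimates["(1) High Alert"][0]
coef_2 = estimates["(2) High Alert"][0]
shrinkage = (abs(coef_1) - abs(coef_2)) / abs(coef_1)

for name, t in t_ratios.items():
    print(f"{name}: t = {t:.2f}")
print(f"High Alert effect shrinks by {shrinkage:.0%} in column (2)")
```

Under these assumptions all three t ratios exceed 2 in absolute value, and the High Alert coefficient shrinks by about 17% once ridership is included.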
Problem 7: House Prices (2 points each)
Let’s go back to the Midcity housing prices dataset from our homework... For simplicity
I have combined the two cheap neighborhoods into one group so we are left with only
two neighborhoods.
where NBH is a dummy variable that takes the value 1 if the house is in neighborhood
2 and BRICK is a dummy variable that equals 1 if the house is made out of brick. The
figure below displays the results from the regression. This is a graphical representation
of the estimates of all coefficients in this regression.
[Figure: fitted regression lines of Price against Size for Nbhd = 1, Nbhd = 2, and Nbhd = 2 and Brick = 1]
(a) 65.32
(b) 30.45
(c) 17.98
(d) 49.85 - correct
2. What is the estimated value for the effect of Size on Prices for houses in
neighborhood 2?
(a) 65.32
(b) 49.85 - correct
(c) 20.31
(d) 12.67
(a) 15.76
(b) 38.61
(c) 26.08 - correct
(d) 52.10
4. What is the estimated average difference between a 1,800 sqft wood house in
neighborhood 2 and neighborhood 1?
Problem 8: House Prices again! (2 points each)
Continuing the analysis of the MidCity data (same as the previous question), I now decided
to investigate whether or not the effect of Size on Prices changes in the different
neighborhoods. To this end, I worked with the following model:
[Figure: fitted regression lines of Price against Size for Nbhd = 1, Nbhd = 2, and Nbhd = 2 and Brick = 1]
1. In model 2, what is the estimated value for the effect of Size on Prices for houses
in neighborhood 1?
(a) 71.30
(b) 30.45
(c) 17.98
(d) 51.27 - correct
2. In model 2, what is the estimated value for the effect of Size on Prices for houses
in neighborhood 2?
(a) 75.23
(b) 46.67 - correct
(c) 20.31
(d) 51.27
(a) 46.67
(b) 51.27
(c) 13.15
(d) -4.60 - correct
4. What is the t-stat for the difference between the slope for Size in the two
neighborhoods?
(a) 2.15
(b) -4.44
(c) -0.35 - correct
(d) 5.63
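The answers marked correct in this problem fit together arithmetically, which the sketch below checks. Reading -4.60 as the difference between the two Size slopes is an inference from the options, since the model equation and the text of question 3 are not reproduced here.

```python
# Size slopes marked correct in questions 1 and 2 of this problem.
slope_nbhd1 = 51.27
slope_nbhd2 = 46.67

# Difference between the slopes (neighborhood 2 minus neighborhood 1).
slope_difference = slope_nbhd2 - slope_nbhd1

# t-stat marked correct in question 4; |t| > 2 is the rule-of-thumb cutoff.
t_stat_difference = -0.35
is_significant = abs(t_stat_difference) > 2

print(f"slope difference = {slope_difference:.2f}")
print("difference is statistically significant:", is_significant)
```

With a t-stat of -0.35, the difference in slopes is well inside the ±2 band, so this output gives no evidence that the effect of Size differs between the two neighborhoods.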
Problem 9: Medal Count
(3 points each)
Using data from Beijing 2008 and London 2012 I ran a regression trying to understand
the impact of GDP (gross domestic product measured in billions of US$) and Population
(in millions of people) on the total number of medals won by a country in
the summer Olympics. The results are
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.82488
R Square 0.68043
Adjusted R Square 0.67660
Standard Error 10.83097
Observations 170.00000
ANOVA
df SS MS F Significance F
Regression 2.00000 41712.86080 20856.43040 177.78909 0.00000
Residual 167.00000 19590.76273 117.30996
Total 169.00000 61303.62353
No. The intercept corresponds to the case where population and GDP are both
zero, and a country with zero population cannot exist, so the intercept has no
meaningful interpretation.
(b) Provide an interpretation for the coefficients associated with Population and
GDP.
(c) What is the t-stat for Population telling you? Clearly explain the hypothesis
being tested and your conclusion.
(d) From the results, give a 95% prediction interval for the total number of medals for
the U.S. in the Rio 2016 Olympics, given that the current U.S. GDP is 18.5
trillion dollars and its population is 300 million.
The following table shows the total medal count for a few countries in Rio 2016 Olympics
along with their current GDP and Population:
Country Total Medals GDP (in US$ billions) Population (in millions)
U.S. 121 18,500 300
Great Britain 67 2,800 64
China 70 11,300 1,357
Brazil 19 1,600 200
India 2 1,877 1,250
Holland 19 853 16.8
Fiji 1 3.8 0.881
(e) Using the results from the regression, which of these countries’ performances in
Rio 2016 are not surprising? Why?
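2 × standard error = 2 × 10.83 = 21.66. A country’s result is not surprising if its
actual medal count lies within 21.66 of the model’s prediction.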
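The two-standard-error rule can be written as a small check. The helper name `is_unsurprising` is hypothetical, and the predicted medal counts have to come from the fitted coefficients, which are not reproduced in this text, so no predictions are filled in here.

```python
# Regression standard error from the first summary output.
standard_error = 10.83097

# Half-width of the approximate 95% prediction band.
band = 2 * standard_error

def is_unsurprising(actual, predicted, half_width=band):
    """Return True if the actual count lies within the band around the prediction."""
    return abs(actual - predicted) <= half_width

print(f"band half-width = {band:.2f} medals")
```

For each country in the table, `predicted` would be the intercept plus the GDP and Population coefficients times that country's values, and `actual` its Rio 2016 medal total.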
(f ) Based on the regression results, rank the performance of these countries in the Rio
Olympics. Explain your ranking methodology.
I proceeded to add a dummy variable for the host country into the regression... I also
ran a regression with only GDP and Host. The results are below:
Regression Statistics
Multiple R 0.8639
R Square 0.7462
Adjusted R Square 0.7417
Standard Error 9.6805
Observations 170.0000
ANOVA
df SS MS F Significance F
Regression 3.0000 45747.2827 15249.0942 162.7214 0.0000
Residual 166.0000 15556.3409 93.7129
Total 169.0000 61303.6235
Regression Statistics
Multiple R 0.86332
R Square 0.74532
Adjusted R Square 0.74227
Standard Error 9.66902
Observations 170.00000
ANOVA
df SS MS F Significance F
Regression 2.00000 45690.80714 22845.40357 244.36222 0.00000
Residual 167.00000 15612.81639 93.48992
Total 169.00000 61303.62353
(h) Of the 3 models presented, which one is the best in your opinion? Carefully explain
why?
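The model with GDP and the host dummy is the best one. It has the smallest
regression standard error (9.669, versus 9.681 for the model that also includes
Population and 10.831 for the first model) and the highest adjusted R Square (0.742).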
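The comparison can be reproduced from the three ANOVA tables. The sketch below recomputes each model's regression standard error and adjusted R Square from the residual sum of squares and degrees of freedom; the model labels are descriptive names chosen here, not labels from the output.

```python
import math

n = 170                     # observations in every model
ss_total = 61303.62353      # total sum of squares (same response in all models)

# Residual sum of squares and residual degrees of freedom from each ANOVA table.
models = {
    "GDP + Population": (19590.76273, 167),
    "GDP + Population + Host": (15556.3409, 166),
    "GDP + Host": (15612.81639, 167),
}

results = {}
for name, (ss_residual, df_residual) in models.items():
    mse = ss_residual / df_residual                 # residual mean square
    std_error = math.sqrt(mse)                      # regression standard error
    adj_r2 = 1 - mse / (ss_total / (n - 1))         # adjusted R-squared
    results[name] = (std_error, adj_r2)
    print(f"{name}: s = {std_error:.3f}, adjusted R2 = {adj_r2:.4f}")

# Pick the model with the smallest regression standard error.
best = min(results, key=lambda name: results[name][0])
print("smallest standard error:", best)
```

Dropping Population lowers the standard error slightly even though the residual sum of squares rises, because the model gains a residual degree of freedom.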
(i) In the last model presented, provide an interpretation for the coefficient associated
with Host.
Holding GDP fixed, being the host country is associated with 50.15148 more medals than
not being the host country.
(j) Using your chosen model, evaluate Brazil’s performance in the Rio Olympics. Compare
and explain the difference in the results if you were to talk about Brazil’s
performance based on the first regression.
So under this model Brazil performed poorly in the Rio Olympics given that it was
the host country: the estimated host effect alone (about 50 medals) exceeds
Brazil’s total of 19.