Mini Tests
Instructions
Notation and terminology used below are in line with the lectures (see the lecture presentations
on the course website).
Questions
1. The following regression output was obtained in Gretl using a cross-sectional data set containing
the variables wage (a respondent’s hourly wage in USD) and educ (years of completed education):
Write down the estimated regression function and interpret the estimated coefficients; for the
coefficient on educ, provide both a descriptive and a causal interpretation.
2. Our aim is to quantify a causal link leading from x to y. Explain why a randomized experiment,
where x is assigned to subjects at random, enables us to interpret the β1 coefficient in the regression
equation y = β0 + β1x + u in a causal fashion.
3. Using a cross-sectional dataset with data on 60 cities, we obtained the sample correlation between
car thefts and policemen per capita. The value is −0.26, a result significantly different from zero at
the conventional 5% significance level. Describe different causal relationships that might have
caused (or contributed to) this correlation. (Try to justify each of the three causation schemes from
the lectures.)
4. Give examples of a descriptive, causal and forecasting research question in empirical research.
5. Formulate a simple regression model both in a descriptive setting (using the population regression
function) and in a causal setting (using a structural equation).
6. Give an example of a cross-sectional and a time-series dataset (sketch a part of the data table
showing the first three observations). Time series data are typically considered more difficult to
analyze statistically; why is that so?
7. Explain the difference between a repeated (or pooled) cross section and a panel data set.
8. The table below shows the average wage in a population of interest, broken down by education
groups. What is the average wage in the entire population? Formally put, the table lists the values
of E(wage | education) and the probability mass function of education. In this formalized setting,
your calculation should follow the law of iterated expectations; write down an expression that
formalizes the calculation you used to obtain the population average (or: the unconditional
expectation).
9. Using the ordinary least squares method (OLS), we estimate the β1 coefficient in the regression
model y = β0 + β1x + u. What is the relationship between the coefficient estimate and the sample
covariance of x and y?
10. Using the ordinary least squares method (OLS), we estimate the β1 coefficient in the regression
model y = β0 + β1x + u. Which of the following can possibly happen? Give a yes/no response for
each item and explain all negative responses.
a) β̂1 = 4.25, sample correlation of x and y is 2.11.
b) β̂1 = 2.27, sample covariance of x and y is 0.11.
c) β̂1 = 0.02, sample correlation of x and y is 0.
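The link between the OLS slope, the sample covariance and the sample correlation can be checked numerically. The sketch below is only an illustration: the data are hypothetical and the helper name sample_moments is an assumption, not something taken from the lectures.

```python
import math

def sample_moments(x, y):
    """Return (cov_xy, var_x, var_y), all computed with the n - 1 divisor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    return cov_xy, var_x, var_y

# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

cov_xy, var_x, var_y = sample_moments(x, y)

# OLS slope written as sample covariance over sample variance of x.
slope_from_cov = cov_xy / var_x

# The same slope written as correlation times the ratio of standard deviations.
corr_xy = cov_xy / math.sqrt(var_x * var_y)
slope_from_corr = corr_xy * math.sqrt(var_y / var_x)
```

Because the sample variance of x is positive, the slope estimate, the covariance and the correlation always share the same sign, and the correlation stays within [−1, 1].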
11. What squares exactly are being minimized if OLS is applied to the model y = β0 + β1x + u?
12. Using data on a sample of n = 147 respondents, we estimate using OLS the linear regression model
y = β0 + β1x + u. Write down an expression for the sum of squared residuals.
13. Using data on a sample of n = 147 respondents, we estimate using OLS the linear regression model
y = β0 + β1x + u. What is the range of possible values of the sum of residuals? Explain.
14. We are going to estimate a structural equation y = β0 + β1x + u, with the intention of interpreting
the regression coefficients in a causal manner. What is the crucial assumption about the random error
that we need to make? Does the assumption imply anything about the correlation of x and u?
15. In a simple regression fitted by OLS, show that the sample mean of y (actual values) equals the
sample mean of y (fitted values). Hint: From our derivations of the OLS estimator, we know a very
useful fact about the sum of residuals (û) that can be put to good effect.
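A numerical check is not a proof, but it can make the two properties concrete. The sketch below fits a line by OLS to hypothetical data and computes the sum of residuals and the two sample means; the function name fit_line and the data are assumptions made for illustration.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data.
x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [3.0, 6.5, 6.0, 10.0, 11.5]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * a for a in x]
residuals = [b - f for b, f in zip(y, fitted)]

sum_residuals = sum(residuals)           # zero up to rounding error
mean_actual = sum(y) / len(y)
mean_fitted = sum(fitted) / len(fitted)  # equals mean_actual up to rounding
```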
16. The sample means of x and y are 4.5 and 6, respectively. Fill in the value of the slope parameter of a
regression line fitted by OLS: ŷ = 3 + __ x. Explain. Hint: See lecture 2, slide 13.
17. Using OLS, we estimated the sample regression function wage = 1620 + 2.0 height, where wage
is a respondent’s monthly wage in EUR and height is measured in centimetres. Write down the
estimated function that we would have obtained if we expressed height in inches instead of
centimetres (1 inch = 2.54 cm) and wage in 000s EUR (thousands of euros) instead of EUR.
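The effect of changing units can be sketched with a small helper. It assumes the new variables are defined as new_x = x / x_factor and new_y = y / y_factor; the function name and the numbers in the example are hypothetical and deliberately differ from those in the question.

```python
def rescale_fit(intercept, slope, x_factor, y_factor):
    """Coefficients of the fitted line after a change of units.

    Assumes new_x = x / x_factor and new_y = y / y_factor, so that
    y = intercept + slope * x becomes
    new_y = intercept / y_factor + (slope * x_factor / y_factor) * new_x.
    """
    new_intercept = intercept / y_factor
    new_slope = slope * x_factor / y_factor
    return new_intercept, new_slope

# Hypothetical fit: weight (grams) = 500 + 40 * length (cm).
# Convert weight to kilograms (divide by 1000) and length to metres (divide by 100).
new_intercept, new_slope = rescale_fit(500.0, 40.0, x_factor=100.0, y_factor=1000.0)
```

In this hypothetical case one extra metre equals 100 extra centimetres, i.e. 4000 g or 4 kg, which matches the rescaled slope.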
18. The following regression output was obtained in Gretl using a cross-sectional data set on 168 used
cars (Škoda Felicias) containing the variables price (price of a used car in CZK) and age (age of
the used car in years); l_price is the natural log of price.
Write down the estimated regression function and interpret the estimated slope coefficient (use
either causal or descriptive interpretation). Next, interpret the R-squared.
19. The following regression output was obtained in Gretl using a cross-sectional data set on 168 used
cars (Škoda Felicias) containing the variables price (price in CZK) and km (kilometres travelled);
l_price is the natural log of price and l_km is the natural log of km.
Write down the estimated regression function and interpret the estimated slope coefficient (use
either causal or descriptive interpretation). Next, interpret the R-squared.
20. Using a dataset on 450 manufacturing companies, you want to estimate the elasticity of sales (in
000,000s EUR) with respect to labour (number of employees). Write down the simple regression
equation you are about to estimate.
21. Explain the term sampling distribution of an estimator. (Consider using as an example the OLS
estimator of β1 in simple regression.)
22. Give an example of variables x and y such that the random error in equation y = β0 + β1x + u is
heteroskedastic. Explain why you expect the presence of heteroskedasticity.
23. Consider the usual simple regression equation y = β0 + β1x + u. Give an example of variables x and
y such that var(u | x) is a decreasing function of x.
24. Consider a linear regression model that satisfies assumptions SLR.1–SLR.5. How would you
estimate the variance of random errors? Is your estimator unbiased?
25. Explain the term standard error of β̂1.
26. What exactly is the standardized (or Studentized) estimator β̂1? What sampling distribution does
it have under the assumptions SLR.1–SLR.6?
27. Using data on 100 000 respondents, we estimated a simple regression by OLS. The value of the
standard error for the slope coefficient is 4.2. What value (approximately) will you expect for the
standard error if we re-estimate the same equation using a random subsample of 1000 respondents
only? Hint: We have discussed this in the lectures, even though it is not explicitly contained in the
slides. In the formula for the standard error of the slope coefficient, think about how the
denominator changes if we change the number of observations.
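The idea in the hint can be sketched as a rule of thumb. The helper below assumes that the estimated error variance and the sample variance of x stay roughly unchanged in the subsample, so that the standard error is proportional to 1/√n; the function name and the numbers are hypothetical, not those from the question.

```python
import math

def rescaled_standard_error(se_old, n_old, n_new):
    """Approximate slope standard error after changing the sample size.

    Assumption: the error variance estimate and the sample variance of x
    are about the same in both samples, so se scales with 1 / sqrt(n).
    """
    return se_old * math.sqrt(n_old / n_new)

# Hypothetical numbers: se = 0.5 with 40,000 observations, subsample of 400.
se_subsample = rescaled_standard_error(se_old=0.5, n_old=40_000, n_new=400)
```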
28. Formulate the MLR.4 assumption.
29. Formulate the MLR.5 assumption.
30. Formulate the MLR.6 assumption.
31. Explain the term classical linear regression model.
32. In a regression model that satisfies assumptions MLR.1–MLR.4, the random errors are heavily
heteroskedastic. Can this make the OLS estimator biased?
33. After the estimation of y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + u using data on n = 21 respondents, we
obtained R² = 0.12. Calculate the adjusted R-squared.
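A minimal sketch of the usual adjusted R-squared formula, 1 − (1 − R²)(n − 1)/(n − k − 1), where k is the number of explanatory variables (excluding the intercept). The function name and the example values are hypothetical and differ from those in the question.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R-squared for a model with an intercept and n_regressors slopes."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Hypothetical examples.
adj_a = adjusted_r_squared(0.30, n_obs=51, n_regressors=2)  # slightly below 0.30
adj_b = adjusted_r_squared(0.02, n_obs=11, n_regressors=2)  # negative
```

The second example illustrates that the adjusted R-squared never exceeds R² and can fall below zero when the fit is weak relative to the number of regressors.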
34. Which of the following can possibly happen after an OLS estimation of a linear regression? For
each list item, give a yes/no answer (can/cannot happen); for each negative answer, provide a brief
explanation.
a) R² = 0.50, and the sample correlation of y and ŷ equals 0.68.
b) R² = 1.25, R̄² = 1.20 (adjusted R²).
c) R² = 0.01, R̄² = 0.02 (adjusted R²).
35. The variables x1 and x2 are heavily correlated, their population correlation being 0.95. Can this
cause a bias in the OLS-estimator of the regression coefficients in the equation y = β0 + β1x1 + β2x2
+ u? Explain.
36. Using OLS, we estimated the equation ŷ = 3.4 + 5.2x. Next, we use the same data file to run another
regression that contains an additional explanatory variable z (in addition to x, which has been
retained from the previous model). The sample correlation between x and z is zero. Fill in the
missing value in the estimated equation: ŷ = 1.5 + __ x + 3.6z. Explain.
37. What is the bias-variance tradeoff in multiple regression (in the context of estimating the partial
effect of x on y)?
38. The following regression output was obtained in Gretl using a cross-sectional data set containing
the variables wage (a respondent’s average hourly earnings in USD) and educ (years of education).
What is the approximate width of the 95% confidence interval for the intercept? Show and comment
on your calculations.
In your proof, you can freely use our results about the distribution of the standardized estimator
of the regression coefficient βj.
41. What exactly is the p-value of a t-test about the regression parameter βj?
42. The following regression output was obtained in Gretl using a cross-sectional data set containing
the variables wage (a respondent’s average hourly earnings in USD), exper (years of work
experience) and educ (years of education).
What is the p-value of a two-tailed test with the null H0: βeduc = 0? Show your work.
43. The following regression output was obtained in Gretl using a cross-sectional data set containing
the variables wage (a respondent’s average hourly earnings in USD) and educ (years of education).
What is the approximate width of the 95% confidence interval for the intercept? Show and comment
on your calculations.
(a) Find the 95% confidence interval for βeduc; use the approximate calculation.
(b) Assuming that the random error is normally distributed and statistically independent of educ,
how would you obtain the exact 95% confidence interval?
44. The following regression output was obtained in Gretl using cross-sectional data on young
employed men containing the variables lwage (log of respondents’ average hourly earnings in
USD), col2year (years completed at a 2-year college), col4year (years completed at a 4-year
college), age (in years) and married (an indicator of marital status, = 1 if the respondent is married):
Your aim is to test whether the return to education is the same for years spent at 2-year and at 4-
year colleges. Formulate (formally) the null hypothesis and briefly describe the statistical test you
would use.
45. In the output shown below, the logarithmic price of a house (l_price) is being explained by several
observed characteristics of the house. Test the null hypothesis H0: βcolonial = 0 against an alternative
implying that houses built in the colonial style (colonial = 1) are pricier than others (colonial = 0).
46. The following regression output was obtained in Gretl using cross-sectional data on young
employed men containing the variables lwage (log of respondents’ average hourly earnings in
USD), col2year (years completed at a 2-year college), col4year (years completed at a 4-year
college), age (in years) and married (an indicator of marital status, = 1 if the respondent is married).
Is the partial effect of age significant at the conventional 5% significance level? Report the formal
statement of the null hypothesis and the conclusion of the test, including underlying calculations.
47. What is the impact of heteroskedasticity on the OLS estimator in linear regression? What are the
implications for statistical inference?
48. Briefly describe the Breusch-Pagan and the White heteroskedasticity tests. What are the auxiliary
regressions run in these tests? What is the null hypothesis of the test?
49. The following regression output was obtained in Gretl using a cross-sectional data set on young
employed men containing the variables lwage (log of respondents’ average hourly earnings in
USD), col2year (years completed at a 2-year college), col4year (years completed at a 4-year
college), age (in years) and married (an indicator of marital status, = 1 if the respondent is married):
According to the estimated equation, what is the exact partial effect of marital status on wage?
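For a model with a logged dependent variable, the exact percentage effect of an indicator variable is 100·(exp(β̂) − 1), while 100·β̂ is only an approximation. The sketch below compares the two for a hypothetical coefficient value; the function name is an assumption.

```python
import math

def exact_percent_effect(coef):
    """Exact percentage change in y implied by a coefficient in a log(y) model."""
    return 100.0 * (math.exp(coef) - 1.0)

coef_dummy = 0.10                                # hypothetical coefficient on an indicator
approx_effect = 100.0 * coef_dummy               # approximate effect in percent
exact_effect = exact_percent_effect(coef_dummy)  # exact effect in percent
```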
50. The following regression output was obtained in Gretl using a cross-sectional data set containing
(among others) the variables l_wage (log of respondents’ wage), age (in years), and sq_age = age².
Calculate the turning point of the estimated relationship between age and wage. Does the
relationship have the “u” shape or the “inverted u” shape?
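For a quadratic specification such as l_wage = b0 + b1·age + b2·sq_age, the turning point lies at −b1/(2·b2), and the sign of b2 determines the shape. The sketch below uses hypothetical coefficient values, since the regression output is not reproduced here; the function names are assumptions.

```python
def turning_point(coef_linear, coef_squared):
    """Value of x at which b1*x + b2*x**2 reaches its extremum."""
    return -coef_linear / (2.0 * coef_squared)

def curve_shape(coef_squared):
    """A negative squared term gives an inverted u, a positive one gives a u."""
    return "inverted u" if coef_squared < 0 else "u"

# Hypothetical coefficients on age and sq_age.
b_age, b_sq_age = 0.08, -0.001
age_star = turning_point(b_age, b_sq_age)
shape = curve_shape(b_sq_age)
```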
51. The following regression output was obtained in Gretl using a cross-sectional data set containing
(among others) the variables l_wage (log of respondents’ wage), age (in years), and sq_age = age².
Calculate the relative change in the wage between the 35th and the 36th birthday, as implied by the
estimated equation. (Feel free to use the approximative interpretation of the regression coefficients
in a log-level model.)
52. The following regression output was obtained in Gretl using a cross-sectional data set on 320 used
cars containing the variables price (price in CZK), km (kilometres travelled), age (in years), diesel
(= 1 for cars running on diesel, = 0 for cars running on petrol; no other fuel types are present in the
dataset). Additional variables were created via simple transforms of the existing variables: l_price
= log(price), l_km = log(km), ageXdiesel = age × diesel. Your goal is to investigate whether the
depreciation rate (loss of value with age) is lower for diesel cars than for petrol cars. What test will
you use to find out? What is the conclusion?
53. The following regression output was obtained in Gretl using a cross-sectional data set on 320 used
cars containing the variables price (price in CZK), km (kilometres travelled), age (in years), diesel
(= 1 for cars running on diesel, = 0 for cars running on petrol; no other fuel types are present in the
dataset). Additional variables were created via simple transforms of the existing variables: l_price
= log(price), l_km = log(km), ageXdiesel = age × diesel. What is the predicted difference between
the price of a diesel car and a petrol car, both 5 years old and with the same km on the clock?
exper 13.216
sq_exper 13.493
educ 1.867
female 22.899
femaleXeduc 22.869
nonwhite 1.013
smsa 1.059
57. The following regression output was obtained in Gretl using a cross-sectional data set on 327 used
cars (Škodas) containing the variables price (price of a used car in CZK), km (kilometres travelled),
age (age in years), and a categorical variable model with three levels: felicia, octavia and superb.
The variable l_price is the natural log of price. Predict the price of a used Škoda Felicia that is
10 years old and has travelled 100,000 km.
58. The following regression output was obtained in Gretl using a cross-sectional data set on 327 used
cars (Škodas) containing the variables price (price of a used car in CZK), km (kilometres travelled),
age (age in years), and a categorical variable model with three levels: felicia, octavia and superb.
Additional variables have been created using the formulas km_100000 = km – 100000 and age_10
= age – 10. Find the 95% prediction interval for the price of a used Škoda Felicia that is 10 years
old and has travelled 100,000 km.
59. In a time-series regression, we often need to remove a long-term (linear) trend from the dependent
variable. Describe the procedure you would use to obtain the detrended version of a time series
{yt : t = 1, …, n}.
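One common procedure is to regress the series on a time index t = 1, …, n and keep the residuals. The sketch below implements this in plain Python; the function name detrend and the data are hypothetical.

```python
def detrend(series):
    """Residuals from the OLS regression of the series on time = 1, ..., n."""
    n = len(series)
    time = list(range(1, n + 1))
    mean_t = sum(time) / n
    mean_y = sum(series) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(time, series))
             / sum((t - mean_t) ** 2 for t in time))
    intercept = mean_y - slope * mean_t
    # The detrended series is what remains after subtracting the fitted trend.
    return [y - (intercept + slope * t) for t, y in zip(time, series)]

# Hypothetical series with an upward trend.
y = [3.0, 5.5, 6.0, 9.5, 10.0, 13.5]
y_detrended = detrend(y)
```

By construction, the detrended series has zero sample mean and zero sample correlation with the time index.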
60. The following regression output contains the variables x and y (time series for the dependent and
the explanatory variable), and their detrended versions x_detrended and y_detrended (obtained as
the residuals from the regression of the original series on time, like the salmon and gdp variables
from the lecture).
a) Fill in the blanks in the output below.
b) Which of the two R²s presented below would you use to quantify the strength of the relationship
between x and y? Justify your choice.
Model 1:
^y_detrended = 1.73*x_detrended
              (0.218)
Model 2:
^y = 1.46 + ________*x + 0.000550*time
    (1.30)  (________)   (0.00101)
61. In order to analyze the long-term trend and seasonality of the series {yt : t = 1, …, n}, we ran a
regression in Gretl, the output of which is shown below. The time variable is the sequence 1 through
n and dqj is an indicator (dummy) variable for quarter j. What statistical test would you use to find
out if yt exhibits a significant seasonal pattern? Formulate the null hypothesis of the test and give
the distribution of the test statistic under H0 (no formulas needed, just report the name of the
distribution, along with its parameters).
62. In the regression output from Gretl below, l_GNP is the log of GNP in Sweden (in billions of
current dollars), time is the sequence 1 through 58. Interpret the coefficient on time and explain
why this regression is said to describe an exponential trend.
63. In the regression output from Gretl below, qgdp is the quarterly GDP in the U.S. (in billion USD);
l_qgdp = log(qgdp); time is the sequence 1 through 258 and dqj is an indicator (dummy) variable for
quarter j. What is the average annual percentage growth of GDP, as implied by the equation?
Explain your answer.
64. In a regression with a finite distributed lag (FDL), e.g. yt = β0 + δ0xt + δ1xt−1 + δ2xt−2 + δ3xt−3 + ut, we
often observe very wide confidence intervals for the marginal effects of a unit impulse in xt
(temporary unit change) on yt, yt+1, yt+2, and yt+3. However, the long-run propensity is typically
estimated much more accurately (its confidence interval is much narrower). Why is it so?
65. Consider a regression with a finite distributed lag (FDL), yt = β0 + δ0xt + δ1xt−1 + δ2xt−2 + δ3xt−3 + ut.
a) What is the impact propensity and the long-run propensity (LRP) in this model?
b) If we need to obtain the 95% confidence interval for the LRP, we can use a suitable transform
of the explanatory variables that makes the LRP emerge as one of the model coefficients.
Describe this transform and demonstrate that it works.
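One possible reparametrization is sketched below. The symbol θ for the LRP is an assumption introduced here for illustration; the lectures may use different notation.

```latex
\begin{align*}
\theta &= \delta_0 + \delta_1 + \delta_2 + \delta_3
  \quad\Longrightarrow\quad
  \delta_0 = \theta - \delta_1 - \delta_2 - \delta_3, \\
y_t &= \beta_0 + (\theta - \delta_1 - \delta_2 - \delta_3)\,x_t
      + \delta_1 x_{t-1} + \delta_2 x_{t-2} + \delta_3 x_{t-3} + u_t \\
    &= \beta_0 + \theta\,x_t
      + \delta_1 (x_{t-1} - x_t)
      + \delta_2 (x_{t-2} - x_t)
      + \delta_3 (x_{t-3} - x_t) + u_t .
\end{align*}
```

Regressing yt on xt and the three differences (xt−1 − xt), (xt−2 − xt), (xt−3 − xt) therefore yields θ as the coefficient on xt, together with its standard error.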
66. Sketch a plot of the lag distribution for the equation yt = 1.5 + 0.7xt + 1.3xt−1 + 0.5xt−2 + 1.5xt−3.
67. Calculate the long-run propensity in the FDL equation yt = 1.5 + 0.7xt + 1.3xt−1 + 0.5xt−2 + 1.5xt−3
and interpret its value. (Assume that xt and yt are annual time series.)
68. Formulate assumptions TS.1–TS.3 for regression with time series. Explain why TS.3 is likely
violated in a model that explains the number of monthly car thefts per capita in a particular city (yt)
with the average monthly number of policemen per capita (xt).
69. Describe the random processes denoted as AR(1) and MA(1); for each, give a full name and a
mathematical definition.
70. Give an example of a weakly dependent and a strongly dependent random process.
71. Explain the term impulse response curve. As an example, draw a plot of an impulse response curve
for the AR(1) process.
72. Formulate the assumptions about homoskedasticity and no serial correlation in a regression with
time series. Use the version of assumptions that relies on strict exogeneity of regressors.
73. Formulate assumptions TS.1’–TS.3’ for regression with time series. Why do we need assumptions
about stationarity and weak dependence? (A short and non-technical answer is expected for the
latter question.)