Econometrics - Exercise set 1 (solution)
Exercise 1
Answer the following questions and state the differences between simple- and multiple linear regression:
1) State the simple linear regression model and explain the variables
True model: 𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖
Estimated model: 𝑦𝑖 = 𝑏0 + 𝑏1 𝑥𝑖 + 𝑒𝑖
Where:
• 𝑦𝑖 = outcome for observation 𝑖
• 𝛽0 (estimated by 𝑏0) = intercept
• 𝛽1 (estimated by 𝑏1) = slope: the expected change in 𝑦 from a one-unit change in 𝑥
• 𝑥𝑖 = explanatory variable for each observation 𝑖
• 𝜀𝑖 (estimated by the residual 𝑒𝑖) = error term (variation in 𝑦 that cannot be explained by the model)
2) State the multiple linear regression model and explain the variables
True model (matrix notation): 𝑦 = 𝑋𝛽 + 𝜀
Estimated model: 𝑦 = 𝑋𝑏 + 𝑒
Where:
• 𝑦 = 𝑛 × 1 vector of outcomes
• 𝑏 = 𝑘 × 1 vector of parameter estimates, 𝑏 = (𝑏1, 𝑏2, … , 𝑏𝑘)′, estimating 𝛽 = (𝛽1, 𝛽2, … , 𝛽𝑘)′
• 𝑋 = 𝑛 × 𝑘 matrix containing the values of each explanatory variable (incl. a column of ones for the intercept):

    ( 1  𝑥21  ⋯  𝑥𝑘1
      1  𝑥22  ⋯  𝑥𝑘2
      ⋮   ⋮   ⋱   ⋮
      1  𝑥2𝑛  ⋯  𝑥𝑘𝑛 )
• 𝑒 = residual term, 𝑛 × 1 vector (column vector)

Homoskedasticity: the variance of the error term is the same across all observations, 𝑣𝑎𝑟(𝜀𝑖) = 𝜎², giving the covariance matrix:

    𝑣𝑎𝑟(𝜀) = 𝜎²𝐼𝑛 = ( 𝜎²  0  ⋯  0
                      0  𝜎²  ⋯  0
                      ⋮   ⋮   ⋱  ⋮
                      0   0  ⋯  𝜎² )
Heteroskedasticity: the variance of the error term differs across the values of one (or more) explanatory variable(s), 𝑣𝑎𝑟(𝜀𝑖) = 𝜎𝑖², with 𝜎𝑖² ≠ 𝜎𝑗²:

    𝑣𝑎𝑟(𝜀) = ( 𝜎1²  0  ⋯  0
               0  𝜎2²  ⋯  0
               ⋮   ⋮   ⋱  ⋮
               0   0  ⋯  𝜎𝑛² )
Exogenous: A variable which is not determined within the model (e.g. explanatory variables). By
assumption of strict exogeneity 𝐸[𝜀𝑖 |𝑋] = 0.
Endogenous: A variable determined within the model (e.g. the outcome variable, or other economic
explanatory variables such as years of education).
Exercise 2
State the seven Gauss-Markov assumptions and give a brief explanation of each assumption.
Assumption 1: Fixed regressors. All explanatory variables are fixed (non-stochastic), 𝑟𝑎𝑛𝑘(𝑋) = 𝑘 ≤ 𝑛, and there is no perfect multicollinearity.
Assumption 2: Random disturbances, zero mean. The error term is randomly distributed with mean = 0.
Hence, 𝐸[𝜖𝑖 |𝑋] = 0.
Assumption 3: Homoskedasticity. All disturbances have the same variance, 𝑣𝑎𝑟(𝜀𝑖) = 𝜎²:

    𝑣𝑎𝑟(𝜀) = 𝜎²𝐼𝑛 = ( 𝜎²  0  ⋯  0
                      0  𝜎²  ⋯  0
                      ⋮   ⋮   ⋱  ⋮
                      0   0  ⋯  𝜎² )
Assumption 4: No correlation. The off-diagonal elements of the covariance matrix of the disturbances (see above) are all equal to zero. In other words, 𝑐𝑜𝑣(𝜀𝑖, 𝜀𝑗) = 𝐸[𝜀𝑖𝜀𝑗] = 0 for 𝑖 ≠ 𝑗.
Assumption 5: Constant parameters. The elements of 𝛽 and 𝜎 are fixed unknown parameters, and 𝜎 > 0.
Assumption 6: Linear model. The outcome variable 𝑦 is a linear function of 𝛽 and 𝜀, and has been generated by the data generating process (DGP):
𝑦 = 𝑋𝛽 + 𝜀
Assumption 7: Normality. The disturbances are jointly normally distributed. Hence, 𝜖𝑖 ~ 𝑁(0, 𝜎 2 ) and the
sample is randomly drawn from the population.
Exercise 3
Give a brief explanation of the difference between the “True model” and the “estimated model”.
The true model contains the true parameter values (𝛽) that generate the outcome 𝑦. There is no way to know
the exact value of these parameters.
The true model encompasses the entire population, not only a sample of the population.
The estimated model is the model which minimizes the sum of squared residuals and is used to obtain 𝑏 - our estimate of 𝛽. We can compute 𝑏 using OLS, which gives us a best-fit line. The estimated model is based on a sample that is (ideally) representative of the population; the more representative the sample, the more precise our estimates.
Exercise 4
Given the multiple linear regression,
𝑦 = 𝑋𝛽 + 𝜀
Where 𝛽 is a vector of coefficients, 𝑋 is the matrix of explanatory variables (the design matrix), and 𝜀 is the error term with 𝜀 ~ 𝑁(0, 𝜎²𝐼).
The Ordinary Least Squares (OLS) function (also called our estimated model) is then given as,
𝑦 = 𝑋𝑏 + 𝑒
4.1
Derive the objective function and find the minimizer
𝑒 = 𝑦 − 𝑋𝑏

Our objective is to minimize the sum of squared residuals. Hence, our objective function is:

    𝑆(𝑏) = ∑ᵢ₌₁ⁿ (𝑦𝑖 − 𝑏1 − 𝑏2𝑥2𝑖 − 𝑏3𝑥3𝑖 − ⋯ − 𝑏𝑘𝑥𝑘𝑖)²
         = 𝑒′𝑒 = (𝑦 − 𝑋𝑏)′(𝑦 − 𝑋𝑏)
         = 𝑦′𝑦 − 𝑦′𝑋𝑏 − 𝑏′𝑋′𝑦 + 𝑏′𝑋′𝑋𝑏

Since 𝑦′𝑋𝑏 is a scalar, 𝑦′𝑋𝑏 = (𝑦′𝑋𝑏)′ = 𝑏′𝑋′𝑦, which we can use to rewrite the objective function as:

    𝑆(𝑏) = 𝑦′𝑦 − 2𝑏′𝑋′𝑦 + 𝑏′𝑋′𝑋𝑏
4.2
Show that 𝒃 = (𝑿′𝑿)⁻¹𝑿′𝒚 becomes the OLS estimate of 𝜷

Differentiate 𝑆(𝑏) with respect to 𝑏:

    ∂𝑆(𝑏)/∂𝑏 = ∂/∂𝑏 (𝑦′𝑦 − 2𝑏′𝑋′𝑦 + 𝑏′𝑋′𝑋𝑏) = −2𝑋′𝑦 + 2𝑋′𝑋𝑏

Then, solve the first-order condition ∂𝑆(𝑏)/∂𝑏 = 0:

    −2𝑋′𝑦 + 2𝑋′𝑋𝑏 = 0
    2𝑋′𝑋𝑏 = 2𝑋′𝑦
    𝑋′𝑋𝑏 = 𝑋′𝑦

By Assumption 1, 𝑋 has full column rank, so 𝑋′𝑋 is invertible. Premultiplying both sides by (𝑋′𝑿)⁻¹ gives:

    𝑏 = (𝑋′𝑋)⁻¹𝑋′𝑦
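As a quick numerical check, the formula can be applied directly in R; this is a minimal sketch with simulated data (all variable names are illustrative), and it reproduces the coefficients returned by lm():

```r
# Simulate a small dataset from the DGP y = X*beta + eps
set.seed(1)
n  <- 100
x2 <- rnorm(n)
x3 <- rnorm(n)
X  <- cbind(1, x2, x3)          # n x k design matrix (incl. intercept column)
beta <- c(2, 0.5, -1)
y  <- X %*% beta + rnorm(n)

# OLS by the formula b = (X'X)^(-1) X'y
b <- solve(t(X) %*% X) %*% t(X) %*% y

# The same estimates via R's built-in least-squares fit
coef(lm(y ~ x2 + x3))
```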
Exercise 5*
Imagine we want to estimate the wage for our future job given some observable factors. Econometrics gives us various ways of doing so. Here are some guidelines one might want to consider when building a model and estimating it with Ordinary Least Squares (OLS). These will be tested in the problems below.
- Analyzing the dataset. What factors are important and what do we want to estimate?
- Can we do better?
For the exercise, use the dataset called wage1.RData. To load the dataset into RStudio, use the following code:
load("wage1.RData")
5.1
What is the mean, variance and quantiles of the variable wage?
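These statistics can be obtained directly in R; a minimal sketch, assuming the loaded object is a data frame named wage1 containing a numeric variable wage:

```r
load("wage1.RData")      # loads the data frame (assumed to be named wage1)

mean(wage1$wage)         # sample mean of wage
var(wage1$wage)          # sample variance of wage
quantile(wage1$wage)     # quantiles: min, 25%, median, 75%, max
```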
5.2
Estimate Model 1 with OLS in RStudio. What do we observe with Model 1?
Interpretation:
• Educ: an additional year of education is expected to increase wages by 0.599 units - significant at the 1% level
• Exper: an additional year of experience is expected to increase wages by 0.022 units - note that this coefficient is only significant at a 10% significance level
• Tenure: an additional year of tenure is expected to increase wages by 0.169 units - highly significant
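Model 1 is not restated here, but based on the interpretation above it regresses wage on education, experience and tenure; a sketch in R (the specification and variable names educ, exper, tenure are assumed to follow the wage1 dataset):

```r
load("wage1.RData")

# Model 1 (assumed spec): wage on education, experience and tenure
model1 <- lm(wage ~ educ + exper + tenure, data = wage1)
summary(model1)          # coefficients, standard errors and significance stars
```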
5.3
Given the factors available in the dataset, your knowledge of econometrics and the estimates in 5.2, could we do better?
We could introduce additional (relevant) covariates. Based on our knowledge of economics, we could include
𝑒𝑥𝑝𝑒𝑟 2 as there is a non-linear relationship between wages and experience. Moreover, we could include
female and non-white because of gender- and racial earning gaps.
As the distribution of income is notoriously skewed, we also estimate an alternative to Model 1 where we log-transform the dependent variable.

Both proposed models outperform Model 1 in terms of adjusted R-squared.
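The two proposed extensions can be sketched in R as follows (assuming the dataset contains the dummy variables female and nonwhite, as in the standard wage1 data):

```r
load("wage1.RData")

# Model 2: add exper^2 plus gender and race dummies
model2 <- lm(wage ~ educ + exper + I(exper^2) + tenure + female + nonwhite,
             data = wage1)

# Model 3: same covariates, log-transformed dependent variable
model3 <- lm(log(wage) ~ educ + exper + I(exper^2) + tenure + female + nonwhite,
             data = wage1)

# Compare fit via adjusted R-squared
summary(model2)$adj.r.squared
summary(model3)$adj.r.squared
```

Note that comparing R-squared between the wage and log(wage) models is only indicative, since the dependent variables differ.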