Regression Analysis - Introduction SLR
Tintner (1953)
“The Study of the Application of Statistical Methods
to the Analysis of Economic Phenomena”
Business analysts often need to be in a
position to:
Components of Regression Analysis
Regression analysis involves four phases:
Specification, Estimation, Verification, Prediction
Getting Started
Regression Analysis Begins with Model Specification
- Model specification entails the expression of theoretical
constructs in mathematical terms
- This phase of regression analysis constitutes the model building
activity
- In essence, model specification is the translation of theoretical
constructs into mathematical/statistical forms
- Fundamental principles in model building:
* The principle of parsimony (other things the same, simple
models generally are preferable to complex
models, especially in forecasting)
* The shrinkage principle (imposing restrictions either on
estimated parameters or on forecasts often
improves model performance)
* The KISS principle: “Keep It Sophisticatedly Simple”
The Simple Linear Regression Model

y = \beta_0 + \beta_1 x + u

Coefficients: \beta_0 (intercept) and \beta_1 (slope).
Error term u: also called the disturbance term or the innovation.

The error term u makes explicit that the
relationship between y and x is not an
identity; u arises for two reasons:
(1) measurement error
(2) the regression inadvertently omits the effects
of other variables besides x that could impact y.
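As an illustrative sketch (not part of the original slides; the parameter values are made up), the simple linear regression model can be simulated directly: draw x and the error term u, then form y.

```python
import random

random.seed(42)

beta0, beta1 = 2.0, 0.5   # hypothetical population intercept and slope
n = 1000

# Draw x and the error term u, then form y = beta0 + beta1*x + u.
# u bundles measurement error and omitted influences on y.
xs = [random.uniform(0, 10) for _ in range(n)]
us = [random.gauss(0, 1) for _ in range(n)]
ys = [beta0 + beta1 * x + u for x, u in zip(xs, us)]

# Because E(u) = 0, the sample average of u should be near zero.
print(len(ys), sum(us) / n)
```

The key point the sketch makes is that y is not an exact function of x: two observations with the same x generally have different y values because of u.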
Graphical Illustration

[Figure: scatter of (x, y) points around the population regression line E(y|x) = \beta_0 + \beta_1 x, with intercept \beta_0 and slope \beta_1; the dependent variable Y is on the vertical axis and the independent variable X on the horizontal axis.]
How to interpret coefficients?
Example: Estimation of Demand Relationships
• Randomly sample n observations from a population
• For each observation, record the dependent variable and the independent (exogenous) variable
From Sampling to Inference
- Sampling: draw a sample (data) from the population
- Descriptive statistics: get a feel for the data
- Regression (OLS): assumptions, properties of OLS estimates, interpretation of estimates
- Inference on the sample parameters: t tests, F tests, confidence intervals
Data Types:
-Time-Series
-Cross-Sectional
Data—The Critical Ingredient
The critical ingredient is data (a sample that is “sufficiently large”)
- Time-series data
* daily, weekly, monthly, quarterly, annual
DAILY – closing stock prices
WEEKLY – measures of money
supply
MONTHLY – housing starts
QUARTERLY – GDP figures
ANNUAL – salary figures
- Cross-Sectional Data
* Snapshot of activity at a given point in
time
* Survey of household expenditure patterns
Quote from Lord Kelvin
“I often say that when you can measure what you are speaking
about, and express it in numbers, you know something about
it; but when you cannot measure it, when you cannot express
it in numbers, your knowledge is of a meager and
unsatisfactory kind.”
Get a Feel for the Data
- Plots of key variables
- Scatter plots
- Descriptive statistics
  • mean
  • median
  • standard deviation
  • minimum
  • maximum
  • skewness
  • kurtosis
  • distribution
Descriptive Statistics

Let X = (x_1, x_2, \ldots, x_T)' correspond to a vector of T observations on the variable X.
Mean
The mean, a measure of central tendency, corresponds to
the average of the set of observations corresponding to a particular
data series. The mean is given by

\bar{x} = \frac{\sum_{i=1}^{T} x_i}{T}.

The units associated with the mean are the same as the units of x_i,
i = 1, 2, …, T.
continued . . .
Median
The median also is a measure of central tendency of a data
series. The median corresponds to the 50th percentile of the data
series. The units associated with the median are the same as the
units of xi, i = 1, 2, …, T. To find the median, arrange the
observations in increasing order. When sample values are
arranged in this fashion, they often are called the 1st, 2nd, 3rd …
order statistics. In general, if T is an odd number, the median is the
order statistic whose number is (T+1)/2; if T is even, the median is
the average of the two middle order statistics.
continued . . .
Standard Deviation
The standard deviation is a measure of the spread or
dispersion of a series about the mean. The standard deviation is given by

S = \left[ \frac{\sum_{i=1}^{T} (x_i - \bar{x})^2}{T - 1} \right]^{1/2}.

The units associated with the standard deviation are the same as
the units of x_i.
Variance
The variance also is a measure of the spread or dispersion of
a series about the mean. The variance is expressed as

\hat{\sigma}^2 = \frac{\sum_{i=1}^{T} (x_i - \bar{x})^2}{T - 1}.

Note that \hat{\sigma}^2 = S^2.
The units associated with the variance are the square of the units of
x_i. continued . . .
Minimum
The minimum corresponds to the smallest value,
min(x_1, x_2, …, x_T). The units associated with the minimum are the
same as the units of x_i.

Maximum
The maximum corresponds to the largest value,
max(x_1, x_2, …, x_T). The units associated with the maximum are the
same as the units of x_i.

Skewness
Skewness is a measure of the asymmetry of the distribution
of a series about its mean. The skewness coefficient is given by

\hat{m} = \frac{\frac{1}{T} \sum_{i=1}^{T} (x_i - \bar{x})^3}{S^3}.

The skewness statistic is a unitless measure.
continued . . .
Kurtosis
Kurtosis is a measure of the flatness or peakedness of the
distribution of a series relative to that of a normal distribution. A
normal random variable has a kurtosis of 3. A kurtosis statistic
greater than 3 indicates a more peaked distribution than the
normal distribution; a kurtosis statistic less than 3 indicates a
flatter distribution than the normal distribution. The kurtosis
coefficient is given by

\hat{k} = \frac{\frac{1}{T} \sum_{i=1}^{T} (x_i - \bar{x})^4}{S^4}.

continued . . .
Coefficient of Variation
The coefficient of variation is the ratio of the standard
deviation of a series to its mean. This measure typically is converted to a
percentage by multiplying the ratio by 100. This statistic
describes how much dispersion exists in a series relative to its
mean. This measure is given by

CV = \frac{S}{\bar{x}} \times 100\%.

The utility of this information is that in most cases the mean and
the standard deviation change together. As well, this statistic is
not dependent on units of measurement.
Correlation Coefficient
The correlation coefficient is a measure of the degree of
linear association between two variables. The statistic, denoted by r,
is given by

r = \frac{\sum_{i=1}^{T} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{T} (x_i - \bar{x})^2 \sum_{i=1}^{T} (y_i - \bar{y})^2}}.
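As an illustrative sketch (not part of the original slides), the descriptive statistics defined above translate directly into Python. Note the conventions used here match the slides: T − 1 in the denominator of S, and T in the skewness and kurtosis numerators.

```python
import math

def descriptive_stats(xs):
    """Descriptive statistics as defined above (T-1 in S; T in skew/kurt)."""
    T = len(xs)
    mean = sum(xs) / T
    srt = sorted(xs)
    # Median: middle order statistic if T is odd, average of the two middle ones if even
    median = srt[T // 2] if T % 2 else (srt[T // 2 - 1] + srt[T // 2]) / 2
    S = math.sqrt(sum((x - mean) ** 2 for x in xs) / (T - 1))
    skew = (sum((x - mean) ** 3 for x in xs) / T) / S ** 3
    kurt = (sum((x - mean) ** 4 for x in xs) / T) / S ** 4
    cv = S / mean * 100          # coefficient of variation, in percent
    return {"mean": mean, "median": median, "sd": S, "var": S ** 2,
            "min": min(xs), "max": max(xs), "skew": skew, "kurt": kurt, "cv": cv}

def correlation(xs, ys):
    """Correlation coefficient r between two series of equal length."""
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - xbar) ** 2 for x in xs) *
                    sum((y - ybar) ** 2 for y in ys))
    return num / den

stats = descriptive_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(stats["mean"], stats["median"])   # 5.0 4.5
print(correlation([1, 2, 3], [2, 4, 6]))  # 1.0
```

A perfectly linear relationship, as in the last line, yields r = 1; r near 0 indicates little linear association.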
[Figure: scatter of sample points (x_i, y_i) with the fitted regression line; the vertical distances \hat{u}_i between each observed y_i and the line are the prediction errors.]
Intuitive Thinking about OLS
- OLS fits a line through the sample points such that the
sum of squared prediction errors is as small as possible,
hence the term least squares.
- The prediction error (residual) is \hat{u}_i = AV_i - FV_i
(actual value minus fitted value), i = 1, 2, . . ., n.
- Minimizing \sum \hat{u}_i^2 yields the first-order conditions

-2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0
-2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0

- Solving these conditions gives

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.
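As an illustrative sketch (not part of the original slides), the closed-form OLS solution is a few lines of Python:

```python
def ols(xs, ys):
    """Closed-form OLS estimates for y = b0 + b1*x + u."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    # Slope: sum of cross-products over sum of squared deviations of x
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    # Intercept: forces the fitted line through the point of means
    b0 = ybar - b1 * xbar
    return b0, b1

# Data generated from the exact line y = 1 + 2x is recovered perfectly
b0, b1 = ols([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```

Note the requirement implicit in the slope formula: the x values must not all be identical, or the denominator is zero.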
y_i = \beta_0 + \beta_1 x_i + u_i

Assumption 1: Zero Mean of u
E(u) = 0: The average value of the error term u is 0.

Assumption 2: Independent Error Terms
Each u_i is independent of all other u_j:
Corr(u_i, u_j) = 0 for all i \neq j.

Assumption 3: Homoscedasticity
Var(u|x) = \sigma^2: the variance of the regression error is constant.
[Figure: conditional distributions f(y|x) at x_1, x_2, x_3; under homoscedasticity each distribution has the same spread \sigma^2 around the regression line.]
Assumption 4: Normality
-The error term u is normally distributed with mean zero and
variance σ².
-This assumption is essential for inference and forecasting.
-This assumption is not essential to estimate the parameters of
the regression model.
-We only need assumptions 1-3 to derive the OLS estimators
-OLS → ordinary least squares
Properties of OLS Estimators
Unbiasedness: on average, the OLS estimators equal the
true population parameters:

E(\hat{\beta}_1) = \beta_1
E(\hat{\beta}_0) = \beta_0
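Unbiasedness can be illustrated with a small Monte Carlo sketch (the parameter values here are hypothetical, chosen only for the demonstration): repeatedly draw samples from a known model, estimate by OLS each time, and check that the estimates average out near the truth.

```python
import random

random.seed(0)
beta0, beta1 = 1.0, 2.0   # true (hypothetical) population parameters
n, reps = 50, 500

def ols(xs, ys):
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    return ybar - b1 * xbar, b1

b0s, b1s = [], []
for _ in range(reps):
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [beta0 + beta1 * x + random.gauss(0, 1) for x in xs]
    b0, b1 = ols(xs, ys)
    b0s.append(b0)
    b1s.append(b1)

# Each individual estimate varies, but the averages across
# replications sit close to the true parameters (unbiasedness).
print(sum(b0s) / reps, sum(b1s) / reps)
```

Any single sample can give an estimate far from the truth; unbiasedness is a statement about the center of the sampling distribution, not about any one draw.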
Variance of OLS Estimators
-We know that the sampling distribution of our estimate
is centered around the true parameter (unbiasedness).
-The residual can be written as
\hat{u}_i = u_i - (\hat{\beta}_0 - \beta_0) - (\hat{\beta}_1 - \beta_1) x_i.
-Then an unbiased estimator of \sigma^2 is

\hat{\sigma}^2 = \frac{1}{n-2} \sum \hat{u}_i^2 = SSE / (n - 2).

* Note: SSE is the residual or error sum of squares
and (n-2) is the degrees-of-freedom.
Standard Error of OLS Estimates
\hat{\sigma} = \sqrt{\hat{\sigma}^2} is the standard error of the regression.

se(\hat{\beta}_1) = \frac{\hat{\sigma}}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^{1/2}}

se(\hat{\beta}_0) = \hat{\sigma} \left[ \frac{n^{-1} \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]^{1/2}
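As an illustrative sketch (not part of the original slides), the estimator \hat{\sigma}^2 = SSE/(n-2) and the standard-error formulas above can be computed alongside the OLS estimates:

```python
import math

def ols_with_se(xs, ys):
    """OLS estimates plus sigma-hat and the standard errors defined above."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    # Residual sum of squares, then sigma-hat with n-2 degrees of freedom
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    sigma_hat = math.sqrt(sse / (n - 2))  # standard error of the regression
    se_b1 = sigma_hat / math.sqrt(sxx)
    se_b0 = sigma_hat * math.sqrt(sum(x ** 2 for x in xs) / (n * sxx))
    return b0, b1, sigma_hat, se_b0, se_b1

# Data lying exactly on y = 1 + 2x: SSE = 0, so all standard errors are 0
print(ols_with_se([1, 2, 3, 4], [3, 5, 7, 9]))
```

With noisy data the standard errors are positive and shrink as the sample size grows or as the x values spread out, since both increase the denominator Σ(x_i − x̄)².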
Gauss-Markov Theorem
Under the following assumptions, the OLS procedure produces unbiased estimates
of the regression model population parameters:

E(\hat{\beta}_0) = \beta_0 and E(\hat{\beta}_1) = \beta_1

Assumptions:
(1) The model is linear in parameters, e.g.
y_i = \beta_0 + \beta_1 x_i + u_i
\ln y_i = c_0 + c_1 \ln x_i + v_i
(2) E(u_i) = 0
(3) Corr(u_i, u_j) = 0 for i ≠ j
(4) E(u_i^2) = \sigma^2 for all i (homoscedasticity)
(5) The sample outcomes on x (x_i, i = 1, 2, …, n) are not all the
same value.
Also, in the class of linear unbiased estimators, the OLS estimator is best (in the sense of
providing the minimum variance).
OLS estimators are BLUE! (Best Linear Unbiased Estimators)
Goodness-of-Fit: Some Terminology
Write y_i = \hat{y}_i + \hat{u}_i. We then define the following:

\sum (y_i - \bar{y})^2 is the total sum of squares (SST)
\sum (\hat{y}_i - \bar{y})^2 is the regression sum of squares (SSR)
\sum \hat{u}_i^2 is the residual or error sum of squares (SSE)

-Goodness-of-fit: how well does the simple regression line fit the
sample data?
-Calculate R^2 = SSR/SST = 1 - SSE/SST
continued . . .
Goodness-of-Fit
-Concept: measures the proportion of the variation in the
dependent variable explained by the regression equation.
-Formula:

R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}

where the denominator is the total sample variability (SST).

-Adjusted R²: \bar{R}^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}
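As an illustrative sketch (not part of the original slides), R² and adjusted R² follow directly from the sums of squares; the numbers in the example are made up.

```python
def goodness_of_fit(ys, yhats, k):
    """R-squared and adjusted R-squared for a model with k slope coefficients."""
    n = len(ys)
    ybar = sum(ys) / n
    sst = sum((y - ybar) ** 2 for y in ys)                       # total sum of squares
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhats))         # error sum of squares
    r2 = 1 - sse / sst
    # Adjusted R-squared penalizes additional regressors via degrees of freedom
    adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
    return r2, adj_r2

# Hypothetical actual and fitted values from a simple regression (k = 1)
r2, adj_r2 = goodness_of_fit([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8], k=1)
print(r2, adj_r2)  # 0.98 0.97
```

The sketch shows why adjusted R² is never larger than R²: for k ≥ 1 the SSE term is scaled up by (n − 1)/(n − k − 1) ≥ 1, so adding a regressor raises adjusted R² only if it reduces SSE by enough to offset the lost degree of freedom.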
Questions:
(a) Why do we care about the adjusted R²?
(b) Is adjusted R² always better than R²?
(c) What’s the relationship between R² and adjusted R²?
[Regression output: estimate of residual variance, p-values, and goodness-of-fit statistics.]
Simple Linear Regression of RGDP
on Total U.S. Personal Bankruptcies

\hat{\sigma}^2 = SSE/92 = 11,651,319,045
se(\hat{\beta}_0) = 53,414     R^2 = 0.9367
se(\hat{\beta}_1) = 7.02259    \bar{R}^2 = 0.9360

-Goodness-of-fit
(a) R²
(b) adjusted R²
Coming Attractions