
Bivariate Regression - Part I

I. Background. We have previously studied relationships between (a) a continuous
dependent variable and a categorical independent variable (T-Test, ANOVA); and (b) a
categorical dependent variable and a categorical independent variable (categorical data
analysis, or nonparametric tests).

We will now examine relationships between continuous dependent variables and
continuous independent variables.
The following typology may be helpful:

Indep var \ Dep var    Continuous          Discrete

Continuous             OLS Regression      Logistic Regression

Discrete               T-Test, ANOVA       Categorical Data Analysis

II. Regression.
A. OLS. With OLS (Ordinary Least Squares) Regression, we are interested in how
changes in one set of variables are related to changes in another set. That is, we want to describe
or estimate the value of one variable, called the dependent variable, on the basis of one or more
other variables, called independent variables.

Examples:

•  What is the relationship between education and income? For each year of education,
   how much does income increase (on average)?
•  What will be the rate of return on investment? For each dollar invested, how much
   will sales increase?
•  For a political candidate, how many votes will she get for each dollar she spends on
   advertising?

It is usually not the case that the independent variables will perfectly predict the values of
the dependent variable. For the most part, we are interested in determining the average
relationship between the dependent and independent variable. That is, we want to know
E(Y | X), i.e. for a particular value of X, what value, on average, do people have on Y?
For most regression problems, the average relationship between the dependent variable
(y) and the independent variable (x) is assumed to be linear. That is, the Population regression
line is

E(Y | X) = α + βX



E(Y | X) = the average value of Y for a given value of X

β = slope coefficient. This tells you how much a 1-unit increase in X changes the average
value of Y.

α = intercept. This is the point where the regression line crosses the Y axis, i.e. when X
= 0, E(Y | X) = α. (NOTE: Do not confuse this α with the α we use when specifying
significance levels!)
Graphically, we can show this as:

[Figure: the population regression line E(Y | X) = α + βX plotted against X]

Example. Suppose E(Y | X) = $5,000 + $1,500X, where Y = income and X = years of
education. This means that, on average, a person with no education makes $5,000 a year.
People with 12 years of education average $23,000, while those with 16 years of education
average $29,000.
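A quick way to verify this arithmetic, as a minimal Python sketch:

    # Population regression line from the example: E(Y | X) = 5000 + 1500*X
    alpha, beta = 5000, 1500
    for years in (0, 12, 16):
        print(years, alpha + beta * years)   # -> 5000, 23000, 29000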

Of course, not all people with 16 years of education will make $29,000. Some will make
more, some will make less. The score a particular individual has on Y can be written as:

yi = α + βxi + εi ; or,

yi = E(Y | xi) + εi

Here, εi is a random error term, or disturbance.
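To make the error term concrete, here is a minimal Python sketch that simulates data from
this model (the parameter values, sample size, and error standard deviation are all
hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, beta, sigma = 5000, 1500, 4000   # hypothetical population values
    x = rng.integers(8, 21, size=200)       # years of education, 8 through 20
    eps = rng.normal(0, sigma, size=200)    # disturbance term, with E(eps) = 0
    y = alpha + beta * x + eps              # y_i = alpha + beta*x_i + eps_i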


A scatter diagram reflects this:

[Figure: scatter of (xi, yi) points lying above and below the population regression line]



Note that, for any particular xi, some values of y lie above the regression line, some
below it.

B. Sample estimation. Of course, we don't know the values of the population
parameters. They must be estimated from sample data. The Sample regression line is:

Ŷ = a + bX ; or

Ŷ = α̂ + β̂X

and the Sample regression model is:

yi = a + bxi + ei = ŷi + ei ; or

yi = α̂ + β̂xi + ε̂i = ŷi + ε̂i

Question: How do we determine values for a and b?


Answer: We could just plot the values and draw a line by hand. But two people might draw
two different lines, and we would have no means of determining sampling error.

A better approach is to find values for a and b that keep the errors ei = yi − ŷi small.
The approach used to do this is called Ordinary Least Squares. With OLS, we choose a and b
so as to minimize

∑ei² = ∑(yi − ŷi)²

Through calculus, we can show that the best values are

b = [∑(xi − x̄)(yi − ȳ) / (N − 1)] / [∑(xi − x̄)² / (N − 1)] = sxy / sx²

a = ȳ − bx̄

(NOTE: Hayes offers an informal proof on pp. 548-549).
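As a concrete illustration of these formulas, here is a short Python sketch (the function
name ols_fit is ours, chosen for this example):

    import numpy as np

    def ols_fit(x, y):
        # b = s_xy / s_x^2 and a = ybar - b*xbar, as given above
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        xbar, ybar = x.mean(), y.mean()
        s_xy = np.sum((x - xbar) * (y - ybar)) / (len(x) - 1)   # sample covariance
        s_x2 = np.sum((x - xbar) ** 2) / (len(x) - 1)           # sample variance of x
        b = s_xy / s_x2
        a = ybar - b * xbar
        return a, b

    a, b = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])   # toy data on the line y = 1 + 2x
    print(a, b)                                  # -> 1.0 2.0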

Question: Suppose x = x̄. What does ŷ equal?

ŷ = a + bx̄ = ȳ − bx̄ + bx̄ = ȳ

Hence, the regression line passes through the point (x̄, ȳ).
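This property is easy to confirm numerically, using the ols_fit sketch from above (any
data will do):

    import numpy as np

    x = [0.0, 1.0, 2.0, 4.0]
    y = [1.2, 2.9, 5.1, 8.8]
    a, b = ols_fit(x, y)                   # ols_fit as defined in the earlier sketch
    xbar, ybar = np.mean(x), np.mean(y)
    print(a + b * xbar, ybar)              # the two numbers match: yhat(xbar) = ybar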



Question: Suppose b = 0. What does ŷ equal?

ŷ = a + bx = ȳ − bx̄ + bx = ȳ (with b = 0, both b terms drop out)

Hence, if the slope is zero, the best estimate of ŷ is ȳ; put another way, knowing x is of
no value to you when predicting y.

C. Hypothesis testing. We are interested in whether the population parameter β
differs from zero. If β = 0, then knowing X is of no use to us (since no matter what the
value of X is, E(Y | X) = μY). Therefore, we want to test

H0: β = 0
HA: β ≠ 0

To do this, we have to make certain assumptions:

0. The relationship between x and y is linear. This may or may not be a reasonable
assumption.

1. cov(ε, x) = 0. The error terms are independent of the values of X. That is, the size of
X has no relation to the size of the error term. An example of where this might not be
true: errors in predicted income may get bigger as education increases. At low education
levels, most errors may be within a few thousand dollars; at higher education levels the
errors may tend to be in the tens of thousands.

2. ε ~ Normal. Most of the actual values of Y will be close to the regression line.

3. E(ε) = 0. The average error will be zero; positive errors will be offset by
negative errors.

4. Cov(εj, εk) = 0 for j ≠ k. Knowing one error term tells you nothing about the value of
another error term. A violation of this assumption might occur if samples are not
independent of each other (e.g. husbands and their wives are treated as separate cases in
one sample). Serial correlation is another common violation (errors are correlated across
time, as when you collect data on industries at multiple points in time).

5. V(εi | xi) = σε² for all x. This is referred to as the assumption of homoskedasticity.
Populations that do not have a constant variance are heteroskedastic.
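In practice, the test of H0: β = 0 is run with software. As a sketch, scipy's linregress
reports the two-sided p-value for exactly this null hypothesis (the data here are simulated,
with a hypothetical true β = 0.5):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 100)                    # hypothetical predictor values
    y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 100)    # true beta = 0.5
    res = stats.linregress(x, y)
    print(res.slope, res.intercept)                # b and a
    print(res.pvalue)                              # two-sided p-value for H0: beta = 0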

D. Gauss-Markov Theorem. If assumptions 1, 3, 4, and 5 hold true, then the
estimators a and b determined by the least squares method are BLUE (Best Linear Unbiased
Estimators), i.e. they are unbiased and have the smallest variance of any linear unbiased
estimator.
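Unbiasedness is easy to check by simulation: across many samples drawn from a known
population line, the estimated slopes b should average out to the true β. A minimal Monte
Carlo sketch (all parameter values hypothetical):

    import numpy as np

    rng = np.random.default_rng(42)
    alpha, beta, sigma, n = 2.0, 0.5, 1.0, 50     # hypothetical population values
    slopes = []
    for _ in range(5000):
        x = rng.uniform(0, 10, n)
        y = alpha + beta * x + rng.normal(0, sigma, n)
        xbar = x.mean()
        slopes.append(np.sum((x - xbar) * (y - y.mean())) / np.sum((x - xbar) ** 2))
    print(np.mean(slopes))   # close to beta = 0.5: b is unbiased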
