OpenStax Chapter 12 PowerPoint
This presentation is based on material and graphs from OpenStax and is copyrighted by OpenStax and Georgia Highlands College.
INTRODUCTION
Professionals often want to know how two or more numeric variables are
related.
For example, is there a relationship between the grade on the second
math exam a student takes and the grade on the final exam? If there is
a relationship, what is the relationship and how strong is it?
In another example, your income may be determined by your education,
your profession, your years of experience, and your ability. The amount
you pay a repair person for labor is often determined by an initial
amount plus an hourly fee.
The type of data described in the examples is bivariate data — "bi" for
two variables. In reality, statisticians use multivariate data, meaning
many variables. In this chapter, you will be studying the simplest form
of regression, "linear regression" with one independent variable (x).
This involves data that fits a line in two dimensions. You will also study
correlation, which measures how strong the relationship is.
12.1 | LINEAR EQUATIONS
LINEAR EQUATIONS
Linear regression for two variables is based on a linear equation
with one independent variable.
From algebra, recall that a linear equation has the form y = a + bx, where
a is the y-intercept and b is the slope. The slope is a number that describes
the steepness of a line, and the y-intercept is the y-coordinate of the
point (0, a) where the line crosses the y-axis.
THREE GRAPHS FOR LINEAR EQUATIONS
The third exam score, x, is the independent variable and the final exam score, y,
is the dependent variable. We will plot a regression line that best "fits" the data.
If each of you were to fit a line "by eye," you would draw different lines. We can
use what is called a least-squares regression line to obtain the best fit line.
Y-HAT
If the observed data point lies above the line, the residual is positive, and the
line underestimates the actual data value for y. If the observed data point
lies below the line, the residual is negative, and the line overestimates the
actual data value for y. In the diagram, y0 − ŷ0 = ε0 is the residual for the
point shown. Here the point lies above the line and the residual is positive.
SUM OF SQUARED ERRORS (SSE)
ε = the Greek letter epsilon
For each data point, you can calculate the residuals, or errors,
yi − ŷi = εi for i = 1, 2, 3, ..., 11.
Each |ε| is a vertical distance.
If you square each ε and add them together, you get the Sum of Squared Errors (SSE).
Using calculus, you can determine the values of a and b that make
the SSE a minimum. When you make the SSE a minimum, you have
determined the points that are on the line of best fit.
It turns out that the line of best fit has the equation: ŷ = a + bx
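The least-squares calculation described above can be sketched directly in code. This is an illustrative sketch, not the textbook's worked example: the data are made up, and the slope and intercept come from the standard closed-form formulas that calculus gives when the partial derivatives of the SSE are set to zero.

```python
# Sketch of a least-squares fit y-hat = a + b*x that minimizes the SSE.
# The data below are made up for illustration (not from the textbook).

def least_squares(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x minimizing the SSE."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution from setting the SSE's partial derivatives to zero.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

def sse(xs, ys, a, b):
    """Sum of squared errors: the sum of (y_i - y-hat_i)**2."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 11]
a, b = least_squares(xs, ys)
print(round(a, 2), round(b, 2))   # 0.0 2.6
print(round(sse(xs, ys, a, b), 2))  # 1.2
```

Any other choice of a and b applied to these same points yields an SSE larger than this minimum, which is exactly what "best fit" means here.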
LINEAR REGRESSION
The process of fitting the best-fit line is called linear
regression.
The idea behind finding the best-fit line is based on
the assumption that the data are scattered about a
straight line.
The criterion for the best-fit line is that the sum of the
squared errors (SSE) is minimized, that is, made as
small as possible.
Any other line you might choose would have a
higher SSE than the best-fit line.
TO CREATE A SCATTERPLOT AND THE LINE OF BEST FIT IN THE CALCULATOR
DATA ENTRY
Press [STAT] and choose 1: Edit to open the list editor.
Enter Variable 1 (x — independent, predictor) into list L1.
Enter Variable 2 (y — dependent, response) into list L2.
Be sure to keep the data paired as presented.
Then go to the regression command:
ALGEBRAIC INTERPRETATION OF CORRELATION
Press [STAT] and arrow right to highlight “CALC”.
Scroll down and choose option 4: LinReg(ax+b).
Indicate the location of the x-variable and y-variable:
Enter L1 (default) — press [ENTER].
Enter L2 (default) — press [ENTER].
Arrow down three times to highlight “Calculate”.
Press [ENTER] to see the results.
(a) A scatter plot showing data with a positive correlation: 0 < r < 1
(b) A scatter plot showing data with a negative correlation: −1 < r < 0
(c) A scatter plot showing data with zero correlation: r = 0
THE COEFFICIENT OF DETERMINATION “R²”
The variable r² is called the coefficient of determination and is the
square of the correlation coefficient, but is usually stated as a
percent, rather than in decimal form. It has an interpretation in
the context of the data:
• r², when expressed as a percent, represents the percent of
variation in the dependent (predicted) variable y that can be
explained by variation in the independent (explanatory) variable x
using the regression (best-fit) line.
• 1 − r², when expressed as a percentage, represents the percent of
variation in y that is NOT explained by variation in x using the
regression line. This can be seen as the scattering of the observed
data points about the regression line.
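The quantities above can be sketched in a few lines of code. The data here are made up for illustration (not the textbook's exam data); the function implements the usual definition of the sample correlation coefficient r, and r² follows by squaring.

```python
# Sketch: correlation coefficient r and coefficient of determination r**2.
# The data are made up for illustration (not from the textbook).

def correlation(xs, ys):
    """Pearson sample correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 11]
r = correlation(xs, ys)
r_squared = r ** 2
print(round(r, 4))          # strength and direction of the linear relationship
print(round(r_squared, 4))  # share of the variation in y explained by x
```

Expressed as a percent, r_squared is the share of the variation in y explained by x through the regression line, and 1 − r_squared is the share left unexplained.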
THE COEFFICIENT OF DETERMINATION “R²”
Consider the third exam/final exam example introduced in the previous section.
• The line of best fit is: ŷ = −173.51 + 4.83x
• The correlation coefficient is r = 0.6631
• The coefficient of determination is r² = (0.6631)² = 0.4397
• Interpretation of r² in the context of this example:
• Approximately 44% of the variation (0.4397 is approximately 0.44) in the
final-exam grades can be explained by the variation in the grades on the
third exam, using the best-fit regression line.
• Therefore, approximately 56% of the variation (1 − 0.44 = 0.56) in the final
exam grades cannot be explained by the variation in the grades on the
third exam, using the best-fit regression line. (This is seen as the scattering
of the points about the line.)
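The arithmetic in this example can be checked in two lines, using the slides' value r = 0.6631:

```python
# Check the slides' r-squared arithmetic for r = 0.6631.
r = 0.6631
r_squared = r ** 2
print(round(r_squared, 4))                # 0.4397 -> about 44% explained
print(round(1 - round(r_squared, 2), 2))  # 0.56   -> about 56% unexplained
```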
12.4 TESTING THE SIGNIFICANCE OF THE CORRELATION COEFFICIENT
The correlation coefficient, r, tells us about the strength and
direction of the linear relationship between x and y.
However, the reliability of the linear model also depends on
how many observed data points are in the sample.
We need to look at both the value of the correlation
coefficient r and the sample size n, together.
We perform a hypothesis test of the "significance of the
correlation coefficient" to decide whether the linear
relationship in the sample data is strong enough to use to
model the relationship in the population.
12.4 TESTING THE SIGNIFICANCE OF THE CORRELATION COEFFICIENT
The sample data are used to compute r, the correlation coefficient for
the sample. If we had data for the entire population, we could find the
population correlation coefficient. But because we only have sample
data, we cannot calculate the population correlation coefficient.
The sample correlation coefficient, r, is our estimate of the unknown
population correlation coefficient.
The symbol for the population correlation coefficient is ρ, the Greek
letter "rho."
ρ = population correlation coefficient (unknown)
r = sample correlation coefficient (known; calculated from sample data)
12.4 POPULATION CORRELATION COEFFICIENT
The hypothesis test lets us decide whether the value of the population
correlation coefficient ρ is "close to zero" or "significantly different from
zero". We decide this based on the sample correlation coefficient r and
the sample size n.
If the test concludes that the correlation coefficient is significantly
different from zero, we say that the correlation coefficient is
"significant."
• Conclusion: There is sufficient evidence to conclude that there is a
significant linear relationship between x and y because the correlation
coefficient is significantly different from zero.
• What the conclusion means: There is a significant linear relationship
between x and y. We can use the regression line to model the linear
relationship between x and y in the population.
12.4 POPULATION CORRELATION COEFFICIENT
If the test concludes that the correlation coefficient is not
significantly different from zero (it is close to zero), we say
that the correlation coefficient is "not significant".
• Conclusion: "There is insufficient evidence to conclude that
there is a significant linear relationship between x and y
because the correlation coefficient is not significantly
different from zero."
• What the conclusion means: There is not a significant linear
relationship between x and y. Therefore, we CANNOT use the
regression line to model a linear relationship between x and y in the population.
REMEMBER
•If r is significant and the scatterplot shows a linear
trend, the line can be used to predict the value of y
for values of x that are within the domain of
observed x values.
•If r is not significant OR if the scatterplot does not
show a linear trend, the line should not be used for
prediction.
• If r is significant and if the scatter plot shows a
linear trend, the line may NOT be appropriate or
reliable for prediction OUTSIDE the domain of
observed x values in the data.
PERFORMING THE HYPOTHESIS TEST
• Null Hypothesis: H0: ρ = 0
• Alternate Hypothesis: Ha: ρ ≠ 0
WHAT THE HYPOTHESES MEAN IN WORDS:
• Null Hypothesis H0: The population correlation coefficient IS
NOT significantly different from zero. There IS NOT a significant
linear relationship (correlation) between x and y in the
population.
•Alternate Hypothesis Ha: The population correlation
coefficient IS significantly DIFFERENT FROM zero. There IS A
SIGNIFICANT LINEAR RELATIONSHIP (correlation) between x and
y in the population.
DRAWING A CONCLUSION: METHOD 1 (USING THE P-VALUE)
If the p-value is less than the significance level (α = 0.05):
• Decision: Reject the null hypothesis.
• Conclusion: "There is sufficient evidence to conclude that there is
a significant linear relationship between x and y because the
correlation coefficient is significantly different from zero."
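The test can be sketched with the usual test statistic t = r·√(n − 2)/√(1 − r²), which has n − 2 degrees of freedom. The value r = 0.6631 comes from the slides' third exam/final exam example; n = 11 is assumed (it matches the i = 1, ..., 11 residuals mentioned earlier), and the critical value 2.262 for df = 9 at α = 0.05 is taken from a standard t-table.

```python
import math

# Test statistic for H0: rho = 0 versus Ha: rho != 0:
#   t = r * sqrt(n - 2) / sqrt(1 - r**2), with df = n - 2.
def t_statistic(r, n):
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.6631  # sample correlation from the third exam / final exam example
n = 11      # number of data points (assumed, matching the 11 residuals)

t = t_statistic(r, n)
t_critical = 2.262  # two-tailed critical value, df = 9, alpha = 0.05 (t-table)
print(round(t, 3))          # the computed test statistic
print(abs(t) > t_critical)  # True -> reject H0: r is significant
```

Since |t| exceeds the critical value, the p-value is below 0.05 and we reject H0, matching the conclusion stated above.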
ASSUMPTIONS IN TESTING THE SIGNIFICANCE OF THE CORRELATION COEFFICIENT
The y values for each x value are normally distributed about the line with the same standard deviation. For each x value, the mean of the y values lies on the regression line. More y values lie near the line than are scattered further away from the line.
12.5 PREDICTION
PREDICTION
Recall the third exam/final exam example. We examined the
scatterplot and showed that the correlation coefficient is
significant. We found the equation of the best-fit line for the final
exam grade as a function of the grade on the third-exam. We can
now use the least-squares regression line for prediction.
Suppose you want to estimate, or predict, the mean final exam
score of statistics students who received 73 on the third exam. The
exam scores (x-values) range from 65 to 75. Since 73 is between the
x-values 65 and 75, substitute x = 73 into the equation.
Then: ŷ = −173.51 + 4.83(73) = 179.08
We predict that statistics students who earn a grade of 73 on the
third exam will earn a grade of 179.08 on the final exam, on
average.
PREDICTION
Recall the third exam/final exam example.
ŷ = −173.51 + 4.83x
a. What would you predict the final exam score to be for a student
who scored a 66 on the third exam?
b. What would you predict the final exam score to be for a student
who scored a 90 on the third exam?
PREDICTION
a. What would you predict the final exam score to be for a student who scored a
66 on the third exam?
145.27
b. What would you predict the final exam score to be for a student who scored a
90 on the third exam?
The x values in the data are between 65 and 75. Ninety is outside of the domain
of the observed x values in the data (independent variable), so you cannot
reliably predict the final exam score for this student. (Even though it is possible
to enter 90 into the equation for x and calculate a corresponding y value, the y
value that you get will not be reliable.) To really understand how unreliable
the prediction can be outside of the observed x values in the data, make
the substitution x = 90 into the equation.
ŷ = −173.51 + 4.83(90) = 261.19
The final-exam score is predicted to be 261.19. The largest the final-exam score
can be is 200.
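The prediction rule above can be sketched as a small function that refuses to extrapolate. The regression line and the observed x-range 65 to 75 come from the slides; raising an error for out-of-range inputs is one illustrative design choice, not the textbook's method.

```python
# Prediction with the slides' regression line y-hat = -173.51 + 4.83*x,
# refusing to extrapolate outside the observed x-range [65, 75].

X_MIN, X_MAX = 65, 75  # range of observed third-exam scores (from the slides)

def predict_final(third_exam_score):
    """Predict the final-exam score, or raise for extrapolation."""
    if not (X_MIN <= third_exam_score <= X_MAX):
        raise ValueError(
            f"x = {third_exam_score} is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; the prediction would not be reliable."
        )
    return -173.51 + 4.83 * third_exam_score

print(round(predict_final(73), 2))  # 179.08 (interpolation: 65 <= 73 <= 75)
print(round(predict_final(66), 2))  # 145.27
# predict_final(90) raises ValueError: 90 lies outside [65, 75]
```

Guarding the domain in code makes the interpolation/extrapolation distinction explicit: inside the range the line predicts; outside, it declines rather than return a value like 261.19 that the data cannot support.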
INTERPOLATION AND EXTRAPOLATION
The process of predicting within the range of the observed x values in the data is called interpolation; predicting outside that range is called extrapolation.
The new line, with r = 0.9121, shows a stronger correlation than the original
(r = 0.6631) because r = 0.9121 is closer to one. This means that the new
line is a better fit for the ten remaining data values. The line can better
predict the final exam score given the third exam score.