
Chapter 10

Correlation and Regression

McGraw-Hill, Bluman, 7th ed., Chapter 10 1


Chapter 10 Overview
Introduction
• 10-1 Scatter Plots and Correlation
• 10-2 Regression
• 10-3 Coefficient of Determination and Standard Error of the Estimate
• 10-4 Multiple Regression (Optional)

Bluman, Chapter 10 2
Chapter 10 Objectives
1. Draw a scatter plot for a set of ordered pairs.
2. Compute the correlation coefficient.
3. Test the hypothesis H0: ρ = 0.
4. Compute the equation of the regression line.
5. Compute the coefficient of determination.
6. Compute the standard error of the estimate.
7. Find a prediction interval.
8. Be familiar with the concept of multiple
regression.
Bluman, Chapter 10 3
Introduction
• In addition to hypothesis testing and confidence intervals, inferential statistics involves determining whether a relationship between two or more numerical or quantitative variables exists.

Bluman, Chapter 10 4
Introduction
• Correlation is a statistical method used to determine whether a linear relationship between variables exists.
• Regression is a statistical method used to describe the nature of the relationship between variables; that is, positive or negative, linear or nonlinear.

Bluman, Chapter 10 5
Introduction
• The purpose of this chapter is to answer these questions statistically:
1. Are two or more variables related?
2. If so, what is the strength of the relationship?
3. What type of relationship exists?
4. What kind of predictions can be made from the relationship?
Bluman, Chapter 10 6
Introduction
1. Are two or more variables related?
2. If so, what is the strength of the relationship?

To answer these two questions, statisticians use the correlation coefficient, a numerical measure used to determine whether two or more variables are related and to determine the strength of the relationship between or among the variables.

Bluman, Chapter 10 7
Introduction
3. What type of relationship exists?

There are two types of relationships: simple and multiple.
In a simple relationship, there are two variables: an independent variable (predictor variable) and a dependent variable (response variable).
In a multiple relationship, there are two or more independent variables that are used to predict one dependent variable.
Bluman, Chapter 10 8
Introduction
4. What kind of predictions can be made from the relationship?

Predictions are made daily in all areas. Examples include weather forecasting, stock market analyses, sales predictions, crop predictions, gasoline price predictions, and sports predictions. Some predictions are more accurate than others, due to the strength of the relationship. That is, the stronger the relationship is between variables, the more accurate the prediction is.

Bluman, Chapter 10 9
10.1 Scatter Plots and Correlation
• A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable x and the dependent variable y.

Bluman, Chapter 10 10
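As an added illustration (not part of the original slides), a scatter plot like the one in Example 10-1 could be drawn in Python with matplotlib; the data values used here are the car rental figures that appear later in Example 10-4:

import matplotlib.pyplot as plt

# Car rental data used in Examples 10-1 and 10-4
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]    # x: cars (in 10,000s)
income = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]       # y: income (in billions)

plt.scatter(cars, income)                      # Step 2: plot each (x, y) point
plt.xlabel("Cars (in 10,000s)")                # Step 1: label the axes
plt.ylabel("Income (in billions)")
plt.title("Car Rental Companies")
plt.show()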
Chapter 10
Correlation and Regression

Section 10-1
Example 10-1
Page #536
Bluman, Chapter 10 11
Example 10-1: Car Rental Companies
Construct a scatter plot for the data shown for car rental
companies in the United States for a recent year.

Step 1: Draw and label the x and y axes.
Step 2: Plot each point on the graph.

Bluman, Chapter 10 12
Example 10-1: Car Rental Companies

Positive Relationship

Bluman, Chapter 10 13
Chapter 10
Correlation and Regression

Section 10-1
Example 10-2
Page #537
Bluman, Chapter 10 14
Example 10-2: Absences/Final Grades
Construct a scatter plot for the data obtained in a study on
the number of absences and the final grades of seven
randomly selected students from a statistics class.

Step 1: Draw and label the x and y axes.
Step 2: Plot each point on the graph.
Bluman, Chapter 10 15
Example 10-2: Absences/Final Grades

Negative Relationship

Bluman, Chapter 10 16
Chapter 10
Correlation and Regression

Section 10-1
Example 10-3
Page #538
Bluman, Chapter 10 17
Example 10-3: Exercise/Milk Intake
Construct a scatter plot for the data obtained in a study on
the number of hours that nine people exercise each week
and the amount of milk (in ounces) each person
consumes per week.

Step 1: Draw and label the x and y axes.
Step 2: Plot each point on the graph.
Bluman, Chapter 10 18
Example 10-3: Exercise/Milk Intake

Very Weak Relationship

Bluman, Chapter 10 19
Correlation
• The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables.
• There are several types of correlation coefficients. The one explained in this section is called the Pearson product moment correlation coefficient (PPMC).
• The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is ρ.
Bluman, Chapter 10 20
Correlation
• The range of the correlation coefficient is from −1 to +1.
• If there is a strong positive linear relationship between the variables, the value of r will be close to +1.
• If there is a strong negative linear relationship between the variables, the value of r will be close to −1.

Bluman, Chapter 10 21
Correlation

Bluman, Chapter 10 22
Correlation Coefficient
The formula for the correlation coefficient is

r = [n(Σxy) − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}

where n is the number of data pairs.

Rounding Rule: Round to three decimal places.

Bluman, Chapter 10 23
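A short Python sketch of this computational formula (an illustration added here, not from the Bluman slides); with the car rental data of Example 10-4 it reproduces r = 0.982:

from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient from the sums formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sx2 = sum(xi ** 2 for xi in x)
    sy2 = sum(yi ** 2 for yi in y)
    num = n * sxy - sx * sy
    den = sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
    return num / den

# Car rental data from Example 10-4
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
income = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]
print(round(pearson_r(cars, income), 3))   # 0.982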
Chapter 10
Correlation and Regression

Section 10-1
Example 10-4
Page #540
Bluman, Chapter 10 24
Example 10-4: Car Rental Companies
Compute the correlation coefficient for the data in Example 10-1.

Company   Cars x (in 10,000s)   Income y (in billions)   xy        x²         y²
A         63.0                  7.0                      441.00    3969.00    49.00
B         29.0                  3.9                      113.10     841.00    15.21
C         20.8                  2.1                       43.68     432.64     4.41
D         19.1                  2.8                       53.48     364.81     7.84
E         13.4                  1.4                       18.76     179.56     1.96
F          8.5                  1.5                       12.75      72.25     2.25
          Σx = 153.8            Σy = 18.7                Σxy = 682.77   Σx² = 5859.26   Σy² = 80.67

Bluman, Chapter 10 25
Example 10-4: Car Rental Companies
Compute the correlation coefficient for the data in Example 10-1.
Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26, Σy² = 80.67, n = 6

r = [n(Σxy) − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}

r = [6(682.77) − (153.8)(18.7)] / √{[6(5859.26) − (153.8)²][6(80.67) − (18.7)²]}

r = 0.982 (strong positive relationship)

Bluman, Chapter 10 26
Chapter 10
Correlation and Regression

Section 10-1
Example 10-5
Page #541
Bluman, Chapter 10 27
Example 10-5: Absences/Final Grades
Compute the correlation coefficient for the data in Example 10-2.

Student   Number of absences, x   Final grade y (pct.)   xy      x²     y²
A          6                      82                     492     36     6,724
B          2                      86                     172      4     7,396
C         15                      43                     645    225     1,849
D          9                      74                     666     81     5,476
E         12                      58                     696    144     3,364
F          5                      90                     450     25     8,100
G          8                      78                     624     64     6,084
          Σx = 57                 Σy = 511               Σxy = 3745   Σx² = 579   Σy² = 38,993

Bluman, Chapter 10 28
Example 10-5: Absences/Final Grades
Compute the correlation coefficient for the data in Example 10-2.
Σx = 57, Σy = 511, Σxy = 3745, Σx² = 579, Σy² = 38,993, n = 7

r = [n(Σxy) − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}

r = [7(3745) − (57)(511)] / √{[7(579) − (57)²][7(38,993) − (511)²]}

r = −0.944 (strong negative relationship)

Bluman, Chapter 10 29
Chapter 10
Correlation and Regression

Section 10-1
Example 10-6
Page #542
Bluman, Chapter 10 30
Example 10-6: Exercise/Milk Intake
Compute the correlation coefficient for the data in Example 10-3.

Subject   Hours, x   Amount y   xy      x²     y²
A          3         48         144      9     2,304
B          0          8           0      0        64
C          2         32          64      4     1,024
D          5         64         320     25     4,096
E          8         10          80     64       100
F          5         32         160     25     1,024
G         10         56         560    100     3,136
H          2         72         144      4     5,184
I          1         48          48      1     2,304
          Σx = 36    Σy = 370   Σxy = 1,520   Σx² = 232   Σy² = 19,236

Bluman, Chapter 10 31
Example 10-6: Exercise/Milk Intake
Compute the correlation coefficient for the data in Example 10-3.
Σx = 36, Σy = 370, Σxy = 1520, Σx² = 232, Σy² = 19,236, n = 9

r = [n(Σxy) − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}

r = [9(1520) − (36)(370)] / √{[9(232) − (36)²][9(19,236) − (370)²]}

r = 0.067 (very weak relationship)

Bluman, Chapter 10 32
Hypothesis Testing
• In hypothesis testing, one of the following is true:

H0: ρ = 0   This null hypothesis means that there is no correlation between the x and y variables in the population.

H1: ρ ≠ 0   This alternative hypothesis means that there is a significant correlation between the variables in the population.
Bluman, Chapter 10 33
t Test for the Correlation Coefficient

t = r √[(n − 2) / (1 − r²)]

with degrees of freedom equal to n − 2.

Bluman, Chapter 10 34
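A minimal Python sketch of this t test (added for illustration), using r = 0.982 and n = 6 from Example 10-7 and the Table F critical value ±2.776 quoted there:

from math import sqrt

def t_for_r(r, n):
    """t test value for a sample correlation coefficient (d.f. = n - 2)."""
    return r * sqrt((n - 2) / (1 - r ** 2))

r, n = 0.982, 6
t = t_for_r(r, n)                 # about 10.4
critical = 2.776                  # from Table F: two-tailed, alpha = 0.05, d.f. = 4
print(round(t, 1), "reject H0" if abs(t) > critical else "do not reject H0")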
Chapter 10
Correlation and Regression

Section 10-1
Example 10-7
Page #544
Bluman, Chapter 10 35
Example 10-7: Car Rental Companies
Test the significance of the correlation coefficient found in Example 10-4. Use α = 0.05 and r = 0.982.

Step 1: State the hypotheses.
H0: ρ = 0 and H1: ρ ≠ 0

Step 2: Find the critical value.
Since α = 0.05 and there are 6 − 2 = 4 degrees of freedom, the critical values obtained from Table F are ±2.776.

Bluman, Chapter 10 36
Example 10-7: Car Rental Companies
Step 3: Compute the test value.

t = r √[(n − 2) / (1 − r²)] = 0.982 √[(6 − 2) / (1 − 0.982²)] = 10.4

Step 4: Make the decision.
Since the test value 10.4 exceeds the critical value 2.776, reject the null hypothesis.

Step 5: Summarize the results.
There is a significant relationship between the number of cars a rental agency owns and its annual income.

Bluman, Chapter 10 37
Chapter 10
Correlation and Regression

Section 10-1
Example 10-8
Page #545
Bluman, Chapter 10 38
Example 10-8: Exercise/Milk Intake
Using Table I, test the significance of the correlation coefficient r = 0.067, from Example 10-6, at α = 0.01.

Step 1: State the hypotheses.
H0: ρ = 0 and H1: ρ ≠ 0

There are 9 − 2 = 7 degrees of freedom. The value in Table I when α = 0.01 is 0.798.
For a significant relationship, r must be greater than 0.798 or less than −0.798. Since r = 0.067, do not reject the null hypothesis.
Hence, there is not enough evidence to say that there is a significant linear relationship between the variables.

Bluman, Chapter 10 39
Possible Relationships Between Variables
When the null hypothesis has been rejected for a specific α value, any of the following five possibilities can exist.
1. There is a direct cause-and-effect relationship between the variables. That is, x causes y.
2. There is a reverse cause-and-effect relationship between the variables. That is, y causes x.
3. The relationship between the variables may be caused by a third variable.
4. There may be a complexity of interrelationships among many variables.
5. The relationship may be coincidental.
Bluman, Chapter 10 40
Possible Relationships Between Variables
1. There is a direct cause-and-effect relationship between the variables. That is, x causes y.

For example,
• water causes plants to grow
• poison causes death
• heat causes ice to melt

Bluman, Chapter 10 41
Possible Relationships Between
Variables
2. There is a reverse cause-and-effect relationship
between the variables. That is, y causes x.

For example,
• Suppose a researcher believes excessive coffee
consumption causes nervousness, but the
researcher fails to consider that the reverse
situation may occur. That is, it may be that an
extremely nervous person craves coffee to calm
his or her nerves.

Bluman, Chapter 10 42
Possible Relationships Between
Variables
3. The relationship between the variables may be
caused by a third variable.

For example,
• If a statistician correlated the number of deaths
due to drowning and the number of cans of soft
drink consumed daily during the summer, he or
she would probably find a significant relationship.
However, the soft drink is not necessarily
responsible for the deaths, since both variables
may be related to heat and humidity.

Bluman, Chapter 10 43
Possible Relationships Between
Variables
4. There may be a complexity of interrelationships
among many variables.

For example,
• A researcher may find a significant relationship
between students’ high school grades and college
grades. But there probably are many other
variables involved, such as IQ, hours of study,
influence of parents, motivation, age, and
instructors.

Bluman, Chapter 10 44
Possible Relationships Between
Variables
5. The relationship may be coincidental.

For example,
• A researcher may be able to find a significant
relationship between the increase in the number of
people who are exercising and the increase in the
number of people who are committing crimes. But
common sense dictates that any relationship
between these two values must be due to
coincidence.

Bluman, Chapter 10 45
10.2 Regression
• If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line, which is the data's line of best fit.

Bluman, Chapter 10 46
Regression
• Best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum.

Bluman, Chapter 10 47
Regression Line   y′ = a + bx

a = [(Σy)(Σx²) − (Σx)(Σxy)] / [n(Σx²) − (Σx)²]

b = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]

where
a = y′ intercept
b = the slope of the line.

Bluman, Chapter 10 48
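These two formulas can be checked with a small Python sketch (an added illustration, not from the slides); with the Example 10-4 data it gives a = 0.396 and b = 0.106, plus the prediction for x = 20 used later in Example 10-11:

def regression_line(x, y):
    """Return (a, b) for the least-squares line y' = a + bx."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sx2 = sum(xi ** 2 for xi in x)
    denom = n * sx2 - sx ** 2
    a = (sy * sx2 - sx * sxy) / denom        # y' intercept
    b = (n * sxy - sx * sy) / denom          # slope (marginal change)
    return a, b

cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
income = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]
a, b = regression_line(cars, income)
print(round(a, 3), round(b, 3))              # 0.396 0.106
print(round(a + b * 20, 3))                  # about 2.519; the slides get 2.516 from the rounded coefficients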
Chapter 10
Correlation and Regression

Section 10-2
Example 10-9
Page #553
Bluman, Chapter 10 49
Example 10-9: Car Rental Companies
Find the equation of the regression line for the data in Example 10-4, and graph the line on the scatter plot.
Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26, Σy² = 80.67, n = 6

a = [(Σy)(Σx²) − (Σx)(Σxy)] / [n(Σx²) − (Σx)²]
  = [(18.7)(5859.26) − (153.8)(682.77)] / [6(5859.26) − (153.8)²] = 0.396

b = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
  = [6(682.77) − (153.8)(18.7)] / [6(5859.26) − (153.8)²] = 0.106

y′ = a + bx, so y′ = 0.396 + 0.106x

Bluman, Chapter 10 50
Example 10-9: Car Rental Companies
Find two points to sketch the graph of the regression line.

Use any x values between 10 and 60. For example, let x equal 15 and 40. Substitute in the equation and find the corresponding y′ value.

y′ = 0.396 + 0.106(15) = 1.986
y′ = 0.396 + 0.106(40) = 4.636

Plot (15, 1.986) and (40, 4.636), and sketch the resulting line.

Bluman, Chapter 10 51
Example 10-9: Car Rental Companies
Find the equation of the regression line for the data in Example 10-4, and graph the line on the scatter plot.

y′ = 0.396 + 0.106x, drawn through (15, 1.986) and (40, 4.636).

Bluman, Chapter 10 52
Chapter 10
Correlation and Regression

Section 10-2
Example 10-11
Page #555
Bluman, Chapter 10 53
Example 10-11: Car Rental Companies
Use the equation of the regression line to predict the income of a car rental agency that has 200,000 automobiles.

x = 20 corresponds to 200,000 automobiles.

y′ = 0.396 + 0.106x = 0.396 + 0.106(20) = 2.516

Hence, when a rental agency has 200,000 automobiles, its revenue will be approximately $2.516 billion.

Bluman, Chapter 10 54
Regression
• The magnitude of the change in one variable when the other variable changes exactly 1 unit is called a marginal change. The value of the slope b of the regression line equation represents the marginal change.
• For valid predictions, the value of the correlation coefficient must be significant.
• When r is not significantly different from 0, the best predictor of y is the mean of the data values of y.

Bluman, Chapter 10 55
Assumptions for Valid Predictions
1. For any specific value of the independent
variable x, the value of the dependent variable
y must be normally distributed about the
regression line. See Figure 10–16(a).
2. The standard deviation of each of the
dependent variables must be the same for
each value of the independent variable. See
Figure 10–16(b).

Bluman, Chapter 10 56
Extrapolations (Future Predictions)
• Extrapolation, or making predictions beyond the bounds of the data, must be interpreted cautiously.
• Remember that when predictions are made, they are based on present conditions or on the premise that present trends will continue. This assumption may or may not prove true in the future.

Bluman, Chapter 10 57
Procedure Table
Step 1: Make a table with subject, x, y, xy, x², and y² columns.
Step 2: Find the values of xy, x², and y². Place them in the appropriate columns and sum each column.
Step 3: Substitute in the formula to find the value of r.
Step 4: When r is significant, substitute in the formulas to find the values of a and b for the regression line equation y′ = a + bx.

Bluman, Chapter 10 58
10.3 Coefficient of Determination and Standard Error of the Estimate
• The total variation, Σ(y − ȳ)², is the sum of the squares of the vertical distances each point is from the mean.
• The total variation can be divided into two parts: that which is attributed to the relationship of x and y, and that which is due to chance.

Bluman, Chapter 10 59
Variation
• The variation obtained from the relationship (i.e., from the predicted y′ values) is Σ(y′ − ȳ)² and is called the explained variation.
• The variation due to chance, found by Σ(y − y′)², is called the unexplained variation. This variation cannot be attributed to the relationship.

Bluman, Chapter 10 60
Variation

Bluman, Chapter 10 61
Coefficient of Determination
• The coefficient of determination is the ratio of the explained variation to the total variation.
• The symbol for the coefficient of determination is r².

r² = explained variation / total variation

• Another way to arrive at the value for r² is to square the correlation coefficient.
Bluman, Chapter 10 62
Coefficient of Nondetermination
• The coefficient of nondetermination is a measure of the unexplained variation.
• The formula for the coefficient of nondetermination is 1.00 − r².

Bluman, Chapter 10 63
Standard Error of the Estimate
• The standard error of the estimate, denoted by s_est, is the standard deviation of the observed y values about the predicted y′ values. The formula for the standard error of the estimate is:

s_est = √[Σ(y − y′)² / (n − 2)]

Bluman, Chapter 10 64
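A minimal Python sketch of this definition (added for illustration), using the copy machine data and the regression equation y′ = 55.57 + 8.13x from Example 10-12; it reproduces s_est = 6.51:

from math import sqrt

ages = [1, 2, 3, 4, 4, 6]                    # x: age of machine (years)
costs = [62, 78, 70, 90, 93, 103]            # y: monthly maintenance cost
a, b = 55.57, 8.13                           # regression equation from Example 10-12

predicted = [a + b * x for x in ages]        # y' values
sse = sum((y - yp) ** 2 for y, yp in zip(costs, predicted))
s_est = sqrt(sse / (len(ages) - 2))
print(round(s_est, 2))                       # 6.51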
Chapter 10
Correlation and Regression

Section 10-3
Example 10-12
Page #569
Bluman, Chapter 10 65
Example 10-12: Copy Machine Costs
A researcher collects the following data and determines that there is a significant relationship between the age of a copy machine and its monthly maintenance cost. The regression equation is y′ = 55.57 + 8.13x. Find the standard error of the estimate.

Bluman, Chapter 10 66
Example 10-12: Copy Machine Costs

Machine   Age x (years)   Monthly cost, y   y′        y − y′    (y − y′)²
A         1                62               63.70     −1.70      2.8900
B         2                78               71.83      6.17     38.0689
C         3                70               79.96     −9.96     99.2016
D         4                90               88.09      1.91      3.6481
E         4                93               88.09      4.91     24.1081
F         6               103              104.35     −1.35      1.8225
                                                      Σ(y − y′)² = 169.7392

y′ = 55.57 + 8.13x
y′ = 55.57 + 8.13(1) = 63.70
y′ = 55.57 + 8.13(2) = 71.83
y′ = 55.57 + 8.13(3) = 79.96
y′ = 55.57 + 8.13(4) = 88.09
y′ = 55.57 + 8.13(6) = 104.35

s_est = √[Σ(y − y′)² / (n − 2)] = √(169.7392 / 4) = 6.51

Bluman, Chapter 10 67
Chapter 10
Correlation and Regression

Section 10-3
Example 10-13
Page #570
Bluman, Chapter 10 68
Example 10-13: Copy Machine Costs

s_est = √[(Σy² − aΣy − bΣxy) / (n − 2)]

Bluman, Chapter 10 69
Example 10-13: Copy Machine Costs

Machine   Age x (years)   Monthly cost, y   xy      y²
A         1                62                62     3,844
B         2                78               156     6,084
C         3                70               210     4,900
D         4                90               360     8,100
E         4                93               372     8,649
F         6               103               618    10,609
                           Σy = 496         Σxy = 1,778   Σy² = 42,186

s_est = √[(Σy² − aΣy − bΣxy) / (n − 2)]
      = √{[42,186 − 55.57(496) − 8.13(1,778)] / 4} = 6.48

Bluman, Chapter 10 70
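The same value can be checked with a short Python sketch of this computational formula (an added illustration); the small difference from 6.51 comes from rounding in the regression coefficients:

from math import sqrt

ages = [1, 2, 3, 4, 4, 6]
costs = [62, 78, 70, 90, 93, 103]
a, b = 55.57, 8.13

sy = sum(costs)                                 # 496
sxy = sum(x * y for x, y in zip(ages, costs))   # 1,778
sy2 = sum(y ** 2 for y in costs)                # 42,186
s_est = sqrt((sy2 - a * sy - b * sxy) / (len(ages) - 2))
print(round(s_est, 2))                          # 6.48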
Formula for the Prediction Interval about a Value y′

y′ − t_(α/2) · s_est · √[1 + 1/n + n(x − X̄)² / (nΣx² − (Σx)²)]  <  y
   <  y′ + t_(α/2) · s_est · √[1 + 1/n + n(x − X̄)² / (nΣx² − (Σx)²)]

with d.f. = n − 2
Bluman, Chapter 10 71
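A hedged Python sketch of this interval (added here for illustration); it uses scipy.stats.t.ppf in place of the t table and reproduces, up to rounding, the Example 10-14 interval 60.53 < y < 99.39:

from math import sqrt
from scipy.stats import t

def prediction_interval(x_value, xs, s_est, y_pred, alpha=0.05):
    """Prediction interval about y' for a specific x value (d.f. = n - 2)."""
    n = len(xs)
    sx, sx2 = sum(xs), sum(xi ** 2 for xi in xs)
    x_bar = sx / n
    t_crit = t.ppf(1 - alpha / 2, n - 2)
    margin = t_crit * s_est * sqrt(1 + 1 / n
                                   + n * (x_value - x_bar) ** 2 / (n * sx2 - sx ** 2))
    return y_pred - margin, y_pred + margin

ages = [1, 2, 3, 4, 4, 6]
low, high = prediction_interval(3, ages, s_est=6.48, y_pred=79.96)
print(round(low, 2), round(high, 2))   # close to 60.53 and 99.39 (differences due to rounding)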
Chapter 10
Correlation and Regression

Section 10-3
Example 10-14
Page #571
Bluman, Chapter 10 72
Example 10-14: Copy Machine Costs
For the data in Example 10-12, find the 95% prediction interval for the monthly maintenance cost of a machine that is 3 years old.

Step 1: Find Σx, Σx², and X̄.
Σx = 20, Σx² = 82, X̄ = Σx / n = 20 / 6 = 3.3

Step 2: Find y′ for x = 3.
y′ = 55.57 + 8.13(3) = 79.96

Step 3: Find s_est.
s_est = 6.48 (as shown in Example 10-13)

Bluman, Chapter 10 73
Example 10-14: Copy Machine Costs
Step 4: Substitute in the formula and solve.

y′ − t_(α/2) · s_est · √[1 + 1/n + n(x − X̄)² / (nΣx² − (Σx)²)]  <  y
   <  y′ + t_(α/2) · s_est · √[1 + 1/n + n(x − X̄)² / (nΣx² − (Σx)²)]

79.96 − (2.776)(6.48)√[1 + 1/6 + 6(3 − 3.3)² / (6(82) − (20)²)]  <  y
   <  79.96 + (2.776)(6.48)√[1 + 1/6 + 6(3 − 3.3)² / (6(82) − (20)²)]

Bluman, Chapter 10 74
Example 10-14: Copy Machine Costs
Step 4: Substitute in the formula and solve.

79.96 − (2.776)(6.48)√[1 + 1/6 + 6(3 − 3.3)² / (6(82) − (20)²)]  <  y
   <  79.96 + (2.776)(6.48)√[1 + 1/6 + 6(3 − 3.3)² / (6(82) − (20)²)]

79.96 − 19.43  <  y  <  79.96 + 19.43
60.53  <  y  <  99.39

Hence, you can be 95% confident that the interval 60.53 < y < 99.39 contains the actual value of y.

Bluman, Chapter 10 75
10.4 Multiple Regression (Optional)
In multiple regression, there are several independent variables and one dependent variable, and the equation is

y′ = a + b1x1 + b2x2 + … + bkxk

where
x1, x2, …, xk = independent variables.

Bluman, Chapter 10 76
Assumptions for Multiple Regression
1. normality assumption—for any specific value of the
independent variable, the values of the y variable are
normally distributed.
2. equal-variance assumption—the variances (or
standard deviations) for the y variables are the same
for each value of the independent variable.
3. linearity assumption—there is a linear relationship
between the dependent variable and the independent
variables.
4. nonmulticollinearity assumption—the independent
variables are not correlated.
5. independence assumption—the values for the y
variables are independent.
Bluman, Chapter 10 77
Multiple Correlation Coefficient
• In multiple regression, as in simple regression, the strength of the relationship between the independent variables and the dependent variable is measured by a correlation coefficient.
• This multiple correlation coefficient is symbolized by R.

Bluman, Chapter 10 78
Multiple Correlation Coefficient
The formula for R is

R = √[(r²_yx1 + r²_yx2 − 2 · r_yx1 · r_yx2 · r_x1x2) / (1 − r²_x1x2)]

where
r_yx1 = correlation coefficient for y and x1
r_yx2 = correlation coefficient for y and x2
r_x1x2 = correlation coefficient for x1 and x2
Bluman, Chapter 10 79
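A minimal Python sketch of this formula (an added illustration); with the pairwise correlations from Example 10-15 it returns R = 0.989:

from math import sqrt

def multiple_R(r_yx1, r_yx2, r_x1x2):
    """Multiple correlation coefficient for two independent variables."""
    num = r_yx1 ** 2 + r_yx2 ** 2 - 2 * r_yx1 * r_yx2 * r_x1x2
    return sqrt(num / (1 - r_x1x2 ** 2))

print(round(multiple_R(0.845, 0.791, 0.371), 3))   # 0.989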
Chapter 10
Correlation and Regression

Section 10-4
Example 10-15
Page #576
Bluman, Chapter 10 80
Example 10-15: State Board Scores
A nursing instructor wishes to see whether a student’s
grade point average and age are related to the student’s
score on the state board nursing examination. She
selects five students and obtains the following data.
Find the value of R.

Bluman, Chapter 10 81
Example 10-15: State Board Scores
A nursing instructor wishes to see whether a student’s
grade point average and age are related to the student’s
score on the state board nursing examination. She
selects five students and obtains the following data.
Find the value of R.

The values of the correlation coefficients are
r_yx1 = 0.845
r_yx2 = 0.791
r_x1x2 = 0.371
Bluman, Chapter 10 82
Example 10-15: State Board Scores

R = √[(r²_yx1 + r²_yx2 − 2 · r_yx1 · r_yx2 · r_x1x2) / (1 − r²_x1x2)]

R = √{[(0.845)² + (0.791)² − 2(0.845)(0.791)(0.371)] / [1 − (0.371)²]}

R = 0.989

Hence, the correlation between a student's grade point average and age with the student's score on the nursing state board examination is 0.989. In this case, there is a strong relationship among the variables; the value of R is close to 1.00.
Bluman, Chapter 10 83
F Test for Significance of R
The formula for the F test is

F = (R² / k) / [(1 − R²) / (n − k − 1)]

where
n = the number of data groups
k = the number of independent variables.
d.f.N. = n − k
d.f.D. = n − k − 1
Bluman, Chapter 10 84
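A brief Python sketch of this F test (added for illustration), using n = 5, k = 2, and R² = 0.978 from Example 10-16; scipy.stats.f.ppf stands in for Table H with the degrees of freedom used on these slides:

from scipy.stats import f

n, k = 5, 2
R2 = 0.978                                     # R² = 0.989², rounded as on the slide
F = (R2 / k) / ((1 - R2) / (n - k - 1))        # about 44.45
critical = f.ppf(0.95, n - k, n - k - 1)       # d.f.N. = 3, d.f.D. = 2: about 19.16
print(round(F, 2), "reject H0" if F > critical else "do not reject H0")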
Chapter 10
Correlation and Regression

Section 10-4
Example 10-16
Page #577
Bluman, Chapter 10 85
Example 10-16: State Board Scores
Test the significance of the R obtained in Example 10-15 at α = 0.05.

F = (R² / k) / [(1 − R²) / (n − k − 1)]

F = (0.978 / 2) / [(1 − 0.978) / (5 − 2 − 1)] = 44.45

The critical value obtained from Table H with α = 0.05, d.f.N. = 3, and d.f.D. = 2 is 19.16. Hence, the decision is to reject the null hypothesis and conclude that there is a significant relationship among the student's GPA, age, and score on the nursing state board examination.

Bluman, Chapter 10 86
Adjusted R²
The formula for the adjusted R² is

R²_adj = 1 − [(1 − R²)(n − 1)] / (n − k − 1)

Bluman, Chapter 10 87
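A one-function Python sketch of the adjusted R² (an added illustration); with R = 0.989, n = 5, and k = 2 from Example 10-17 it gives 0.956:

def adjusted_r2(R, n, k):
    """Adjusted multiple coefficient of determination."""
    return 1 - (1 - R ** 2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.989, 5, 2), 3))   # 0.956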
Chapter 10
Correlation and Regression

Section 10-4
Example 10-17
Page #578
Bluman, Chapter 10 88
Example 10-17: State Board Scores
Calculate the adjusted R² for the data in Example 10-16. The value for R is 0.989.

R²_adj = 1 − [(1 − R²)(n − 1)] / (n − k − 1)

R²_adj = 1 − [(1 − 0.989²)(5 − 1)] / (5 − 2 − 1) = 0.956

In this case, when the number of data pairs and the number of independent variables are accounted for, the adjusted multiple coefficient of determination is 0.956.

Bluman, Chapter 10 89
