CHAPTER 5
REGRESSION ANALYSIS
Topic Outline:
1. Simple Linear Regression Analysis
2. Multiple Linear Regression Analysis
3. Multicollinearity
Learning Outcomes:
After careful study of this chapter, students should be able to do the following:
1. define regression analysis;
2. discuss when to use regression analysis;
3. illustrate the SLR model;
4. discuss the assumptions in SLR;
5. perform SLR analysis using SPSS;
6. estimate and interpret regression coefficients;
7. illustrate the MLR model;
8. discuss the assumptions in MLR;
9. perform and discuss the results of the different MLR procedures using SPSS; and
10. define multicollinearity.
STAT 201: Statistical Methods I
Prepared by:
Prof. Jeanne Valerie Agbayani-Agpaoa
Dr. Virgilio Julius P. Manzano, Jr.
Engr. Lawrence John C. Tagata
INTRODUCTION
Regression analysis is a way to find trends in data. For example, you might guess that there’s
a connection between how much you eat and how much you weigh; regression analysis can help you
quantify that.
Regression analysis will provide you with an equation for a graph so that you can make
predictions about your data. For example, if you’ve been putting on weight over the last few years, it
can predict how much you’ll weigh in ten years' time if you continue to put on weight at the same rate.
It will also give you a slew of statistics (including a p-value and a correlation coefficient) to tell you
how accurate your model is. Most elementary stats courses cover very basic techniques, like
making scatter plots and performing linear regression. However, you may come across more advanced
techniques like multiple regression.
A reasonable form of a relationship between the response Y and the predictor/regressor (the independent
variable) x is the linear relationship

$$Y = \beta_0 + \beta_1 x$$
The concept of regression analysis deals with finding the best relationship between Y and x,
quantifying the strength of that relationship, and using methods that allow for prediction of the response
values given values of the predictor/regressor, x. In other words, regression analysis applies to
situations in which the relationships among variables are not deterministic or not exact.
More often than not, the models that are simplifications of more complicated and unknown
structures are linear in nature (i.e., linear in the parameters β0 and β1 or, in the case of the model
involving the price, size, and age of the house, linear in the parameters β0, β1, and β2). These linear
structures are simple and empirical in nature and are thus called empirical models.
An analysis of the relationship between Y and x requires the statement of a statistical model.
A model is often used by a statistician as a representation of an ideal that essentially defines how we
perceive that the data were generated by the system in question. The model must include the set {(xi,
yi); i = 1, 2, . . . , n} of data involving n pairs of (x, y) values. One must bear in mind that the value yi
depends on xi via a linear structure that also has the random component involved.
The fitted line is an estimate of the true regression line. We expect the fitted line to be
closer to the true regression line when a large amount of data is available.
Estimating the Regression Coefficients: Given the sample {(xi, yi); i = 1, 2, . . . , n}, the least squares estimates b0 and b1 of the regression coefficients β0 and β1 are computed from the formulas

$$b_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

and

$$b_0 = \frac{\sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i}{n} = \bar{y} - b_1\bar{x}$$
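As a quick illustration of these formulas, here is a minimal Python sketch that computes b1 and b0 from the summation form; the five (x, y) pairs are made-up values, not data from this chapter:

```python
# Minimal sketch: least squares estimates for simple linear regression.
# The (x, y) pairs below are made-up illustrative values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# b1 = [n*sum(xy) - sum(x)*sum(y)] / [n*sum(x^2) - (sum(x))^2]
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# b0 = y-bar - b1 * x-bar
b0 = sum_y / n - b1 * (sum_x / n)

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")  # fitted line: y-hat = b0 + b1*x
```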
Practical Application
Example 01:
In statistics, it’s hard to stare at a set of random numbers in a table and try to make any sense of it. For
example, global warming may be reducing average snowfall in your town and you are asked to predict
how much snow you think will fall this year. Looking at the following table you might guess somewhere
around 10-20 inches. That’s a good guess, but you could make a better guess, by using regression.
Year    Amount (inches)
2000    40
2001    39
2002    41
2003    29
2004    32
2005    30
2006    33
2007    15
2008    10
2009    11
2010    20
2011    24
2012    10
2013    15

[Figure: Scatter plot of snowfall amount (inches) versus year with fitted trend line y = -1.4622x + 2959.9.]
Essentially, regression is the “best guess” at using a set of data to make some kind of prediction.
It’s fitting a set of points to a graph. There’s a whole host of tools that can run regression for you,
including Excel, which I used here to help make sense of that snowfall data:
Just by looking at the regression line running down through the data, you can fine tune your
best guess a bit. You can see that the original guess (20 inches or so) was way off. For 2015, it looks
like the line will be somewhere between 5 and 10 inches! That might be “good enough”, but regression
also gives you a useful equation, which for this chart is: 𝑦 = −1.4622𝑥 + 2959.9.
What that means is you can plug in an x value (the year) and get a pretty good estimate of
snowfall for any year. For example, 2005: 𝑦 = −1.4622(2005) + 2959.9 = 28.189 inches, which
is pretty close to the actual figure of 30 inches for that year.
Best of all, you can use the equation to make predictions. For example, how much snow will
fall in 2017? 𝑦 = −1.4622(2017) + 2959.9 = 10.6426 inches.
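If you want to script these plug-in calculations instead of doing them by hand, a tiny sketch using the chart's fitted equation looks like this:

```python
# Evaluate the fitted snowfall trend line from the chart: y = -1.4622*x + 2959.9
def predicted_snowfall(year):
    return -1.4622 * year + 2959.9

for year in (2005, 2017):
    print(year, round(predicted_snowfall(year), 4))
# 2005 -> 28.189 inches and 2017 -> 10.6426 inches, as computed in the text.
```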
Regression also gives you an R squared value, which for this graph is 0.49. This number tells
you how good your model is. The values range from 0 to 1, with 0 being a terrible model and 1 being a
perfect model. As you can probably see, 0.49 is not a particularly good model, so you cannot be very confident in
your weather prediction!
Example 02:
A statistics instructor at a large western university would like to examine the relationship
(if any) between the number of optional homework problems students do during the
semester and their final course grade. She randomly selects 12 students for the study and
asks them to keep track of the number of these problems completed during the course of the
semester. At the end of the class, each student's total is recorded along with their final grade.

Table 1: Course grade versus the number of optional homework problems completed.

Problems    Course Grade
51          62
58          68
62          66
65          66
68          67
76          72
77          73
78          72
78          78
84          73
85          76
91          75
5. Test the hypothesis for a significant linear correlation at 0.05 level of significance.
• Statement of hypotheses:
Ho: There is no significant relationship between the number of optional homework
problems students do during the semester and their final course grade.
Ha: There is a significant relationship between the number of optional homework
problems students do during the semester and their final course grade.
• Since the computed value of t is greater than the critical value, reject Ho.
• At a significance level of 0.05 we can conclude that there is a significant linear correlation
between the number of homework assignments and a student’s final grade. Furthermore, we
can conclude that this correlation is moderately positive.
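The SPSS computations behind this decision are not reproduced above, but the standard test statistic for the significance of a correlation coefficient r (equivalent to the t-test on the slope) is

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}},$$

which is compared with the critical value of the t-distribution with n − 2 degrees of freedom.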
6. Determine the valid prediction range. The valid prediction range is the range of the
"predictor" variable; in this case it is from 51 to 91.
7. Determine the regression equation: Grade = 44.8 + 0.355(Problems).
8. Use the regression equation to predict a student's final course grade if 75 optional
homework assignments are done:
Grade = 44.8 + 0.355(75) = 71.4
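For readers who want to verify these numbers in code, here is a short Python sketch using scipy; for the data in Table 1 it should return a slope of about 0.355 and an intercept of about 44.8 (small differences from the rounded values above are expected):

```python
from scipy import stats

# Data from Table 1: optional homework problems completed (x) vs. course grade (y)
problems = [51, 58, 62, 65, 68, 76, 77, 78, 78, 84, 85, 91]
grade    = [62, 68, 66, 66, 67, 72, 73, 72, 78, 73, 76, 75]

result = stats.linregress(problems, grade)
print(f"slope     = {result.slope:.3f}")      # roughly 0.355
print(f"intercept = {result.intercept:.1f}")  # roughly 44.8
print(f"r         = {result.rvalue:.3f}")
print(f"p-value   = {result.pvalue:.4f}")     # compare with alpha = 0.05

# Predicted grade for a student who completes 75 optional problems
print(f"predicted grade for 75 problems = {result.intercept + result.slope * 75:.1f}")
```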
SPSS Outputs
[SPSS output tables for this example: Descriptive Statistics, ANOVA, and Coefficients.]
USE OF T-TESTS

Confidence Interval for the Slope, β1: A 100(1 − α)% confidence interval for the parameter β1 in the regression line μY|x = β0 + β1x is

$$b_1 - t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}} \;<\; \beta_1 \;<\; b_1 + t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$

Hypothesis Testing on the Slope, β1: To test the null hypothesis H0 that β1 = β10 against a suitable alternative, we again use the t-distribution with n − 2 degrees of freedom to establish a critical region and then base our decision on the value of

$$t = \frac{b_1 - \beta_{10}}{s/\sqrt{S_{xx}}}$$

Confidence Interval for the Intercept, β0: A 100(1 − α)% confidence interval for the parameter β0 in the regression line μY|x = β0 + β1x is

$$b_0 - t_{\alpha/2}\,\frac{s}{\sqrt{nS_{xx}}}\sqrt{\sum_{i=1}^{n} x_i^2} \;<\; \beta_0 \;<\; b_0 + t_{\alpha/2}\,\frac{s}{\sqrt{nS_{xx}}}\sqrt{\sum_{i=1}^{n} x_i^2}$$

Statistical Inference on the Intercept, β0: To test the null hypothesis H0 that β0 = β00 against a suitable alternative, we can use the t-distribution with n − 2 degrees of freedom to establish a critical region and then base our decision on the value of

$$t = \frac{b_0 - \beta_{00}}{s\sqrt{\sum_{i=1}^{n} x_i^2/(nS_{xx})}}$$
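If you want to compute these intervals and tests by hand rather than reading them off SPSS output, here is a minimal Python sketch. It reuses the homework/grade data from Example 02 and takes s as the usual estimate of σ, with s² = SSE/(n − 2); the resulting numbers are illustrative, not quoted from the text:

```python
import math
from scipy import stats

# Example data (from Table 1): x = problems completed, y = course grade
x = [51, 58, 62, 65, 68, 76, 77, 78, 78, 84, 85, 91]
y = [62, 68, 66, 66, 67, 72, 73, 72, 78, 73, 76, 75]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx
b0 = y_bar - b1 * x_bar

# s^2 = SSE / (n - 2), where SSE is the sum of squared residuals
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)

# 95% confidence interval for the slope beta1
half_width = t_crit * s / math.sqrt(Sxx)
print("CI for beta1:", (b1 - half_width, b1 + half_width))

# t statistic for H0: beta1 = 0
t_stat = (b1 - 0) / (s / math.sqrt(Sxx))
print("t =", t_stat)
```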
When the null hypothesis is rejected, that is, when the computed F-statistic exceeds the critical value
fα(1, n − 2), we conclude that there is a significant amount of variation in the response accounted
for by the postulated model, the straight-line function. If the F-statistic is in the fail to reject region,
we conclude that the data did not reflect sufficient evidence to support the model postulated.
Confidence Interval for μY|x0: A 100(1 − α)% confidence interval for the mean response μY|x0 is

$$\hat{y}_0 - t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}} \;<\; \mu_{Y|x_0} \;<\; \hat{y}_0 + t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}$$
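Continuing with the same example data, here is a short sketch of this mean-response interval at x0 = 75 (again, the numbers it produces are illustrative and not taken from the SPSS output):

```python
import math
from scipy import stats

x = [51, 58, 62, 65, 68, 76, 77, 78, 78, 84, 85, 91]
y = [62, 68, 66, 66, 67, 72, 73, 72, 78, 73, 76, 75]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / Sxx
b0 = y_bar - b1 * x_bar
s = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 75
y0_hat = b0 + b1 * x0
t_crit = stats.t.ppf(0.975, n - 2)  # 95% confidence
margin = t_crit * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / Sxx)
print(f"95% CI for mean grade at x0=75: ({y0_hat - margin:.1f}, {y0_hat + margin:.1f})")
```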
What is an acceptable value for R2? This is a difficult question to answer. A chemist, charged
with doing a linear calibration of a high-precision piece of equipment, certainly expects to experience
a very high R2-value (perhaps exceeding 0.99), while a behavioral scientist, dealing in data impacted
by variability in human behavior, may feel fortunate to experience an R2 as large as 0.70. An
experienced model fitter senses when a value is large enough, given the situation confronted. Clearly,
some scientific phenomena lend themselves to modeling with more precision than others.
The R2 criterion is dangerous to use for comparing competing models for the same data set.
Adding additional terms to the model (e.g., an additional regressor) decreases SSE and thus increases
R2 (or at least does not decrease it). This implies that R2 can be made artificially high by an unwise
practice of overfitting (i.e., the inclusion of too many model terms). Thus, the inevitable increase in R2
enjoyed by adding an additional term does not imply the additional term was needed. In fact, the simple
model may be superior for predicting response values.
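One way to see the "adding a term never decreases R²" point concretely is a small simulation: fit a model by least squares, then add a pure-noise regressor and refit. The sketch below uses made-up data and numpy's least squares routine; it illustrates the behaviour described above, not an analysis from this chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y depends on x1 only, plus noise
n = 30
x1 = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x1 + rng.normal(0, 1, n)

def r_squared(X, y):
    """Fit y = X b by least squares and return R^2."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

ones = np.ones(n)
noise_regressor = rng.normal(0, 1, n)   # unrelated to y

r2_simple = r_squared(np.column_stack([ones, x1]), y)
r2_padded = r_squared(np.column_stack([ones, x1, noise_regressor]), y)

print(f"R^2 with x1 only        : {r2_simple:.4f}")
print(f"R^2 with a noise column : {r2_padded:.4f}  (never smaller)")
```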
Temperature, x 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
Converted Sugar, y 8.1 7.8 8.5 9.8 9.5 8.9 8.6 10.2 9.3 9.2 10.5
2. The grades of a class of 9 students on a midterm report (x) and on the final examination (y) are
as follows:
x 77 50 71 72 81 94 96 99 67
y 82 66 78 34 47 85 99 99 68
In one-variable linear regression, you would input one dependent variable (e.g., “sales”) against
an independent variable (e.g., “profit”). But you might be interested in how different types of sales
affect the regression. You could set your X1 as one type of sales, your X2 as another type of sales, and
so on.
From this graph it might appear there is a relationship between life-expectancy of women and
the number of doctors in the population. In fact, that’s probably true and you could say it’s a simple fix:
put more doctors into the population to increase life expectancy. But the reality is you would have to
look at other factors like the possibility that doctors in rural areas might have less education or
experience. Or perhaps they have a lack of access to medical facilities like trauma centers.
The addition of those extra factors would cause you to add additional independent (predictor) variables to
your regression analysis and create a multiple regression analysis model.
The output would include a summary, similar to a summary for simple linear regression, that
includes:
• R (the multiple correlation coefficient),
These statistics help you figure out how well a regression model fits the data.
The ANOVA table in the output would give you the p-value and F-statistic.
where 𝜖𝑖 and ei are the random error and residual, respectively, associated with
the response 𝒚𝒊 and fitted value 𝒚̂𝒊 .
Normal Estimation Equations for Multiple Linear Regression:

$$n b_0 + b_1\sum_{i=1}^{n} x_{1i} + b_2\sum_{i=1}^{n} x_{2i} + \cdots + b_k\sum_{i=1}^{n} x_{ki} = \sum_{i=1}^{n} y_i$$

$$b_0\sum_{i=1}^{n} x_{1i} + b_1\sum_{i=1}^{n} x_{1i}^2 + b_2\sum_{i=1}^{n} x_{1i}x_{2i} + \cdots + b_k\sum_{i=1}^{n} x_{1i}x_{ki} = \sum_{i=1}^{n} x_{1i}y_i$$

$$\vdots$$

$$b_0\sum_{i=1}^{n} x_{ki} + b_1\sum_{i=1}^{n} x_{ki}x_{1i} + b_2\sum_{i=1}^{n} x_{ki}x_{2i} + \cdots + b_k\sum_{i=1}^{n} x_{ki}^2 = \sum_{i=1}^{n} x_{ki}y_i$$

These equations can be solved for b0, b1, b2, . . . , bk by any appropriate method for solving systems of linear equations.
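In matrix form, these normal equations are (X′X)b = X′y, so they can be solved with any linear-algebra routine. A minimal numpy sketch with a small hypothetical data set:

```python
import numpy as np

# Hypothetical data: two regressors x1, x2 and a response y
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 3.9, 7.0, 7.8, 10.1])

# Design matrix with a leading column of ones for b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Normal equations: (X'X) b = X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
print("b0, b1, b2 =", b)
```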
Example: A study was done on a diesel-powered light-duty pickup truck to see if humidity,
air temperature, and barometric pressure influence emission of nitrous oxide (in
ppm). Emission measurements were taken at different times, with varying
experimental conditions. The data are given in the table below. The model is
$$\mu_{Y|x_1,x_2,x_3} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$$

or, equivalently,

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \epsilon_i, \qquad i = 1, 2, \ldots, 20.$$
Fit this multiple linear regression model to the given data and then estimate the
amount of nitrous oxide emitted for the conditions where humidity is 50%,
temperature is 76◦F, and barometric pressure is 29.30.
The solution of the set of estimating equations yields the unique estimates
b0 = −3.507778,
b1 = −0.002625,
b2 = 0.000799,
b3 = 0.154155.
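Substituting the stated conditions into this fitted equation (taking x1, x2, x3 as humidity, temperature, and barometric pressure, in the order listed in the problem) gives the estimated nitrous oxide emission:

$$\hat{y} = -3.507778 - 0.002625(50) + 0.000799(76) + 0.154155(29.30) \approx 0.94 \text{ ppm}$$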
where r is the degree of the polynomial, and ϵi and ei are again the random error and
residual associated with the response yi and fitted value ŷi, respectively. Here, the
number of pairs, n, must be at least as large as r + 1, the number of parameters to
be estimated.
Example 1: Fit a second-degree polynomial to the data below.

x   0    1    2    3    4    5    6    7    8    9
y   9.1  7.3  3.2  4.6  4.8  2.9  5.7  7.1  8.8  10.2

For these data, the least squares estimate of the quadratic coefficient is b2 = 0.288.
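A quick sketch checking this polynomial fit with numpy; the quadratic coefficient should come out close to the 0.288 quoted above (the full coefficient set in the final comment was computed directly from these data):

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([9.1, 7.3, 3.2, 4.6, 4.8, 2.9, 5.7, 7.1, 8.8, 10.2])

# Fit y = b0 + b1*x + b2*x^2 by least squares
b2, b1, b0 = np.polyfit(x, y, deg=2)   # polyfit returns highest power first
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
# Expected to be close to b0 = 8.698, b1 = -2.341, b2 = 0.288
```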
Example 2: The data in the table below represents the percent of impurities that resulted for
various temperatures and sterilizing times during a reaction associated with the
manufacturing of a certain beverage. Estimate the regression coefficients in the
polynomial model
for i = 1, 2, . . . , 18.
Ho: 𝛽1 = 𝛽2 = ⋯ = 𝛽𝑘 = 0
Ha: 𝛽𝑗 ≠ 0 for at least one j
Because SSE/(n − p) is the error or residual mean square and SST/(n − 1) is a constant, 𝑹𝟐𝒂𝒅𝒋 will only increase when a
variable is added to the model if the new variable reduces the error mean square. The adjusted R2
statistic essentially penalizes the analyst for adding terms to the model. It is an easy way to guard against
overfitting, that is, including regressors that are not really useful. Consequently, it is very useful in
comparing and evaluating competing regression models.
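For reference, the adjusted R² statistic referred to above is defined as

$$R^2_{\text{adj}} = 1 - \frac{SSE/(n-p)}{SST/(n-1)},$$

where n is the number of observations and p the number of parameters in the model.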
Prediction Interval: A 100(1 − α)% prediction interval for a single future response y0 at the point x0 is

$$\hat{y}_0 - t_{\alpha/2,\,n-p}\sqrt{\hat{\sigma}^2\left(1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\right)} \;\le\; y_0 \;\le\; \hat{y}_0 + t_{\alpha/2,\,n-p}\sqrt{\hat{\sigma}^2\left(1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\right)}$$
TOPIC 3: MULTICOLLINEARITY
Multicollinearity occurs when two or more predictor (independent) variables in a regression model are
highly correlated with one another. An easy way to detect multicollinearity is to calculate correlation coefficients for all pairs of
predictor variables. If the correlation coefficient, r, is exactly +1 or -1, this is called perfect
multicollinearity. If r is close to or exactly -1 or +1, one of the variables should be removed from the
model if at all possible.
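A minimal sketch of this pairwise-correlation check in Python; the three predictor columns are made-up, with x3 deliberately constructed to be nearly identical to x1:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up predictors: x3 is deliberately almost a copy of x1
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = x1 + rng.normal(scale=0.05, size=50)

# Pairwise correlation matrix of the predictors
predictors = np.column_stack([x1, x2, x3])
corr = np.corrcoef(predictors, rowvar=False)
print(np.round(corr, 2))
# An off-diagonal entry close to +1 or -1 (here, between x1 and x3)
# flags a potential multicollinearity problem.
```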
It’s more common for multicollinearity to rear its ugly head in observational studies; it’s less
common with experimental data. When the condition is present, it can result in unstable and unreliable
regression estimates. Several other problems can interfere with analysis of results, including:
• The t-statistic will generally be very small and coefficient confidence intervals will be
very wide. This means that it is harder to reject the null hypothesis.
• The partial regression coefficient may be an imprecise estimate; standard errors may be
very large.
•	Partial regression coefficients may change in sign and/or magnitude from sample
to sample.
• Multicollinearity makes it difficult to gauge the effect of independent
variables on dependent variables.
There are two main types of multicollinearity:
•	Data-based multicollinearity: in some cases, variables may be highly correlated (usually due to collecting data from purely
observational studies) and there is no error on the researcher’s part. For this reason, you
should conduct experiments whenever possible, setting the level of the predictor variables
in advance.
•	Structural multicollinearity: caused by you, the researcher, creating new predictor
variables.
References:
• D.C. Montgomery and G.C. Runger, Applied Statistics and Probability for Engineers, 5th
Edition, John Wiley & Sons, Inc., 2011.
•	R.E. Walpole, R.H. Myers, S.L. Myers and K. Ye, Probability and Statistics for Engineers
and Scientists, 9th Edition, Pearson International Edition, 2012.
• Zulueta, F. M. and Nestor Edilberto B. Costales, Jr. (2005). Methods of Research: Thesis
Writing and Applied Statistics. Mandaluyong City: National Bookstore, Inc.
• https://www.statisticshowto.com/probability-and-statistics/regression-analysis/
• https://www.statisticshowto.com/multicollinearity/