Chapter 5 Regression Analysis


CHAPTER 5:

REGRESSION ANALYSIS
Topic Outline:
1. Simple Linear Regression Analysis
2. Multiple Linear Regression Analysis
3. Multicollinearity

Learning Outcomes:
After careful study of this chapter, students should be able to do the following:
1. define regression analysis;
2. discuss when to use regression analysis;
3. illustrate the SLR model;
4. discuss the assumptions in SLR;
5. perform SLR analysis using SPSS;
6. estimate and interpret the regression coefficients;
7. illustrate the MLR model;
8. discuss the assumptions in MLR;
9. perform and discuss the results of the different MLR procedures using SPSS; and
10. define multicollinearity.

Prepared by:
Prof. Jeanne Valerie Agbayani-Agpaoa
STAT 201: Statistical Methods I
Dr. Virgilio Julius P. Manzano, Jr.
Engr. Lawrence John C. Tagata

INTRODUCTION

Regression analysis is a way to find trends in data. For example, you might guess that there’s
a connection between how much you eat and how much you weigh; regression analysis can help you
quantify that.

Regression analysis will provide you with an equation for a graph so that you can make
predictions about your data. For example, if you’ve been putting on weight over the last few years, it
can predict how much you'll weigh in ten years' time if you continue to put on weight at the same rate.
It will also give you a slew of statistics (including a p-value and a correlation coefficient) to tell you
how accurate your model is. Most elementary stats courses cover very basic techniques, like
making scatter plots and performing linear regression. However, you may come across more advanced
techniques like multiple regression.

TOPIC 1: SIMPLE LINEAR REGRESSION ANALYSIS

A reasonable form of a relationship between the response Y and the predictor or regressor (the natural independent variable) x is the linear relationship

Y = \beta_0 + \beta_1 x

where β0 is the intercept and β1 is the slope.

The concept of regression analysis deals with finding the best relationship between Y and x,
quantifying the strength of that relationship, and using methods that allow for prediction of the response
values given values of the predictor/regressor, x. In other words, regression analysis applies to
situations in which the relationships among variables are not deterministic or not exact.

More often than not, the models that are simplifications of more complicated and unknown
structures are linear in nature (i.e., linear in the parameters β0 and β1 or, in the case of the model
involving the price, size, and age of the house, linear in the parameters β0, β1, and β2). These linear
structures are simple and empirical in nature and are thus called empirical models.

An analysis of the relationship between Y and x requires the statement of a statistical model.
A model is often used by a statistician as a representation of an ideal that essentially defines how we
perceive that the data were generated by the system in question. The model must include the set {(xi,
yi); i = 1, 2, . . . , n} of data involving n pairs of (x, y) values. One must bear in mind that the value yi
depends on xi via a linear structure that also has the random component involved.

Simple Linear Regression Model: The statistical model for simple linear regression is

Y = \beta_0 + \beta_1 x + \epsilon

where β0 and β1 are unknown intercept and slope parameters, respectively, and ε is the random error or random disturbance.

The Fitted Regression Line


An important aspect of regression analysis is, very simply, to estimate the parameters β0 and β1
(i.e., estimate the so-called regression coefficients). Suppose we denote the estimates b0 for β0 and b1
for β1. Then the estimated or fitted regression line is given by
𝒚̂ = 𝒃𝟎 + 𝒃𝟏 𝒙

where 𝑦̂ is the predicted or fitted value.

The fitted line is an estimate of the true regression line. We expect the fitted line to be closer to the true regression line when a large amount of data is available.

Estimating the Regression Coefficients: Given the sample {(xi, yi); i = 1, 2, . . . , n}, the least squares estimates b0 and b1 of the regression coefficients β0 and β1 are computed from the formulas

b_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

and

b_0 = \frac{\sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i}{n} = \bar{y} - b_1 \bar{x}
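To make the formulas concrete, here is a minimal Python sketch (with a small made-up data set) that computes b1 and b0 exactly as defined above:

```python
import numpy as np

# small made-up sample of paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# slope: b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# intercept: b0 = ybar - b1 * xbar
b0 = y.mean() - b1 * x.mean()

print(f"fitted line: y-hat = {b0:.3f} + {b1:.3f} x")
```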

Practical Application
Example 01:
In statistics, it's hard to stare at a set of random numbers in a table and try to make any sense of them. For example, global warming may be reducing average snowfall in your town, and you are asked to predict how much snow will fall this year. Looking at the following table, you might guess somewhere around 10-20 inches. That's a good guess, but you could make a better one by using regression.

Year    Amount (inches)
2000    40
2001    39
2002    41
2003    29
2004    32
2005    30
2006    33
2007    15
2008    10
2009    11
2010    20
2011    24
2012    10
2013    15

[Scatter plot of snowfall versus year with the fitted trendline y = -1.4622x + 2959.9]

Essentially, regression is the “best guess” at using a set of data to make some kind of prediction.
It’s fitting a set of points to a graph. There’s a whole host of tools that can run regression for you,
including Excel, which I used here to help make sense of that snowfall data:

Just by looking at the regression line running down through the data, you can fine-tune your best guess a bit. You can see that the original guess (20 inches or so) was way off. For 2015, it looks like the line will be somewhere between 10 and 15 inches! That might be “good enough”, but regression
also gives you a useful equation, which for this chart is: 𝑦 = −1.4622𝑥 + 2959.9.

What that means is you can plug in an x value (the year) and get a pretty good estimate of
snowfall for any year. For example, 2005: 𝑦 = −1.4622(2005) + 2959.9 = 28.189 inches, which
is pretty close to the actual figure of 30 inches for that year.


Best of all, you can use the equation to make predictions. For example, how much snow will
fall in 2017? 𝑦 = −1.4622(2017) + 2959.9 = 10.6426 inches.

Regression also gives you an R squared value, which for this graph is 0.49. This number tells
you how good your model is. The values range from 0 to 1, with 0 being a terrible model and 1 being a
perfect model. As you can probably see, 0.49 is not a particularly good model, so you cannot be very confident in your weather prediction!
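As an alternative to Excel, here is a minimal Python sketch that fits the same kind of straight-line trend to the year and amount columns above and computes R² (the variable names are just illustrative):

```python
import numpy as np

years = np.array([2000, 2001, 2002, 2003, 2004, 2005, 2006,
                  2007, 2008, 2009, 2010, 2011, 2012, 2013], dtype=float)
snow = np.array([40, 39, 41, 29, 32, 30, 33,
                 15, 10, 11, 20, 24, 10, 15], dtype=float)

# degree-1 polynomial fit: slope and intercept of the trendline
slope, intercept = np.polyfit(years, snow, 1)

# R^2 = 1 - SSE/SST for the fitted line
fitted = slope * years + intercept
sse = np.sum((snow - fitted) ** 2)
sst = np.sum((snow - snow.mean()) ** 2)
r_squared = 1 - sse / sst

print(f"trendline: y = {slope:.4f}x + {intercept:.1f}")
print(f"R^2 = {r_squared:.2f}")
```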

Example 02:
A statistics instructor at a large western Table 1: Course grade versus the number of
university would like to examine the relationship optional homework problems completed.
(if any) between the number of optional Problems Course Grade
homework problems students do during the
51 62
semester and their final course grade. She
58 68
randomly selects 12 students for study and asks
62 66
them to keep track of the number of these
problems completed during the course of the 65 66
semester. At the end of the class each student’s 68 67
total is recorded along with their final grade. 76 72
77 73
78 72
78 78
84 73
85 76
91 75

1. Identify the response variable. Course Grade

2. Identify the predictor variable. Number of optional homework problems completed

3. Compute the linear correlation coefficient, r. r = 0.885

4. Classify the direction and strength of the correlation. Moderate positive

5. Test the hypothesis for a significant linear correlation at 0.05 level of significance.

• Statement of hypotheses:
Ho: There is no significant relationship between the number of optional homework
problems students do during the semester and their final course grade.
Ha: There is a significant relationship between the number of optional homework
problems students do during the semester and their final course grade.
• At 0.05 level of significance, df = 10; tcrit = ± 2.228


• t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{0.885}{\sqrt{\frac{1 - 0.885^2}{12 - 2}}} = 6.01

• Since the computed value of t is greater than the critical value, reject Ho.

• At a significance level of 0.05 we can conclude that there is a significant linear correlation
between the number of homework assignments and a student’s final grade. Furthermore, we
can conclude that this correlation is moderately positive.


6. Determine the valid prediction range. The valid prediction range is the range of the "predictor" variable; in this case it is from 51 to 91.

7. Determine the regression equation. Course Grade = 44.8 + (0.355 × Problems Completed)

8. Use the regression equation to predict a student's final course grade if 75 optional homework assignments are done. Grade = 44.8 + 0.355(75) = 71.4

9. Use the regression equation to compute the number of optional homework assignments that need to be completed if a student expects an 85. 85 = 44.827 + 0.355(x), so x ≈ 113. This value is out of the prediction range, so we have no confidence in it.
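The same computations (the correlation, its t test, and the fitted line used in items 7-9) can be reproduced with a short Python sketch; the scipy calls shown here are one common way to do it:

```python
import numpy as np
from scipy import stats

problems = np.array([51, 58, 62, 65, 68, 76, 77, 78, 78, 84, 85, 91], dtype=float)
grade    = np.array([62, 68, 66, 66, 67, 72, 73, 72, 78, 73, 76, 75], dtype=float)
n = len(problems)

# correlation and the t statistic with n - 2 degrees of freedom
r, p_two_sided = stats.pearsonr(problems, grade)
t_stat = r / np.sqrt((1 - r**2) / (n - 2))
print(f"r = {r:.3f}, t = {t_stat:.2f}")      # about r = 0.885, t = 6.01

# least squares line and the prediction at 75 completed problems
fit = stats.linregress(problems, grade)
print(f"Grade-hat = {fit.intercept:.3f} + {fit.slope:.3f} * Problems")
print(f"predicted grade at 75 problems: {fit.intercept + fit.slope * 75:.1f}")
```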

SPSS Outputs

Descriptive Statistics
                Mean      Std. Deviation   N
Course Grade    70.6667   4.81160          12
Problems        72.7500   11.99337         12

Correlations (Course Grade with Problems)
Pearson Correlation = 0.885,  Sig. (1-tailed) = .000,  N = 12

ANOVA(a)
Model         Sum of Squares   df   Mean Square   F        Sig.
Regression    199.617          1    199.617       36.261   .000(b)
Residual      55.050           10   5.505
Total         254.667          11
a. Dependent Variable: Course Grade
b. Predictors: (Constant), Problems

Coefficients(a)
Model         B        Std. Error   Beta    t        Sig.
(Constant)    44.827   4.344                10.319   .000
Problems      .355     .059         .885    6.022    .000
a. Dependent Variable: Course Grade
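These tables can also be reproduced outside SPSS; for example, a minimal statsmodels sketch in Python gives the same coefficient table and ANOVA F statistic:

```python
import numpy as np
import statsmodels.api as sm

problems = np.array([51, 58, 62, 65, 68, 76, 77, 78, 78, 84, 85, 91], dtype=float)
grade    = np.array([62, 68, 66, 66, 67, 72, 73, 72, 78, 73, 76, 75], dtype=float)

X = sm.add_constant(problems)            # adds the intercept column
model = sm.OLS(grade, X).fit()

print(model.summary())                   # coefficients, t, Sig., R-squared
print("F =", round(model.fvalue, 3))     # about 36.26, as in the ANOVA table
print("SSR =", round(model.ess, 3),      # regression (explained) sum of squares
      "SSE =", round(model.ssr, 3))      # residual sum of squares
```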



HYPOTHESIS TESTS IN SIMPLE LINEAR REGRESSION


Aside from merely estimating the linear relationship between x and Y for purposes of prediction,
the experimenter may also be interested in drawing certain inferences about the slope and intercept. In
order to allow for the testing of hypotheses and the construction of confidence intervals on β0 and β1,
one must be willing to make the further assumption that each 𝝐i, i = 1, 2, . . . , n, is normally distributed.
This assumption implies that Y1, Y2, . . . , Yn are also normally distributed, each with probability
distribution n(yi; β0 + β1xi, σ).

USE OF T-TESTS

Confidence Interval for the Slope, β1: A 100(1 − α)% confidence interval for the parameter β1 in the regression line μY|x = β0 + β1x is

b_1 - t_{\alpha/2} \frac{s}{\sqrt{S_{xx}}} < \beta_1 < b_1 + t_{\alpha/2} \frac{s}{\sqrt{S_{xx}}}

where t_{α/2} is a value of the t-distribution with n − 2 degrees of freedom.

Hypothesis Testing on the Slope, β1: To test the null hypothesis H0 that β1 = β10 against a suitable alternative, we again use the t-distribution with n − 2 degrees of freedom to establish a critical region and then base our decision on the value of

t = \frac{b_1 - \beta_{10}}{s / \sqrt{S_{xx}}}

Confidence Interval for the Intercept, β0: A 100(1 − α)% confidence interval for the parameter β0 in the regression line μY|x = β0 + β1x is

b_0 - t_{\alpha/2} \frac{s}{\sqrt{n S_{xx}}} \sqrt{\sum_{i=1}^{n} x_i^2} < \beta_0 < b_0 + t_{\alpha/2} \frac{s}{\sqrt{n S_{xx}}} \sqrt{\sum_{i=1}^{n} x_i^2}

where t_{α/2} is a value of the t-distribution with n − 2 degrees of freedom.

Statistical Inference on the Intercept, β0: To test the null hypothesis H0 that β0 = β00 against a suitable alternative, we can use the t-distribution with n − 2 degrees of freedom to establish a critical region and then base our decision on the value of

t = \frac{b_0 - \beta_{00}}{s \sqrt{\sum_{i=1}^{n} x_i^2 / (n S_{xx})}}
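As a sketch of how these interval and test formulas are applied (reusing the homework data from Example 02; the variable names are just illustrative):

```python
import numpy as np
from scipy import stats

x = np.array([51, 58, 62, 65, 68, 76, 77, 78, 78, 84, 85, 91], dtype=float)
y = np.array([62, 68, 66, 66, 67, 72, 73, 72, 78, 73, 76, 75], dtype=float)
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # estimate of sigma

t_crit = stats.t.ppf(0.975, df=n - 2)                      # alpha = 0.05, two-sided

# 95% confidence interval for the slope and the t statistic for H0: beta1 = 0
half_width = t_crit * s / np.sqrt(Sxx)
print(f"beta1 CI: ({b1 - half_width:.3f}, {b1 + half_width:.3f})")
print(f"t = {b1 / (s / np.sqrt(Sxx)):.2f}")                # about 6.02
```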

ANALYSIS OF VARIANCE APPROACH TO TEST SIGNIFICANCE OF REGRESSION


Often the problem of analyzing the quality of the estimated regression line is handled by an
analysis-of-variance (ANOVA) approach: a procedure whereby the total variation in the dependent
variable is subdivided into meaningful components that are then observed and treated in a systematic
fashion. The analysis of variance is a powerful resource that is used for many applications.

ANOVA for Testing β1 = 0

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square          Computed F
Regression            SSR              1                    SSR                  SSR / s^2
Error                 SSE              n − 2                s^2 = SSE / (n − 2)
Total                 SST              n − 1

When the null hypothesis is rejected, that is, when the computed F-statistic exceeds the critical value
fα(1, n − 2), we conclude that there is a significant amount of variation in the response accounted
for by the postulated model, the straight-line function. If the F-statistic is in the fail to reject region,
we conclude that the data did not reflect sufficient evidence to support the model postulated.
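A minimal sketch of this F test, again using the homework data (the numbers should agree with the SPSS ANOVA table shown earlier):

```python
import numpy as np
from scipy import stats

x = np.array([51, 58, 62, 65, 68, 76, 77, 78, 78, 84, 85, 91], dtype=float)
y = np.array([62, 68, 66, 66, 67, 72, 73, 72, 78, 73, 76, 75], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
ssr = sst - sse

f_stat = ssr / (sse / (n - 2))                 # SSR / s^2, with 1 and n - 2 df
f_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)   # critical value at alpha = 0.05
print(f"F = {f_stat:.2f}, critical value = {f_crit:.2f}")   # F is about 36.26
print("reject H0: beta1 = 0" if f_stat > f_crit else "fail to reject H0")
```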

PREDICTION OF NEW OBSERVATIONS


The equation 𝑦̂ = b0 + b1x may be used to predict or estimate the mean response μY |x0 at x = x0,
where x0 is not necessarily one of the pre-chosen values, or it may be used to predict a single value y0
of the variable Y0, when x = x0. We would expect the error of prediction to be higher in the case of a
single predicted value than in the case where a mean is predicted. This, then, will affect the width of
our intervals for the values being predicted.

Confidence Interval for the Mean Response μY|x0: A 100(1 − α)% confidence interval for the mean response μY|x0 is

\hat{y}_0 - t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} < \mu_{Y|x_0} < \hat{y}_0 + t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

where t_{α/2} is a value of the t-distribution with n − 2 degrees of freedom.

Statistic for the Mean Response μY|x0:

T = \frac{\hat{Y}_0 - \mu_{Y|x_0}}{s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}}

Prediction Interval for the Predicted Value y0: A 100(1 − α)% prediction interval for a single response y0 is given by

\hat{y}_0 - t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} < y_0 < \hat{y}_0 + t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

where t_{α/2} is a value of the t-distribution with n − 2 degrees of freedom.

Statistic for the Predicted Value y0:

T = \frac{\hat{Y}_0 - Y_0}{s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}}
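The two intervals can be computed directly from these formulas; a short sketch, reusing the homework data with x0 = 75:

```python
import numpy as np
from scipy import stats

x = np.array([51, 58, 62, 65, 68, 76, 77, 78, 78, 84, 85, 91], dtype=float)
y = np.array([62, 68, 66, 66, 67, 72, 73, 72, 78, 73, 76, 75], dtype=float)
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x0 = 75.0                                   # new value of the predictor
y0_hat = b0 + b1 * x0
t_crit = stats.t.ppf(0.975, df=n - 2)       # 95% intervals

se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)       # mean response
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)   # single response

print(f"CI for the mean response: ({y0_hat - t_crit * se_mean:.2f}, {y0_hat + t_crit * se_mean:.2f})")
print(f"prediction interval:      ({y0_hat - t_crit * se_pred:.2f}, {y0_hat + t_crit * se_pred:.2f})")
```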


ADEQUACY OF THE REGRESSION MODEL


Coefficient of Determination: The coefficient of determination is a measure of the proportion of variability explained by the fitted model,

R^2 = 1 - \frac{SSE}{SST}
The reliability of R2 is a function of the size of the regression data set and the type of application.
Clearly, 0 ≤ R2 ≤ 1 and the upper bound is achieved when the fit to the data is perfect (i.e., all of the
residuals are zero).

What is an acceptable value for R2? This is a difficult question to answer. A chemist, charged
with doing a linear calibration of a high-precision piece of equipment, certainly expects to experience
a very high R2-value (perhaps exceeding 0.99), while a behavioral scientist, dealing in data impacted
by variability in human behavior, may feel fortunate to experience an R2 as large as 0.70. An
experienced model fitter senses when a value is large enough, given the situation confronted. Clearly,
some scientific phenomena lend themselves to modeling with more precision than others.

The R2 criterion is dangerous to use for comparing competing models for the same data set.
Adding additional terms to the model (e.g., an additional regressor) decreases SSE and thus increases
R2 (or at least does not decrease it). This implies that R2 can be made artificially high by an unwise
practice of overfitting (i.e., the inclusion of too many model terms). Thus, the inevitable increase in R2
enjoyed by adding an additional term does not imply the additional term was needed. In fact, the simple
model may be superior for predicting response values.

ACTIVITY 12: SIMPLE LINEAR REGRESSION ANALYSIS


Solve the problems following the procedures in Example 02.
1. A study was made on the amount of converted sugar in a certain process at various temperatures.
The data were coded and recorded as follows:

Temperature, x 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
Converted Sugar, y 8.1 7.8 8.5 9.8 9.5 8.9 8.6 10.2 9.3 9.2 10.5

a. Estimate the linear regression line.


b. Estimate the mean amount of converted sugar produced when the coded temperature is 1.75.

2. The grades of a class of 9 students on a midterm report (x) and on the final examination (y) are
as follows:
x 77 50 71 72 81 94 96 99 67
y 82 66 78 34 47 85 99 99 68

a. Estimate the linear regression line.


b. Estimate the final examination grade of a student who received a grade of 85 on the midterm
report.

TOPIC 2: MULTIPLE LINEAR REGRESSION ANALYSIS

Multiple regression analysis is used to see if there is a statistically significant relationship


between sets of variables. It’s used to find trends in those sets of data. Multiple regression analysis
is almost the same as simple linear regression. The only difference between simple linear regression
and multiple regression is in the number of predictors (“x” variables) used in the regression.
• Simple regression analysis uses a single x variable for each dependent “y” variable. For example: (x1, Y1).
• Multiple regression uses multiple “x” variables for each dependent “y” variable: ((x1)1, (x2)1, (x3)1, Y1).

In one-variable linear regression, you would input one dependent variable (e.g., “sales”) against an independent variable (e.g., “profit”). But you might be interested in how different types of sales affect the regression. You could set your X1 as one type of sales, your X2 as another type of sales, and so on.

When to Use Multiple Regression Analysis


Ordinary linear regression usually isn’t enough to take into account all of the real-life factors
that have an effect on an outcome. For example, the following graph plots a single variable (number of
doctors) against another variable (life-expectancy of women).

[Scatter plot of the number of doctors against the life expectancy of women. Image: Columbia University]

From this graph it might appear there is a relationship between life-expectancy of women and
the number of doctors in the population. In fact, that’s probably true and you could say it’s a simple fix:
put more doctors into the population to increase life expectancy. But the reality is you would have to
look at other factors like the possibility that doctors in rural areas might have less education or
experience. Or perhaps they have a lack of access to medical facilities like trauma centers.

The addition of those extra factors would cause you to add additional independent variables to your regression analysis and create a multiple regression analysis model.

Multiple Regression Analysis Output


Regression analysis is always performed in software, like Excel or SPSS. The output differs
according to how many variables you have but it’s essentially the same type of output you would find
in a simple linear regression. There’s just more of it:
• Simple regression: Y = b0 + b1x.
• Multiple regression: Y = b0 + b1x1 + b2x2 + ⋯ + bnxn.

The output would include a summary, similar to a summary for simple linear regression, that
includes:
• R (the multiple correlation coefficient),


• R squared (the coefficient of determination),


• adjusted R-squared,
• The standard error of the estimate.

These statistics help you figure out how well a regression model fits the data.
The ANOVA table in the output would give you the p-value and F-statistic.

MULTIPLE LINEAR REGRESSION MODELS


In most research problems where regression analysis is applied, more than one independent
variable is needed in the regression model. The complexity of most scientific mechanisms is such that
in order to be able to predict an important response, a multiple regression model is needed. When this
model is linear in the coefficients, it is called a multiple linear regression model. For the case of k
independent variables 𝑥1 , 𝑥2 , ⋯ 𝑥𝑘 , the mean of 𝑌|𝑥1,𝑥2,⋯𝑥𝑘 is given by the multiple linear regression
model.

Multiple Linear Regression Model:

\mu_{Y|x_1, x_2, \ldots, x_k} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k

or

y_i = \hat{y}_i + e_i = b_0 + b_1 x_{1i} + \cdots + b_k x_{ki} + e_i

where εi and ei are the random error and residual, respectively, associated with the response yi and fitted value ŷi.
Normal Estimation Equations for Multiple Linear Regression: These equations can be solved for b0, b1, b2, . . . , bk by any appropriate method for solving systems of linear equations.

n b_0 + b_1 \sum_{i=1}^{n} x_{1i} + b_2 \sum_{i=1}^{n} x_{2i} + \cdots + b_k \sum_{i=1}^{n} x_{ki} = \sum_{i=1}^{n} y_i

b_0 \sum_{i=1}^{n} x_{1i} + b_1 \sum_{i=1}^{n} x_{1i}^2 + b_2 \sum_{i=1}^{n} x_{1i} x_{2i} + \cdots + b_k \sum_{i=1}^{n} x_{1i} x_{ki} = \sum_{i=1}^{n} x_{1i} y_i

⋮

b_0 \sum_{i=1}^{n} x_{ki} + b_1 \sum_{i=1}^{n} x_{ki} x_{1i} + b_2 \sum_{i=1}^{n} x_{ki} x_{2i} + \cdots + b_k \sum_{i=1}^{n} x_{ki}^2 = \sum_{i=1}^{n} x_{ki} y_i
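In matrix form the normal equations are (X'X)b = X'y, which numerical libraries solve directly; a minimal Python sketch with made-up data for two regressors:

```python
import numpy as np

# made-up data with two regressors x1, x2 and response y
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([4.1, 5.0, 8.9, 9.2, 13.8, 13.1])

# design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# solve the normal equations (X'X) b = X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
print("b0, b1, b2 =", np.round(b, 4))

# np.linalg.lstsq solves the same least squares problem and is numerically safer
b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```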

Example: A study was done on a diesel-powered light-duty pickup truck to see if humidity,
air temperature, and barometric pressure influence emission of nitrous oxide (in
ppm). Emission measurements were taken at different times, with varying
experimental conditions. The data are given in the table below. The model is
\mu_{Y|x_1, x_2, x_3} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3

or, equivalently,

y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \epsilon_i, \quad i = 1, 2, \ldots, 20.

Fit this multiple linear regression model to the given data and then estimate the
amount of nitrous oxide emitted for the conditions where humidity is 50%,
temperature is 76◦F, and barometric pressure is 29.30.


Nitrous Oxide, y   Humidity, x1   Temp., x2   Pressure, x3     Nitrous Oxide, y   Humidity, x1   Temp., x2   Pressure, x3
0.90 72.4 76.3 29.18 1.07 23.2 76.8 29.38
0.91 41.6 70.3 29.35 0.94 47.4 86.6 29.35
0.96 34.3 77.1 29.24 1.10 31.5 76.9 29.63
0.89 35.1 68.0 29.27 1.10 10.6 86.3 29.56
1.00 10.7 79.0 29.78 1.10 11.2 86.0 29.48
1.10 12.9 67.4 29.39 0.91 73.3 76.3 29.40
1.15 8.3 66.8 29.69 0.87 75.4 77.9 29.28
1.03 20.1 76.9 29.48 0.78 96.6 78.7 29.29
0.77 72.2 77.7 29.09 0.82 107.4 86.8 29.03
1.07 24.0 67.7 29.60 0.95 54.9 70.9 29.37

The solution of the set of estimating equations yields the unique estimates

b0 = −3.507778, b1 = −0.002625, b2 = 0.000799, b3 = 0.154155.

For 50% humidity, a temperature of 76°F, and a barometric pressure of 29.30, the estimated nitrous oxide emission is therefore ŷ = −3.507778 − 0.002625(50) + 0.000799(76) + 0.154155(29.30) ≈ 0.94 ppm.

Polynomial Regression: Now suppose that we wish to fit the polynomial equation

\mu_{Y|x} = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_r x^r

to the n pairs of observations {(xi, yi); i = 1, 2, . . . , n}. Each observation, yi, satisfies the equation

y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_r x_i^r + \epsilon_i

or

y_i = \hat{y}_i + e_i = b_0 + b_1 x_i + b_2 x_i^2 + \cdots + b_r x_i^r + e_i

where r is the degree of the polynomial, and εi and ei are again the random error and residual associated with the response yi and fitted value ŷi, respectively. Here, the number of pairs, n, must be at least as large as r + 1, the number of parameters to be estimated.

Example 1: Given the data

x 0 1 2 3 4 5 6 7 8 9
y 9.1 7.3 3.2 4.6 4.8 2.9 5.7 7.1 8.8 10.2

fit a regression curve of the form


𝝁𝒀|𝒙 = 𝜷𝟎 + 𝜷𝟏 𝒙 + 𝜷𝟐 𝒙𝟐
and then estimate μY |2.

Solving the normal equations, we obtain

b0 = 8.698, b1 = −2.341, b2 = 0.288,

so the fitted curve is ŷ = 8.698 − 2.341x + 0.288x², and the estimate of the mean response at x = 2 is 8.698 − 2.341(2) + 0.288(2²) ≈ 5.17.
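A quadratic fit like this can also be obtained with a one-line polynomial fit; the sketch below uses numpy.polyfit on the data above:

```python
import numpy as np

x = np.arange(10, dtype=float)
y = np.array([9.1, 7.3, 3.2, 4.6, 4.8, 2.9, 5.7, 7.1, 8.8, 10.2])

# quadratic least squares fit; polyfit returns coefficients from highest degree down
b2, b1, b0 = np.polyfit(x, y, 2)
print(f"y-hat = {b0:.3f} + {b1:.3f} x + {b2:.3f} x^2")   # about 8.698 - 2.341x + 0.288x^2

# estimate of the mean response at x = 2
print(f"estimate at x = 2: {b0 + 2 * b1 + 4 * b2:.2f}")
```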

Example 2: The data in the table below represents the percent of impurities that resulted for
various temperatures and sterilizing times during a reaction associated with the
manufacturing of a certain beverage. Estimate the regression coefficients in the
polynomial model

𝒚𝒊 = 𝜷𝟎 + 𝜷𝟏 𝒙𝟏𝒊 + 𝜷𝟐 𝒙𝟐𝒊 + 𝜷𝟏𝟏 𝒙𝟏𝒊 𝟐 + 𝜷𝟐𝟐 𝒙𝟐𝒊 𝟐 + 𝜷𝟏𝟐 𝒙𝟏𝒊 𝒙𝟐𝒊 + 𝝐𝒊


for i = 1, 2, . . . , 18.

Sterilizing Time,    Temperature, x1 (°C)
x2 (min)             75        100       125
15                   14.05     10.55     7.55
                     14.93     9.48      6.59
20                   16.56     13.63     9.23
                     15.85     11.75     8.78
25                   22.41     18.55     15.93
                     21.66     17.89     16.44

Using the normal equations, we obtain

b0 = 56.4411, b1 = −0.36190, b2 = −2.75299,
b11 = 0.00081, b22 = 0.08173, b12 = 0.00314.

HYPOTHESIS TESTS IN MULTIPLE LINEAR REGRESSION


Hypotheses for the ANOVA Test: The test for significance of regression is a test to determine whether a linear relationship exists between the response variable y and a subset of the regressor variables x1, x2, … , xk. The appropriate hypotheses are

Ho: β1 = β2 = ⋯ = βk = 0
Ha: βj ≠ 0 for at least one j

Rejection of Ho implies that at least one of the regressor variables contributes significantly to the model.

Test Statistic for ANOVA:

F_0 = \frac{SSR / k}{SSE / (n - p)} = \frac{MSR}{MSE}

We should reject Ho if the computed value of the test statistic F0 is greater than f_{\alpha, k, n-p}.

Analysis of Variance for Testing Significance of Regression in Multiple Regression

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F0
Regression            SSR              k                    MSR           MSR / MSE
Error or Residual     SSE              n − p                MSE
Total                 SST              n − 1

Coefficient of Multiple Determination, R²: The coefficient of multiple determination, R², can also be used as a global statistic to assess the fit of the model. Computationally,

R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}

Adjusted R²: Many regression users prefer to use an adjusted R² statistic:

R_{adj}^2 = 1 - \frac{SSE / (n - p)}{SST / (n - 1)}


Because SSE/(n − p) is the error or residual mean square and SST/(n − 1) is a constant, R²adj will only increase when a variable is added to the model if the new variable reduces the error mean square. The adjusted R²
statistic essentially penalizes the analyst for adding terms to the model. It is an easy way to guard against
overfitting, that is, including regressors that are not really useful. Consequently, it is very useful in
comparing and evaluating competing regression models.
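As a small numeric sketch of the two formulas (the SSE, SST, n, and p values here are made up):

```python
# made-up fit with n observations and p = k + 1 estimated parameters
n, p = 20, 4
sse, sst = 12.5, 80.0

r2 = 1 - sse / sst                              # ordinary R-squared
r2_adj = 1 - (sse / (n - p)) / (sst / (n - 1))  # adjusted R-squared
print(round(r2, 3), round(r2_adj, 3))
```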

PREDICTION OF NEW OBSERVATIONS


A regression model can be used to predict new or future observations on the response variable
Y corresponding to particular values of the independent variables, say, x01, x02, … , x0k. If 𝑥0′ =[x01, x02,
… , x0k], a point estimate of the future observation Y0 at the point x01, x02, … , x0k is 𝑦̂0 = 𝑥0′ 𝛽̂

Prediction Interval: A 100(1 − α)% prediction interval for this future observation is

\hat{y}_0 - t_{\alpha/2, n-p} \sqrt{\hat{\sigma}^2 \left(1 + x_0' (X'X)^{-1} x_0\right)} \le Y_0 \le \hat{y}_0 + t_{\alpha/2, n-p} \sqrt{\hat{\sigma}^2 \left(1 + x_0' (X'X)^{-1} x_0\right)}

TOPIC 3: MULTICOLLINEARITY

Multicollinearity generally occurs when there are high correlations between two or more predictor variables. In other words, one predictor variable can be used to predict the other. This creates redundant information, skewing the results in a regression model. Examples of correlated predictor variables (also called multicollinear predictors) are: a person's height and weight, age and sales price of a car, or years of education and annual income.

An easy way to detect multicollinearity is to calculate correlation coefficients for all pairs of
predictor variables. If the correlation coefficient, r, is exactly +1 or -1, this is called perfect
multicollinearity. If r is close to or exactly -1 or +1, one of the variables should be removed from the
model if at all possible.
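A small sketch of this check in Python; the data here are made up, with the third predictor deliberately constructed from the first so that one pairwise correlation comes out close to +1:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 0.9 * x1 + 0.1 * rng.normal(size=100)   # nearly a multiple of x1

X = np.column_stack([x1, x2, x3])

# pairwise correlation coefficients for all predictor pairs;
# the (x1, x3) entry will be close to +1, flagging multicollinearity
print(np.round(np.corrcoef(X, rowvar=False), 3))
```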

It’s more common for multicollinearity to rear its ugly head in observational studies; it’s less
common with experimental data. When the condition is present, it can result in unstable and unreliable
regression estimates. Several other problems can interfere with analysis of results, including:
• The t-statistic will generally be very small and coefficient confidence intervals will be
very wide. This means that it is harder to reject the null hypothesis.
• The partial regression coefficient may be an imprecise estimate; standard errors may be very large.
• Partial regression coefficients may have sign and/or magnitude changes as they pass
from sample to sample.
• Multicollinearity makes it difficult to gauge the effect of independent
variables on dependent variables.

What Causes Multicollinearity?


The two types are:
• Data-based multicollinearity: caused by poorly designed experiments, data that is 100% observational, or data collection methods that cannot be manipulated. In some cases, variables may be highly correlated (usually due to collecting data from purely observational studies) and there is no error on the researcher's part. For this reason, you should conduct experiments whenever possible, setting the level of the predictor variables in advance.
• Structural multicollinearity: caused by you, the researcher, creating new predictor
variables.

Causes for multicollinearity can also include:


• Insufficient data. In some cases, collecting more data can resolve the issue.
• Dummy variables may be incorrectly used. For example, the researcher may fail to
exclude one category, or add a dummy variable for every category (e.g. spring, summer,
autumn, winter).
• Including a variable in the regression that is actually a combination of two other
variables. For example, including “total investment income” when total investment
income = income from stocks and bonds + income from savings interest.
• Including two identical (or almost identical) variables. For example, weight in pounds
and weight in kilos, or investment income and savings/bond income.

References:
• D.C. Montgomery and G.C. Runger, Applied Statistics and Probability for Engineers, 5th
Edition, John Wiley & Sons, Inc., 2011.
• R.E. Walpole, R.H. Myers, S.L. Myers, and K. Ye, Probability and Statistics for Engineers and Scientists, 9th Edition, Pearson International Edition, 2012.
• Zulueta, F. M. and Nestor Edilberto B. Costales, Jr. (2005). Methods of Research: Thesis
Writing and Applied Statistics. Mandaluyong City: National Bookstore, Inc.
• https://www.statisticshowto.com/probability-and-statistics/regression-analysis/
• https://www.statisticshowto.com/multicollinearity/
