3.1-Multivariate-Analysis

The document discusses multivariate analysis and regression techniques used in engineering data analysis, focusing on constructing scatter diagrams, applying linear regression, and understanding the method of least squares. It explains the concepts of deterministic relationships, correlation coefficients, and the coefficient of determination, along with practical examples and calculations. The document emphasizes the importance of estimating parameters and interpreting regression results in various contexts.

Uploaded by

nay shiii

Multivariate Analysis

Engineering Data Analysis


Objectives
Design experiments involving several factors.

At the end of the lesson, the students are expected to:
• Construct a scatter diagram;
• Use simple linear regression to build empirical models of engineering and scientific data;
• Understand how the method of least squares is used to estimate the parameters in a linear regression model; and
• Interpret the different values obtained.
Deterministic Relationship
A model that predicts a variable perfectly
Example:
• The displacement (d_t) of a particle at a certain time t is related to its velocity:
d_t = d_0 + vt
where
• d_0 = displacement of the particle from the origin at time t = 0; and
• v = velocity.
Regression Analysis
• The collection of statistical tools that are used to model and explore relationships between variables that are related in a nondeterministic manner
• Used because there are many situations where the relationship between variables is not deterministic
Examples:
- The electrical energy consumption of a house (y) is related to the size of the house (x, in ft²).
- The fuel usage of an automobile (y) is related to the vehicle weight (x).
• Single regressor variable or predictor variable x and a dependent or response variable Y
• The expected value of Y for each value of x is
E(Y|x) = β0 + β1x,
where the intercept β0 and slope β1 are unknown regression coefficients.
• We assume Y can be described by the model
$$Y = \beta_0 + \beta_1 x + \epsilon \tag{11-2}$$
where ϵ is a random error with mean zero and (unknown) variance σ².
Method of Least Squares
• The random errors corresponding to different
observations are also assumed to be
uncorrelated random variables.
• The regression model may be thought of as an empirical model.
• Suppose that we have n pairs of observations (x1, y1), (x2, y2), …, (xn, yn). See Fig. 11-3.
• The estimates of β0 and β1 should result in a line
that is (in some sense) a “best fit” to the data.
• German scientist Karl Gauss (1777-1855)
proposed estimating the parameters β0 and β1 in
Equation 11-2 to minimize the sum of squares of
the vertical deviations in Fig. 11-3.
• This criterion for estimating the regression
coefficients is called the method of least
squares.
Method of Least Squares
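• Using Equation 11-2 ($Y = \beta_0 + \beta_1 x + \epsilon$), we may express the n observations in the sample as
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, 2, \ldots, n \tag{11-3}$$
• and the sum of the squares of the deviations of the observations from the true regression line is
$$L = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \tag{11-4}$$
• The least squares estimators of $\beta_0$ and $\beta_1$, say $\hat\beta_0$ and $\hat\beta_1$, must satisfy
$$\left.\frac{\partial L}{\partial \beta_0}\right|_{\hat\beta_0, \hat\beta_1} = -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$$
$$\left.\frac{\partial L}{\partial \beta_1}\right|_{\hat\beta_0, \hat\beta_1} = -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)\, x_i = 0 \tag{11-5}$$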
Method of Least Squares
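• Simplifying Equations (11-5) gives
$$n \hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
$$\hat\beta_0 \sum_{i=1}^{n} x_i + \hat\beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i \tag{11-6}$$
• Equations 11-6 are the least squares normal equations. Solving them gives
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x \tag{11-7}$$
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} y_i x_i - \dfrac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \tag{11-8}$$
where $\bar y = (1/n) \sum_{i=1}^{n} y_i$ and $\bar x = (1/n) \sum_{i=1}^{n} x_i$.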
Least Squares Estimates
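• Notationally, it is occasionally convenient to give special symbols to the numerator and denominator of Equation 11-8. Given data (x1, y1), (x2, y2), …, (xn, yn), let
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar x)^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \tag{11-10, denominator}$$
and
$$S_{xy} = \sum_{i=1}^{n} (y_i - \bar y)(x_i - \bar x) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \tag{11-11, numerator}$$
so that
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}$$
• The fitted or estimated regression line is
$$\hat y = \hat\beta_0 + \hat\beta_1 x \tag{11-9}$$
• Note that each pair of observations satisfies the relationship
$$y_i = \hat\beta_0 + \hat\beta_1 x_i + e_i, \quad i = 1, 2, \ldots, n$$
where $e_i = y_i - \hat y_i$ is called the residual.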
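The estimates in Equations 11-7 and 11-8 can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `least_squares` and the toy data are assumptions, and the toy points were chosen to lie exactly on y = 1 + 2x so that the result is easy to check by eye.

```python
def least_squares(xs, ys):
    """Return (b0, b1), the least squares intercept and slope.

    Uses the computing forms of Sxx and Sxy (Equations 11-10 and 11-11).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b1 = s_xy / s_xx          # slope, Equation 11-8
    b0 = y_bar - b1 * x_bar   # intercept, Equation 11-7
    return b0, b1


# Toy data (illustrative only, not from the slides): points on y = 1 + 2x.
toy_x = [0.0, 1.0, 2.0, 3.0]
toy_y = [1.0, 3.0, 5.0, 7.0]

b0, b1 = least_squares(toy_x, toy_y)
residuals = [y - (b0 + b1 * x) for x, y in zip(toy_x, toy_y)]
print(b0, b1, residuals)
```

Because the toy points are collinear, every residual is zero; with real data the residuals are the vertical deviations that the method minimizes in the squared sense.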
Fitted or Estimated Regression Line
11.2/398 The grades of a class of 9 students on
a midterm report (x) and on the final
examination (y) are as follows:
x 77 50 71 72 81 94 96 99 67
y 82 66 78 34 47 85 99 99 68

Estimate the linear regression line.


Estimate the final examination grade of a
student who received a grade of 85 on the
midterm report.
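One way to work this exercise is sketched below. The variable names are illustrative assumptions; the data are the nine (x, y) pairs given above, and the formulas are the Sxx and Sxy forms from the earlier slides.

```python
# Midterm (x) and final examination (y) grades from the exercise above.
midterm = [77, 50, 71, 72, 81, 94, 96, 99, 67]
final = [82, 66, 78, 34, 47, 85, 99, 99, 68]

n = len(midterm)
s_xx = sum(x * x for x in midterm) - sum(midterm) ** 2 / n
s_xy = sum(x * y for x, y in zip(midterm, final)) - sum(midterm) * sum(final) / n

b1 = s_xy / s_xx                              # estimated slope
b0 = sum(final) / n - b1 * sum(midterm) / n   # estimated intercept
pred_85 = b0 + b1 * 85                        # fitted final grade at x = 85

print(f"y_hat = {b0:.4f} + {b1:.4f} x")
print(f"estimated final grade for a midterm grade of 85: {pred_85:.1f}")
```

The fitted line comes out close to ŷ = 12.06 + 0.777x, so a midterm grade of 85 corresponds to an estimated final grade of about 78.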
Fitted or Estimated Regression Line
•10-11/424 An article in the Journal of Monetary
Economics assesses the relationship between
percentage growth in wealth over a decade and a
half of savings for baby boomers of age 40 to 55
with these people’s income quartiles. The article
presents a table showing five income quartiles, and
for each quartile there is a reported percentage
growth in wealth. The data are as follows.
Income quartile 1 2 3 4 5
Wealth growth (%) 17.3 23.6 40.2 45.8 56.8
Run a simple linear regression of these five pairs of
numbers and estimate a linear relationship between
income and percentage growth in wealth.
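A short sketch of this regression follows, treating the quartile number as x and wealth growth as y. The variable names are assumptions made for illustration; the centered forms of Sxx and Sxy are used because the mean of x is exactly 3.

```python
# Income quartile (x) and percentage wealth growth (y) from the exercise above.
quartile = [1, 2, 3, 4, 5]
growth = [17.3, 23.6, 40.2, 45.8, 56.8]

n = len(quartile)
x_bar = sum(quartile) / n
y_bar = sum(growth) / n

s_xx = sum((x - x_bar) ** 2 for x in quartile)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(quartile, growth))

b1 = s_xy / s_xx         # growth gained per step up in income quartile
b0 = y_bar - b1 * x_bar  # intercept

print(f"growth_hat = {b0:.2f} + {b1:.2f} * quartile")
```

Under these definitions the estimated relationship is roughly ŷ = 6.38 + 10.12x, that is, about ten percentage points of additional wealth growth per quartile.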
Fitted or Estimated Regression Line
• 10-12/424 A financial analyst at Goldman Sachs
ran a regression analysis of monthly returns on a
certain investment (Y) versus returns for the
same month on the Standard & Poor’s index (X).
The regression results included Sxx = 765.98 and
Sxy = 934.49. Give the least-squares estimate of
the regression slope parameter.
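Since only Sxx and Sxy are reported, the slope follows directly from β̂1 = Sxy/Sxx. The snippet below is a one-line check with illustrative variable names.

```python
# Summary quantities reported in the exercise above.
s_xx = 765.98
s_xy = 934.49

b1 = s_xy / s_xx  # least squares slope, Sxy / Sxx
print(f"estimated slope: {b1:.4f}")
```

The intercept cannot be recovered from these two numbers alone, because it also needs the sample means of X and Y.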
Correlation
• The degree of linear association between
the two random variables X and Y
• Indicated by the correlation coefficient
• ρ is the population (true) correlation
coefficient, estimated by r, the sample
correlation coefficient or Pearson product-
moment correlation coefficient
• ρ can take on any value from −1, through 0,
to 1.
Possible Interpretations of ρ
• When ρ is equal to zero, there is no correlation. That is, there is no linear relationship between the two random variables.
• When ρ = 1, there is a perfect, positive, linear relationship between the two variables. That is, whenever one of the variables, X or Y, increases, the other variable also increases; and whenever one of the variables decreases, the other one must also decrease.
• When ρ = −1, there is a perfect negative linear relationship between X and Y. When X or Y increases, the other variable decreases; and when one decreases, the other one must increase.
• When the value of ρ is between 0 and 1 in absolute value, it reflects the relative strength of the linear relationship between the two variables. For example, a correlation of 0.90 implies a relatively strong positive relationship between the two variables. A correlation of −0.70 implies a weaker, negative (as indicated by the minus sign), linear relationship. A correlation ρ = 0.30 implies a relatively weak (positive) linear relationship between X and Y.
Correlation
Sample Correlation Coefficient
• The estimate of ρ
• Also referred to as the Pearson product-moment correlation coefficient

$$r = \hat\beta_1 \sqrt{\frac{S_{xx}}{S_{yy}}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
Interpretations of r
±1.00 perfect positive (negative) correlation
±0.91 - ±0.99 very high positive (negative) correlation
±0.71 - ±0.90 high positive (negative) correlation
±0.51 - ±0.70 moderate positive (negative) correlation
±0.31 - ±0.50 low positive (negative) correlation
±0.01 - ±0.30 negligible positive (negative) correlation
0.00 no correlation
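The interpretation table can be encoded as a small lookup. This is a sketch under two assumptions that the slides do not state: the function name `interpret_r` is made up for illustration, and |r| is rounded to two decimals before the lookup because the table boundaries are given to two decimals.

```python
def interpret_r(r):
    """Return the verbal label from the interpretation table for a sample r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    size = round(abs(r), 2)  # table boundaries are stated to two decimals
    if size == 0.0:
        return "no correlation"
    sign = "positive" if r > 0 else "negative"
    if size == 1.0:
        strength = "perfect"
    elif size >= 0.91:
        strength = "very high"
    elif size >= 0.71:
        strength = "high"
    elif size >= 0.51:
        strength = "moderate"
    elif size >= 0.31:
        strength = "low"
    else:
        strength = "negligible"
    return f"{strength} {sign} correlation"


for value in (0.95, -0.6, 0.3, 0.0, -1.0):
    print(value, "->", interpret_r(value))
```

Labels of this kind are conventions rather than statistical tests, so they describe the size of r but say nothing about its significance.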
Coefficient of Determination
Denoted by r²
• A descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data
• Ordinarily, we do not use r² for inference about ρ².
Coefficient of Determination
11-13/400 A study of the amount of rainfall and the quantity of air pollution removed produced the following data:

Daily Rainfall, x (0.01 cm)    Particulate Removed, y (μg/m³)
4.3                            126
4.5                            121
5.9                            116
5.6                            118
6.1                            114
5.2                            118
3.8                            132
2.1                            141
7.5                            108

Find the equation of the regression line to predict the particulate removed from the amount of daily rainfall.
Estimate the amount of particulate removed when the daily rainfall is x = 4.8 units.
Calculate r.
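The three parts of this exercise can be worked with the same Sxx, Sxy and Syy quantities used throughout the slides. The sketch below uses illustrative variable names and only the standard library.

```python
import math

# Daily rainfall x (0.01 cm) and particulate removed y (micrograms per m^3).
rainfall = [4.3, 4.5, 5.9, 5.6, 6.1, 5.2, 3.8, 2.1, 7.5]
removed = [126, 121, 116, 118, 114, 118, 132, 141, 108]

n = len(rainfall)
x_bar = sum(rainfall) / n
y_bar = sum(removed) / n

s_xx = sum((x - x_bar) ** 2 for x in rainfall)
s_yy = sum((y - y_bar) ** 2 for y in removed)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(rainfall, removed))

b1 = s_xy / s_xx                  # slope
b0 = y_bar - b1 * x_bar           # intercept
pred_48 = b0 + b1 * 4.8           # fitted particulate removed at x = 4.8
r = s_xy / math.sqrt(s_xx * s_yy)  # sample correlation coefficient

print(f"y_hat = {b0:.3f} + ({b1:.3f}) x")
print(f"prediction at x = 4.8: {pred_48:.1f}")
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

The slope is negative and r is close to −1, so more rainfall goes with less particulate, and under the interpretation table above the correlation would be described as very high and negative.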
Estimating σ2
There is actually another unknown parameter in our regression model, σ² (the variance of the error term ϵ). The residuals $e_i = y_i - \hat y_i$ are used to obtain an estimate of σ². The sum of squares of the residuals, often called the error sum of squares, is
$$SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2 \tag{11-12}$$
Computing SSE using Equation 11-12 would be fairly tedious. A more convenient computing formula can be obtained by substituting $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ into Equation 11-12 and simplifying. The resulting computing formula is
$$SS_E = SS_T - \hat\beta_1 S_{xy} \tag{11-14}$$
where $SS_T = \sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} y_i^2 - n \bar y^2$ is the total sum of squares of the response variable y. Formulas such as this are presented in Section 11-4.
Estimator of Variance
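We can show that the expected value of the error sum of squares is E(SSE) = (n − 2)σ². Therefore, an unbiased estimator of σ² is
$$\hat\sigma^2 = \frac{SS_E}{n - 2} \tag{11-13}$$
Recall:
$$\sigma_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar x)^2}{n}, \quad \sigma_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar y)^2}{n}, \quad r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
and
$$SS_E = SS_T - \hat\beta_1 S_{xy}$$
From calculator:
$$SS_T = \sum_{i=1}^{n} (y_i - \bar y)^2 = S_{yy} = n \sigma_y^2$$
$$\hat\beta_1 = B$$
$$S_{xy} = r \sqrt{S_{xx} S_{yy}} = r \sqrt{n \sigma_x^2 \cdot n \sigma_y^2} = n r \sigma_x \sigma_y$$
Finally,
$$SS_E = n \sigma_y^2 - B\, n\, r\, \sigma_x \sigma_y$$
so that, from calculator outputs,
$$\hat\sigma^2 = \frac{n \sigma_y^2 - B\, n\, r\, \sigma_x \sigma_y}{n - 2}$$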
TABLE 11-1 Oxygen and Hydrocarbon Levels

Observation  Hydrocarbon   Purity    Observation  Hydrocarbon   Purity
Number       Level x (%)   y (%)     Number       Level x (%)   y (%)
1            0.99          90.01     11           1.19          93.54
2            1.02          89.05     12           1.15          92.52
3            1.15          91.43     13           0.98          90.56
4            1.29          93.74     14           1.01          89.54
5            1.46          96.73     15           1.11          89.85
6            1.36          94.45     16           1.20          90.39
7            0.87          87.59     17           1.26          93.25
8            1.23          91.77     18           1.32          93.41
9            1.55          99.42     19           1.43          94.98
10           1.40          93.65     20           0.95          87.33


TABLE 11 Software Output for the Oxygen Purity Data in Example 11-1

Purity = 74.3 + 14.9 HC Level

Predictor   Coef           SE Coef   T       P
Constant    74.283 (β̂0)   1.593     46.62   0.000
HC level    14.947 (β̂1)   1.317     11.35   0.000

S = 1.087   R-Sq = 87.7%   R-Sq (adj) = 87.1%

Analysis of Variance
Source           DF   SS            MS           F        P
Regression       1    152.13        152.13       128.86   0.000
Residual error   18   21.25 (SSE)   1.18 (σ̂²)
Total            19   173.38



Software Output for the Oxygen Purity Data
Predicted Values for New
Observations

New obs Fit SE Fit 95.0% CI 95.0% PI


1 89.231 0.354 (88.486, 89.975) (86.830, 91.632)
Values of Predictors for New
Observations
New obs HC Level
1 1.00
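The software output above can be reproduced from the Table 11-1 data with the formulas on the earlier slides. The sketch below is one way to do so in plain Python; the variable names are illustrative, and the software's own numerical method is not described in the slides, so agreement is expected only up to rounding.

```python
import math

# Hydrocarbon level x (%) and purity y (%) for observations 1 to 20 of Table 11-1.
hc = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
      1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
purity = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
          93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(hc)
x_bar = sum(hc) / n
y_bar = sum(purity) / n

s_xx = sum((x - x_bar) ** 2 for x in hc)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hc, purity))
ss_t = sum((y - y_bar) ** 2 for y in purity)  # total sum of squares

b1 = s_xy / s_xx                 # slope
b0 = y_bar - b1 * x_bar          # intercept
ss_e = ss_t - b1 * s_xy          # error sum of squares, Equation 11-14
sigma2_hat = ss_e / (n - 2)      # unbiased estimate of the error variance
r_sq = 1 - ss_e / ss_t           # coefficient of determination

se_b1 = math.sqrt(sigma2_hat / s_xx)
se_b0 = math.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / s_xx))
fit_at_1 = b0 + b1 * 1.00        # fitted purity at HC level 1.00

print(f"b0 = {b0:.3f} (se {se_b0:.3f}), b1 = {b1:.3f} (se {se_b1:.3f})")
print(f"SSE = {ss_e:.2f}, sigma^2 hat = {sigma2_hat:.2f}, R-sq = {100 * r_sq:.1f}%")
print(f"fit at HC level 1.00 = {fit_at_1:.3f}")
```

Each printed value can be matched against the Coef, SE Coef, Residual error and Fit entries in the output tables.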
Example
• 11-1/435 Diabetes and obesity are serious health concerns in the United States and much of the developed world. Measuring the amount of body fat a person carries is one way to monitor weight control progress, but measuring it accurately involves either expensive X-ray equipment or a pool in which to dunk the subject. Instead body mass index (BMI) is often used as a proxy for body fat because it is easy to measure: BMI = mass (kg)/(height (m))² = 703 mass (lb)/(height (in))². In a study of 250 men at Brigham Young University, both BMI and body fat were measured. Researchers found the following summary statistics:
$$\sum_{i=1}^{n} x_i = 6322.28 \qquad \sum_{i=1}^{n} x_i^2 = 162674.18$$
$$\sum_{i=1}^{n} y_i = 4757.90 \qquad \sum_{i=1}^{n} y_i^2 = 107679.27$$
$$\sum_{i=1}^{n} x_i y_i = 125471.10$$
• (a) Calculate the least squares estimates of the slope and intercept. Graph the regression line.
• (b) Use the equation of the fitted line to predict what body fat would be observed, on average, for a man with a BMI of 30.
• (c) Suppose that the observed body fat of a man with a BMI of 25 is 25%. Find the residual for that observation.
• (d) Was the prediction for the BMI of 25 in part (c) an overestimate or underestimate? Explain briefly.
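Parts (a) to (d) need only the summary statistics, since Sxx and Sxy have computing forms in terms of sums. The following sketch uses illustrative variable names and leaves out the graph requested in part (a).

```python
# Summary statistics from the exercise (x = BMI, y = body fat in %), n = 250.
n = 250
sum_x, sum_x2 = 6322.28, 162674.18
sum_y = 4757.90
sum_xy = 125471.10

s_xx = sum_x2 - sum_x ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

b1 = s_xy / s_xx                   # (a) slope
b0 = sum_y / n - b1 * sum_x / n    # (a) intercept
pred_30 = b0 + b1 * 30             # (b) mean body fat at BMI = 30
pred_25 = b0 + b1 * 25
resid_25 = 25.0 - pred_25          # (c) residual = observed - fitted

print(f"body_fat_hat = {b0:.3f} + {b1:.3f} * BMI")
print(f"(b) prediction at BMI 30: {pred_30:.2f}%")
print(f"(c) residual at BMI 25:  {resid_25:.2f}")
# (d) A positive residual means the fitted line lies below the observation.
print("(d) underestimate" if resid_25 > 0 else "(d) overestimate")
```

Because the residual in part (c) is positive, the fitted value is below the observed 25%, so the prediction is an underestimate.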
t-Tests
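Test Statistic for the Slope
$$T_0 = \frac{\hat\beta_1 - \beta_{1,0}}{\sqrt{\hat\sigma^2 / S_{xx}}} = \frac{\hat\beta_1 - \beta_{1,0}}{\operatorname{se}(\hat\beta_1)} \tag{11-19}$$
ν = n − 2
se(β̂1) is the standard error of the slope.
H0: β1 = β1,0   H1: β1 ≠ β1,0
Reject H0 if |t0| > t_{α/2, n−2}.

Test Statistic for the Intercept
$$T_0 = \frac{\hat\beta_0 - \beta_{0,0}}{\sqrt{\hat\sigma^2 \left[ \dfrac{1}{n} + \dfrac{\bar x^2}{S_{xx}} \right]}} = \frac{\hat\beta_0 - \beta_{0,0}}{\operatorname{se}(\hat\beta_0)} \tag{11-19}$$
ν = n − 2
se(β̂0) is the standard error of the intercept.
H0: β0 = β0,0   H1: β0 ≠ β0,0
Reject H0 if |t0| > t_{α/2, n−2}.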
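As a quick check of these test statistics, the T column of the oxygen purity software output shown earlier can be recomputed from its Coef and SE Coef columns. In the sketch below the helper name `t_statistic` is an illustrative assumption, the hypothesized values are taken to be zero, and the critical value 2.101 is the usual table entry for t with α/2 = 0.025 and 18 degrees of freedom.

```python
def t_statistic(estimate, hypothesized, std_err):
    """T0 = (estimate - hypothesized value) / standard error."""
    return (estimate - hypothesized) / std_err


# Rounded coefficients and standard errors from the software output (n = 20).
b0_hat, se_b0 = 74.283, 1.593
b1_hat, se_b1 = 14.947, 1.317

t_slope = t_statistic(b1_hat, 0.0, se_b1)
t_intercept = t_statistic(b0_hat, 0.0, se_b0)

T_CRIT = 2.101  # t_{0.025, 18} from a standard t table
reject_slope = abs(t_slope) > T_CRIT
reject_intercept = abs(t_intercept) > T_CRIT

print(f"t0 (slope) = {t_slope:.2f}, reject H0: {reject_slope}")
print(f"t0 (intercept) = {t_intercept:.2f}, reject H0: {reject_intercept}")
```

The recomputed values differ from the printed T column only in the last digit, because the inputs here are already rounded to three decimals.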
t-Tests
Special Case
H0: β1 = 0   H1: β1 ≠ 0
$$T_0 = \frac{\hat\beta_1}{\sqrt{\hat\sigma^2 / S_{xx}}} = \frac{\hat\beta_1}{\operatorname{se}(\hat\beta_1)}$$
• These hypotheses relate to the significance of regression. Failure to reject H0: β1 = 0 is equivalent to concluding that there is no linear relationship between x and Y.

Test for Zero Correlation
H0: ρ = 0   H1: ρ ≠ 0
$$T_0 = \frac{\hat\rho \sqrt{n - 2}}{\sqrt{1 - \hat\rho^2}}$$
Reject H0 if |t0| > t_{α/2, n−2}.
Example
11-41/446 An article in The Journal of Clinical
Endocrinology and Metabolism [“Simultaneous and
Continuous 24-Hour Plasma and Cerebrospinal Fluid
Leptin Measurements: Dissociation of Concentrations in
Central and Peripheral Compartments” (2004, Vol. 89, pp.
258−265)] reported on a study of the demographics of
simultaneous and continuous 24-hour plasma and
cerebrospinal fluid leptin measurements. The data follow:
y = BMI (kg/m2): 19.92, 20.59, 29.02, 20.78, 25.97, 20.39,
23.29, 17.27, 35.24
x = Age (yr): 45.5, 34.6, 40.6, 32.9, 28.2, 30.1, 52.1, 33.3,
47.0
(a) Test for significance of regression using α = 0.05. Find
the P-value for this test. Can you conclude that the model
specifies a useful linear relationship between these two
variables?
(b) Estimate σ² and the standard deviation of β̂1.
(c) What is the standard error of the intercept in this
model?
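A sketch of the computations for parts (a) to (c) follows, using only the standard library. The variable names are illustrative, and the critical value 2.365 is the usual table entry for t with α/2 = 0.025 and n − 2 = 7 degrees of freedom. The exact P-value needs the t distribution's tail probability, which this sketch does not compute.

```python
import math

# Data from the exercise: x = age (yr), y = BMI (kg/m^2).
age = [45.5, 34.6, 40.6, 32.9, 28.2, 30.1, 52.1, 33.3, 47.0]
bmi = [19.92, 20.59, 29.02, 20.78, 25.97, 20.39, 23.29, 17.27, 35.24]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(bmi) / n

s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, bmi))
ss_t = sum((y - y_bar) ** 2 for y in bmi)

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sigma2_hat = (ss_t - b1 * s_xy) / (n - 2)                    # (b) estimate of sigma^2
se_b1 = math.sqrt(sigma2_hat / s_xx)                         # (b) standard error of slope
se_b0 = math.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / s_xx))  # (c) standard error of intercept

t0 = b1 / se_b1      # (a) test statistic for H0: beta1 = 0
T_CRIT = 2.365       # t_{0.025, 7} from a standard t table
reject_h0 = abs(t0) > T_CRIT

print(f"b0 = {b0:.3f}, b1 = {b1:.4f}")
print(f"sigma^2 hat = {sigma2_hat:.2f}, se(b1) = {se_b1:.4f}, se(b0) = {se_b0:.3f}")
print(f"t0 = {t0:.3f}, reject H0 at alpha = 0.05: {reject_h0}")
```

With these data |t0| is well below the critical value, so H0: β1 = 0 is not rejected and the sample does not support a useful linear relationship between age and BMI.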
Summary
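A scatter diagram displays observations on two variables, x and y. Each observation is represented by a point showing its x-y coordinates. The scatter diagram can be very effective in revealing the joint variability of x and y or the nature of relationship between them.

The method of least squares is used to estimate the parameters of a system by minimizing the sum of the squares of the differences between the observed values and the fitted or predicted values from the system.

Fitted Simple Linear Regression Model
$$\hat y = \hat\beta_0 + \hat\beta_1 x$$
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} y_i x_i - \dfrac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{S_{xy}}{S_{xx}}$$
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
$$e_i = y_i - \hat y_i$$
where
$$\bar y = \frac{1}{n} \sum_{i=1}^{n} y_i \qquad \bar x = \frac{1}{n} \sum_{i=1}^{n} x_i$$
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar x)^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
$$S_{xy} = \sum_{i=1}^{n} (y_i - \bar y)(x_i - \bar x) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$
$$SS_E = SS_T - \hat\beta_1 S_{xy}$$
$$SS_T = \sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} y_i^2 - n \bar y^2 = S_{yy}$$
$$\hat\sigma^2 = \frac{SS_E}{n - 2}$$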
Summary
• Generally, correlation is a measure of the interdependence among data. The concept may include more than two variables. The term is most commonly used in a narrow sense to express the relationship between quantitative variables or ranks.
• The correlation coefficient (r) is a dimensionless measure of the linear association between two variables, usually lying in the interval from −1 to +1, with zero indicating the absence of correlation (but not necessarily the independence of the two variables).

Sample Correlation Coefficient
$$r = \hat\beta_1 \sqrt{\frac{S_{xx}}{S_{yy}}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$
The coefficient of determination (r²) is often used to judge the adequacy of a regression model. Its value tells that the model accounts for (100 × r²)% of the variability in the data.
Summary
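Test Statistic for the Slope
$$T_0 = \frac{\hat\beta_1 - \beta_{1,0}}{\sqrt{\hat\sigma^2 / S_{xx}}} = \frac{\hat\beta_1 - \beta_{1,0}}{\operatorname{se}(\hat\beta_1)}$$
• ν = n − 2
• se(β̂1) is the standard error of the slope.

Test Statistic for the Intercept
$$T_0 = \frac{\hat\beta_0 - \beta_{0,0}}{\sqrt{\hat\sigma^2 \left[ \dfrac{1}{n} + \dfrac{\bar x^2}{S_{xx}} \right]}} = \frac{\hat\beta_0 - \beta_{0,0}}{\operatorname{se}(\hat\beta_0)}$$
• ν = n − 2
• se(β̂0) is the standard error of the intercept.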
References
• Aczel-Sounderpandian. Business Statistics, 7th Ed. © 2008
• Montgomery and Runger. Applied Statistics and Probability for Engineers, 6th Ed. © 2014
• Walpole, et al. Probability and Statistics for Engineers and Scientists 9th Ed. © 2012, 2007, 2002
