Regression and Correlation


Introduction
In many statistical investigations, the main goal is to establish relationships that make it possible to predict one or more variables in terms of others. For example, studies are made to predict the future production of, say, hard disks in terms of their price, peak power load versus maximum temperature, and the water level in a particular river versus distance, measured every 50 km. It would be ideal if we could predict one quantity exactly in terms of another, but this is seldom possible. The problem of predicting the value of one variable from another is called regression.

The regression line is obtained by curve fitting, using the method of least squares discussed in the coming sections. Having learned how to fit a regression line to paired data, we face the problem of determining how well such a line fits the data. We can get some idea of this from a scatter diagram that shows the line together with the data. For example, if X represents the amount of money a firm spends yearly on advertising and Y represents its total yearly sales, we might ask whether a decrease in the advertising budget will decrease yearly sales. Correlation analysis attempts to measure the strength of such relationships between two variables by means of a single number called the correlation coefficient.
Scatter Diagram
Very often in practice a relationship is found to exist between two or more variables; for example, cost depends on sales, expenditure depends on income, and price depends on demand.
It is often desirable to express this relationship in mathematical form by determining an equation connecting these variables. For this purpose we prepare a scatter diagram.
On the scatter diagram, the original figures for which the relationship is to be found are plotted. This gives the investigator a visual impression of the distribution of the values (X and Y).
Example
The following data show the peak power load Y and maximum temperature X for 7 days.

Day                    1    2    3    4    5    6    7
Maximum temp. (X)     95   82   90   81   99  100   93
Peak power load (Y)  214  152  156  129  254  266  210
[Figure: Scatter Diagram of the example data, with maximum temperature (X) on the horizontal axis (0 to 120) and peak power load (Y) on the vertical axis (0 to 300).]
These points can be approximated by a straight line, so we say that a linear relationship exists between the two variables. Different individuals, however, could draw many different lines or curves through the same points. To avoid individual judgement in choosing a line or curve, a definition of the best-fitting line or curve is needed.
The Method of Least Squares
Of all the lines that approximate a given set of data points, the one for which
$$S = D_1^2 + D_2^2 + \cdots + D_n^2$$
is a minimum is called the least squares line. Here $D_i$ is the vertical deviation of the $i$-th data point from the line.
The general form of a linear equation with one independent variable can be written as
$$Y = a + bX$$
where a and b are constants, X is the independent variable and Y is the dependent variable.
To find the constants a and b we use the normal equations, which are given by:
$$\sum Y = na + b\sum X, \qquad \sum XY = a\sum X + b\sum X^2$$
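The normal equations follow from minimizing S with respect to a and b. The following is a brief sketch of that step, assuming each deviation is the vertical distance $D_i = Y_i - (a + bX_i)$ between a data point and the line:

```latex
\begin{aligned}
S(a,b) &= \sum_{i=1}^{n} \bigl(Y_i - a - bX_i\bigr)^2 \\
\frac{\partial S}{\partial a} &= -2\sum \bigl(Y_i - a - bX_i\bigr) = 0
  \;\Longrightarrow\; \sum Y = na + b\sum X \\
\frac{\partial S}{\partial b} &= -2\sum X_i\bigl(Y_i - a - bX_i\bigr) = 0
  \;\Longrightarrow\; \sum XY = a\sum X + b\sum X^2
\end{aligned}
```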
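These equations are solved simultaneously to obtain
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}$$
and
$$a = \frac{\sum Y - b\sum X}{n}$$
Sometimes a can also be computed by using the formula $a = \bar{Y} - b\bar{X}$, where $\bar{X}$ and $\bar{Y}$ are the means of the X and Y values respectively.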
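As a minimal sketch of this calculation, the following Python function solves the two normal equations from the raw sums. The function name `fit_line` and the five data pairs are made up for illustration and are not part of the original notes.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least squares line Y = a + bX."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope from the simultaneous solution of the normal equations.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept; algebraically the same as a = mean(Y) - b * mean(X).
    a = (sum_y - b * sum_x) / n
    return a, b


# Made-up illustrative data (not from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(f"Y = {a:.3f} + {b:.3f} X")
```

Working from the sums keeps the code close to the hand calculation, which makes it easy to compare intermediate values with a worked table.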
Regression Line
The regression line expresses the trend between two observed variables. On the basis of sample data, we want to estimate or predict the value of Y corresponding to a given value of X. The resulting line is called the regression line of Y on X. It is the same as the least squares line $Y = a + bX$, where a denotes the intercept and b the slope of the line.
For example, suppose we want to know the charges for electricity consumed. The meter rent is fixed irrespective of the electricity consumed; let it be Rs. 1.50 and denote it by a. The charge per unit is the rate of change and is denoted by b; here it is Rs. 0.40. The total electricity charge Y and the units consumed X are then related by the formula Y = 1.50 + 0.40X. With the help of this formula, we can find the charge for any amount of electricity consumed. A person who consumes 85 units of electricity will pay Rs. 1.50 + 0.40(85), or Rs. 35.50.
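The tariff formula can be written as a one-line function. This is only an illustrative sketch: the name `electricity_charge` is an assumption, while the figures Rs. 1.50 and Rs. 0.40 come from the example.

```python
def electricity_charge(units, meter_rent=1.50, rate_per_unit=0.40):
    """Total charge Y = a + bX, with a the fixed rent and b the unit rate."""
    return meter_rent + rate_per_unit * units


print(f"Rs. {electricity_charge(85):.2f}")  # Rs. 35.50
```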
Testing the Significance of b
While dealing with regression equations, we have not yet considered their statistical significance. The statistical significance of a regression equation depends on a and b, which are estimates of the intercept and slope coefficients of the population from which the sample has been drawn. Most commonly we test the significance of b, because the intercept a is of little interest.
The population coefficients are unknown, and the sample coefficients represent only one of the results that could have been obtained, depending on the sample drawn. However, we can still use the sample results to test the significance of the relationship between the variables X and Y.
Test of Hypothesis Concerning Linear
Regression
Assumptions:
Step 1: State the null and alternative hypotheses.
$H_0: B = 0$, where B is the slope of the population regression line; that is, the line is horizontal and X is not useful for predicting Y.
The alternative hypothesis is one of:
i. $H_1: B \neq 0$, that is, the slope of the population regression line is not zero.
ii. $H_1: B > 0$, that is, the population regression line slopes upward.
iii. $H_1: B < 0$, that is, the population regression line slopes downward.
Step 2: Decide the level of significance $\alpha$.
Step 3: Find the critical values using the t-distribution with $df = n - 2$.
i. The critical values for a two-tailed test are $\pm t_{\alpha/2}$.
ii. The critical value for a right-tailed test is $t_{\alpha}$.
iii. The critical value for a left-tailed test is $-t_{\alpha}$.
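Step 4: Compute the value of the test statistic
$$t = \frac{b}{S_b}$$
where $S_b$ stands for the standard error of b, given by
$$S_b = \frac{s_{y \cdot x}}{\sqrt{\sum (X - \bar{X})^2}}, \qquad s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$$
with $\hat{Y} = a + bX$ the value predicted by the fitted line.
If the value of the test statistic falls in the rejection region, reject $H_0$; otherwise do not reject $H_0$.
Step 5: State the conclusion in words.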
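The five steps can be sketched in Python for a two-tailed test. The sketch assumes the usual standard error of the slope, S_b = s / sqrt(Σ(X − X̄)²) with s² = Σ(Y − Ŷ)² / (n − 2), and takes the critical value as an input read from a t-table rather than computing it. The function name `slope_t_test` is hypothetical; the data are the seven-day temperature and load figures from the scatter diagram example.

```python
import math


def slope_t_test(xs, ys, t_critical):
    """Two-tailed test of H0: B = 0; returns (b, s_b, t, reject_h0)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Residual sum of squares around the fitted line.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))   # standard error of estimate
    s_b = s / math.sqrt(sxx)       # standard error of the slope
    t = b / s_b
    return b, s_b, t, abs(t) > t_critical


temps = [95, 82, 90, 81, 99, 100, 93]
loads = [214, 152, 156, 129, 254, 266, 210]
# 2.571 is the tabulated t value for alpha/2 = 0.025 with df = 7 - 2 = 5.
b, s_b, t, reject = slope_t_test(temps, loads, t_critical=2.571)
print(f"b = {b:.3f}, S_b = {s_b:.3f}, t = {t:.2f}, reject H0: {reject}")
```

For a one-tailed test, only the final comparison changes: compare t with the tabulated value for α instead of α/2, using the sign stated in the alternative hypothesis.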
Correlation
When two variables are so related that an increase or decrease in the value of one is accompanied by an increase or decrease in the value of the other, the variables are said to be correlated; in the case of two variables, the correlation is said to be simple (or linear). In other words, correlation measures the degree of interdependence between two variables.
Correlation is direct, or positive, if an increase (or decrease) in the value of one variable is associated with an increase (or decrease) in the other. If the reverse is the case, the variables are said to be negatively or inversely correlated. For example, the heights and weights of students are positively correlated, whereas price and demand are negatively correlated.
The correlation coefficient always lies between −1 and +1. The value +1.0 indicates perfect positive correlation, −1.0 indicates perfect negative correlation, and zero indicates that no linear relationship exists between the variables.
If, in a particular problem, the original values plotted on the scatter diagram fall as they do in either Graph One or Graph Two, the variables are correlated. Although calculation is needed to know the extent of the correlation, it is clear on visual inspection that some association exists.
Coefficient of Correlation
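To measure the degree of relationship between two variables when there are reasonable grounds to believe that they are correlated, Karl Pearson developed a formula called the coefficient of correlation, denoted by r:
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$$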
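A minimal Python sketch of Pearson's coefficient, written in the raw-sums form that the worked example later in these notes uses. The function name `pearson_r` is an assumption, and the data are the seven-day figures from the scatter diagram example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


temps = [95, 82, 90, 81, 99, 100, 93]
loads = [214, 152, 156, 129, 254, 266, 210]
print(f"r = {pearson_r(temps, loads):.3f}")
```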
Co-efficient of Determination
One useful way of looking at r is in terms of the coefficient of determination and the coefficient of non-determination. The coefficient of determination $r^2$, when multiplied by 100, gives the percentage of variance in Y that is associated with the variance in X. The coefficient of non-determination $(1 - r^2)$ indicates the proportion of variance in one variable that is independent of change in the second variable.
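To illustrate what the coefficient of determination measures, the sketch below computes r² in two ways for the seven-day temperature and load data: by squaring r, and as one minus the ratio of residual variation around the least squares line to total variation in Y. The helper name `r_squared_two_ways` is hypothetical.

```python
def r_squared_two_ways(xs, ys):
    """Return (r^2 from correlation, r^2 as 1 - SSE/SST) for a line fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square of Pearson's r.
    r2_from_r = sxy ** 2 / (sxx * syy)
    # Share of the variation in Y accounted for by the fitted line.
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r2_from_fit = 1 - sse / syy
    return r2_from_r, r2_from_fit


temps = [95, 82, 90, 81, 99, 100, 93]
loads = [214, 152, 156, 129, 254, 266, 210]
r2_a, r2_b = r_squared_two_ways(temps, loads)
print(f"r^2 = {r2_a:.4f} (from r), {r2_b:.4f} (from residuals)")
print(f"coefficient of non-determination = {1 - r2_a:.4f}")
```

The two values agree up to floating-point rounding, which is why r² can be read as the share of variation in Y associated with X.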
Probable Error (P.E.) and Its
Interpretation
If only a few items are used in calculating the coefficient of correlation, the result will not be sufficiently reliable as a basis for comparisons. It is therefore useful to calculate the probable error of the coefficient of correlation to guard against false conclusions.
$$P.E. = \pm 0.6745\,\frac{1 - r^2}{\sqrt{n}}$$
Now,
i. If the coefficient of correlation r is less than P.E., there is no evidence of correlation between the two variables.
ii. If r is more than P.E., correlation exists. However, if r is less than 0.20, the correlation is not appreciable.
iii. If r is more than 6 times P.E., the correlation is highly significant.
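The probable error and the three rules of thumb can be sketched as follows. The function names are hypothetical. Two choices here are assumptions rather than something the notes specify: the rules are applied to the absolute value of r, and when both "more than 6 times P.E." and "less than 0.20" hold, the first is reported.

```python
import math


def probable_error(r, n):
    """Probable error of r: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


def interpret(r, n):
    """Apply the three rules of thumb to |r| (ordering is an assumption)."""
    pe = probable_error(r, n)
    size = abs(r)
    if size < pe:
        return "no correlation"
    if size > 6 * pe:
        return "highly significant"
    if size < 0.20:
        return "not appreciable"
    return "correlation exists"


# Hypothetical values chosen only to exercise the rules.
r, n = 0.5, 25
print(f"P.E. = {probable_error(r, n):.4f}, verdict: {interpret(r, n)}")
```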
Testing the Significance of r
The sample correlation coefficient r is a value computed from a random sample of size n. Different random samples of size n from the same population will generally produce different values of r. Hence r is an estimate of $\rho$ (rho), the population correlation coefficient. In testing the significance of r, we take $\rho = 0$ as the null hypothesis, that is, we suppose that no linear relationship exists between the variables, and test whether the observed value of r differs significantly from this value.
Test of Hypothesis Concerning Linear
Correlation
Assumptions:
Step 1: State the null and alternative hypotheses.
$H_0: \rho = 0$, that is, the variables are not linearly correlated; no linear relationship exists between them.
The alternative hypothesis is one of:
i. $H_1: \rho \neq 0$, that is, the variables are linearly correlated.
ii. $H_1: \rho > 0$, that is, the variables are positively linearly correlated.
iii. $H_1: \rho < 0$, that is, the variables are negatively linearly correlated.
Step 2: Decide the level of significance $\alpha$.
Step 3: Find the critical values using the t-distribution with $df = n - 2$.
i. The critical values for a two-tailed test are $\pm t_{\alpha/2}$.
ii. The critical value for a right-tailed test is $t_{\alpha}$.
iii. The critical value for a left-tailed test is $-t_{\alpha}$.
Step 4: Compute the value of the test statistic
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
If the value of the test statistic falls in the rejection region, reject $H_0$; otherwise do not reject $H_0$.
Step 5: State the conclusion in words.
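The test statistic and the decision rule for each kind of alternative hypothesis can be sketched as below. The function names, the `tail` parameter, and the inputs r = 0.60 and n = 18 are hypothetical; the critical value is read from a t-table and passed in.

```python
import math


def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


def reject_h0(t, t_critical, tail="two"):
    """Decision rule; t_critical is the positive tabulated value
    (t_{alpha/2} for a two-tailed test, t_alpha for a one-tailed test)."""
    if tail == "two":
        return abs(t) > t_critical
    if tail == "right":
        return t > t_critical
    if tail == "left":
        return t < -t_critical
    raise ValueError("tail must be 'two', 'right' or 'left'")


# Hypothetical inputs: r = 0.60 from a sample of n = 18 pairs.
t = correlation_t(0.60, 18)
# 2.120 is the tabulated t value for alpha/2 = 0.025 with df = 16.
print(f"t = {t:.2f}, reject H0: {reject_h0(t, 2.120)}")
```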
Example#02
The following data show the peak power load Y and maximum temperature X for 7 days.

Day                    1    2    3    4    5    6    7
Maximum temp. (X)     95   82   90   81   99  100   93
Peak power load (Y)  214  152  156  129  254  266  210

i. Find the estimated peak power load when X = 92, that is, when the maximum temperature is 92°F.
ii. Compute the coefficient of correlation between peak power load and the day's maximum temperature.
iii. At the 5% significance level, do the data provide sufficient evidence to conclude that the day's maximum temperature is useful as a predictor of peak power load?
iv. At the 5% significance level, do the data provide sufficient evidence to conclude that peak power load and the day's maximum temperature are linearly correlated?
Solution
(i)

Day   Max. temp. X   Peak power load Y       XY      X²       Y²
1          95               214            20330    9025    45796
2          82               152            12464    6724    23104
3          90               156            14040    8100    24336
4          81               129            10449    6561    16641
5          99               254            25146    9801    64516
6         100               266            26600   10000    70756
7          93               210            19530    8649    44100

Sum       640              1381           128559   58860   289249
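The working table can be reproduced with a few lines of Python, which is a convenient way to check the column sums before using them in the formulas. Variable names are illustrative.

```python
temps = [95, 82, 90, 81, 99, 100, 93]           # X
loads = [214, 152, 156, 129, 254, 266, 210]     # Y

# One row per day: X, Y, XY, X^2, Y^2.
rows = [(x, y, x * y, x * x, y * y) for x, y in zip(temps, loads)]
# Column totals: sum X, sum Y, sum XY, sum X^2, sum Y^2.
totals = tuple(sum(column) for column in zip(*rows))

print(f"{'Day':>3} {'X':>5} {'Y':>5} {'XY':>7} {'X^2':>6} {'Y^2':>7}")
for day, row in enumerate(rows, start=1):
    print(f"{day:>3} {row[0]:>5} {row[1]:>5} {row[2]:>7} {row[3]:>6} {row[4]:>7}")
print(f"{'Sum':>3} {totals[0]:>5} {totals[1]:>5} {totals[2]:>7} "
      f"{totals[3]:>6} {totals[4]:>7}")
```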


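The equation of the regression line of Y on X is given by
$$Y = a + bX$$
where
$$b = \frac{n\sum XY - \sum X\sum Y}{n\sum X^2 - (\sum X)^2} = \frac{7(128559) - (640)(1381)}{7(58860) - (640)^2} = \frac{16073}{2420} \approx 6.641$$
and
$$a = \bar{Y} - b\bar{X} = \frac{1381}{7} - 6.641\left(\frac{640}{7}\right) = 197.286 - 607.177 = -409.891$$
Hence, for X = 92,
$$Y = a + bX = -409.891 + 6.641(92) = 201.081$$
Interpretation: When the maximum temperature is 92°F, the estimated peak power load is 201.081 megawatts.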
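As a quick check on the arithmetic, the following sketch recomputes the line from the column sums of the table and evaluates it at X = 92. Carrying b at full precision changes the intercept slightly compared with using the rounded slope, while the prediction at X = 92 agrees to three decimal places. Variable names are illustrative.

```python
n = 7
sum_x, sum_y = 640, 1381
sum_xy, sum_x2 = 128559, 58860

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n     # a = mean(Y) - b * mean(X)
y_hat = a + b * 92                # estimated load at 92 degrees F

print(f"b = {b:.4f}, a = {a:.3f}, estimated load at X = 92: {y_hat:.3f}")
```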
(ii) Coefficient of correlation
$$r = \frac{n\sum XY - \sum X\sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{7(128559) - (640)(1381)}{\sqrt{\left[7(58860) - 409600\right]\left[7(289249) - 1907161\right]}} = \frac{16073}{\sqrt{(2420)(117582)}} \approx 0.953$$
Probable error:
$$P.E. = \pm 0.6745\,\frac{1 - (0.953)^2}{\sqrt{7}} = \pm 0.6745\left(\frac{0.0918}{2.6458}\right) = \pm 0.6745(0.0347) \approx \pm 0.023$$
Comparing P.E. with r, we note that r is not only greater than P.E. but also greater than 6(P.E.), which is about 0.14. Therefore, the correlation between the two variables is strongly positive and highly significant. The limits of the correlation are 0.953 ± 0.023, that is, from 0.930 to 0.976. The value $r^2 \approx 0.908$ indicates that about 91% of the variance in Y (peak power load) is associated with variance in X (maximum temperature).
(iii)
1. $H_0: B = 0$ and $H_1: B \neq 0$
2. $\alpha = 0.05$
3. Critical region: reject $H_0$ if $t > t_{0.025,5} = 2.571$ or $t < -t_{0.025,5} = -2.571$.
4. Computation of the test statistic:
$$t = \frac{b}{S_b}$$
where $S_b = 0.946$ is the standard error of b. Thus $t = 6.641 / 0.946 = 7.02$.
5. Since 7.02 > 2.571, we reject the null hypothesis. That is, the sample evidence indicates that the peak power load tends to increase as the day's maximum temperature increases.
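(iv)
1. $H_0: \rho = 0$ and $H_1: \rho \neq 0$
2. $\alpha = 0.05$
3. Critical region: reject $H_0$ if $t > t_{0.025,5} = 2.571$ or $t < -t_{0.025,5} = -2.571$.
4. Computation of the test statistic:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
With n = 7 and the value of r obtained from the table sums, $r \approx 0.9528$, this gives $t \approx 7.02$, the same value as the slope test in part (iii).
5. Since 7.02 > 2.571, we reject the null hypothesis. The test results are statistically significant at the 5% level: the data provide sufficient evidence to conclude that the two variables are linearly correlated, and since r is positive the correlation is positive.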
