Regression and Correlation
Introduction
In many statistical investigations, the main goal
is to establish relationships that make it
possible to predict one or more variables in
terms of others. For example, studies are made to
predict the future production of, say, hard disks in
terms of their price, peak power load versus
maximum temperature, and the water level in a
particular river versus distance at 50 km
intervals. It would be ideal if we could predict one
quantity exactly in terms of another, but this is
seldom possible. The problem of predicting the
value of one variable from another is called regression.
A regression line is fitted to the data (curve fitting)
by the method of least squares, which is discussed in the
coming sections. Having learned how to fit a regression
line to paired data, we face the problem of determining how
well such a line fits the data. Of course, we can get
some idea of this from a scatter diagram that shows the line
together with the data. For example, if X represents the
amount of money spent yearly on advertising by a firm
and Y represents its total yearly sales, we might ask
ourselves whether a decrease in the advertising budget
will decrease the yearly sales. Correlation analysis
attempts to measure the strength of such relationships
between two variables by means of a single number,
called the correlation coefficient.
Scatter Diagram
Very often in practice a relationship is found to
exist between two or more variables; for example,
cost depends on sales, expenditure depends on
income, and price depends on demand.
It is often desirable to express this relationship in
mathematical form by determining an equation
connecting these variables. For this purpose we
prepare a scatter diagram.
On the scatter diagram, the original observations for
which the relationship is to be found are plotted.
This enables the investigator to obtain a visual
impression of the distribution of the values (X and
Y).
Example
The following data show the peak power
load Y and maximum temperature X for 7 days.
Day                     1    2    3    4    5    6    7
Maximum Temp (X)       95   82   90   81   99  100   93
Peak Power Load (Y)   214  152  156  129  254  266  210
[Figure: Scatter Diagram of the seven data points; horizontal axis from 0 to 120, vertical axis from 0 to 300]
These points can be approximated by a straight
line, hence we say that a linear relationship exists
between the two variables. Different individuals
may, however, draw many different lines or curves
through the same points. To avoid individual
judgement in choosing a line or curve, a
definition of the best-fitting line or curve is
needed.
The Method of Least Squares
Of all lines approximating a given set of data points,
the one having the property that
$S = D_1^2 + D_2^2 + \cdots + D_n^2$
is a minimum is called the least squares line, where
$D_i$ is the vertical deviation of the i-th point from the line. The
general form of a linear equation with one
independent variable can be written as
$Y = a + bX$
where a and b are constants, X is the independent
variable and Y is the dependent variable.
To find the constants a and b we use the normal
equations, which are given by:
$\sum Y = na + b \sum X$
$\sum XY = a \sum X + b \sum X^2$
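As a brief sketch of where the normal equations come from, assume that each deviation is the vertical distance $D_i = Y_i - (a + bX_i)$ between a data point and the line (an assumption about notation that the slides do not spell out). Setting the partial derivatives of S with respect to a and b equal to zero then gives the two equations:

```latex
\begin{align*}
S &= \sum_{i=1}^{n} (Y_i - a - bX_i)^2 \\
\frac{\partial S}{\partial a} &= -2 \sum (Y_i - a - bX_i) = 0
  \;\Rightarrow\; \sum Y = na + b \sum X \\
\frac{\partial S}{\partial b} &= -2 \sum X_i (Y_i - a - bX_i) = 0
  \;\Rightarrow\; \sum XY = a \sum X + b \sum X^2
\end{align*}
```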
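These equations are solved simultaneously to
obtain
$b = \dfrac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}$
and
$a = \dfrac{\sum Y \sum X^2 - \sum X \sum XY}{n \sum X^2 - (\sum X)^2}$
Sometimes a can also be computed by using the
formula
$a = \bar{Y} - b\bar{X}$, where $\bar{X}$ and $\bar{Y}$ are the means of the X
and Y values respectively.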
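The calculation can be sketched in a few lines of Python. This is only an illustration using the temperature and power load data from the example; the function name `least_squares_line` is hypothetical, and the slope uses the standard closed-form solution of the two normal equations.

```python
# Paired data from the peak power load example.
x = [95, 82, 90, 81, 99, 100, 93]        # maximum temperature (X)
y = [214, 152, 156, 129, 254, 266, 210]  # peak power load (Y)


def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi * xi for xi in xs)
    # Slope obtained by solving the two normal equations simultaneously.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept from a = mean(Y) - b * mean(X).
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = least_squares_line(x, y)
print(f"a = {a:.2f}, b = {b:.4f}")
```

For these data the slope evaluates to about 6.6417, which the worked solution reports as 6.641.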
Regression Line
The regression line expresses the trend between two
observed variables. On the basis of sample data,
we are interested in estimating or predicting the
value of Y corresponding to a given value of
X. The resulting line is called the regression line of
Y on X. It is the same as the least squares line
$Y = a + bX$, where a denotes the intercept and b
stands for the slope of this line.
For example, suppose we are interested in the charges
for electricity consumed. The meter rent is
fixed irrespective of the electricity consumed. Let
it be Rs. 1.50; it is denoted by a. The charge per
unit is the rate of change and is denoted by b; here it is Rs. 0.40.
The total electricity charges Y and the units
consumed X are then related by the formula
$Y = 1.50 + 0.40X$. With the help of this formula, we
can find the charge for any amount of
electricity consumed. For example, a person who consumes 85
units of electricity will pay Rs. 1.50 + 0.40(85),
or Rs. 35.50.
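The same prediction can be written as a small function. This is a minimal sketch; the name `electricity_charge` and its parameter names are illustrative only, with the intercept and slope taken from the example.

```python
def electricity_charge(units, fixed_rent=1.50, rate_per_unit=0.40):
    """Total charge Y = a + bX in rupees for the given units consumed."""
    return fixed_rent + rate_per_unit * units


# Charge for a consumer who uses 85 units.
print(f"Rs. {electricity_charge(85):.2f}")  # prints "Rs. 35.50"
```

With zero units consumed the function returns the fixed meter rent, which is exactly the role of the intercept a.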
Testing the Significance of b
While dealing with regression equations, we
have not yet considered their statistical
significance. The statistical significance of a
regression equation depends on a and b, where
a and b are estimates of the
corresponding intercept and slope
coefficients for the population from which the sample has
been drawn. Most commonly we test the
significance of b, because the intercept a is of
little interest.
The population coefficients are unknown,
and the sample coefficients represent only one
of the results that could have been obtained, depending
on the sample drawn. However, we can still use the
sample results to test the significance of
the relationship between the variables X and Y.
Test of Hypothesis Concerning Linear
Regression
Assumptions:
Step 1: State the null and alternative hypotheses.
$H_0: B = 0$, where B is the slope of the population
regression line; that is, the line is horizontal (X is not
useful for predicting Y).
The alternative hypothesis is one of:
i. $H_1: B \neq 0$, that is, the slope of the population line is
not zero.
ii. $H_1: B > 0$, that is, the population
regression line slopes upward.
iii. $H_1: B < 0$, that is, the population
regression line slopes downward.
Step 2: Decide the level of significance $\alpha$.
Step 3: Find the critical values using the t distribution with
$df = n - 2$.
i. The critical values for a two-tailed test are $\pm t_{\alpha/2}$.
ii. The critical value for a right-tailed test is $t_{\alpha}$.
iii. The critical value for a left-tailed test is $-t_{\alpha}$.
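Step 4: Compute the value of the test statistic
$t = b / S_b$, where $S_b$ stands for the standard error of
b, given by
$S_b = \dfrac{S_{yx}}{\sqrt{\sum X^2 - (\sum X)^2 / n}}$, with $S_{yx} = \sqrt{\dfrac{\sum Y^2 - a \sum Y - b \sum XY}{n - 2}}$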
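The decision rule implied by Steps 1 to 4 can be sketched as a small helper. The function name `reject_null` and the labels for the three alternatives are hypothetical; the critical value is assumed to be read from a t table with n - 2 degrees of freedom and passed in as a positive number.

```python
def reject_null(t_stat, t_crit, alternative="two-sided"):
    """Apply the decision rule for testing H0: B = 0.

    t_crit is the positive table value: t_{alpha/2} for a two-sided
    test, t_{alpha} for a one-sided test (both with n - 2 df).
    """
    if alternative == "two-sided":   # H1: B != 0
        return abs(t_stat) > t_crit
    if alternative == "greater":     # H1: B > 0
        return t_stat > t_crit
    if alternative == "less":        # H1: B < 0
        return t_stat < -t_crit
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")


# Illustrative call: t = 7.02 against the table value t_{0.025, 5} = 2.571.
print(reject_null(7.02, 2.571))  # prints True
```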
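Maximum Temp (X)       95   82   90   81   99  100   93
Peak Power Load (Y)   214  152  156  129  254  266  210
$b = \dfrac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \dfrac{7(128559) - (640)(1381)}{7(58860) - (640)^2} = 6.641$
and $a = \bar{Y} - b\bar{X}$.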
Solution
$r = \dfrac{n \sum XY - \sum X \sum Y}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}}$
$= \dfrac{7(128559) - (640)(1381)}{\sqrt{[7(58860) - 409600][7(289249) - 1907161]}} = \dfrac{16073}{\sqrt{(2420)(117582)}}$
$r = 0.953$
The probable error of r is
$P.E. = \pm 0.6745 (1 - r^2) / \sqrt{n}$
$= \pm 0.6745 (1 - (0.953)^2) / \sqrt{7}$
$= \pm 0.6745 (1 - 0.9082) / 2.646$
$= \pm 0.6745 (0.0347)$
$P.E. = \pm 0.023$
Comparing P.E. with r, we note that r is not only
greater than P.E. but also greater than 6(P.E.) = 0.138.
Therefore, the correlation between the two variables
is highly positive and highly significant.
The limits of correlation are 0.953 ± 0.023, that is,
from 0.930 to 0.976. The value of $r^2 = 0.908$
indicates that about 91% of the variance in Y (peak
power load) is associated with variance in X
(maximum temperature).
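As a cross-check, the correlation coefficient and its probable error can be recomputed directly from the seven data pairs. This is a minimal sketch; the function names `correlation` and `probable_error` are illustrative, and the probable error uses the formula 0.6745(1 - r^2)/sqrt(n) shown in the solution.

```python
import math

x = [95, 82, 90, 81, 99, 100, 93]        # maximum temperature (X)
y = [214, 152, 156, 129, 254, 266, 210]  # peak power load (Y)


def correlation(xs, ys):
    """Product-moment correlation coefficient from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi * xi for xi in xs)
    sum_y2 = sum(yi * yi for yi in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


def probable_error(r, n):
    """Probable error of r: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


r = correlation(x, y)
pe = probable_error(r, len(x))
print(f"r = {r:.3f}, P.E. = {pe:.3f}, r > 6 P.E.: {r > 6 * pe}")
```

Working from the raw sums mirrors the hand calculation, so each intermediate quantity can be compared with the figures in the solution.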
(iii) 1. $H_0: B = 0$
$H_1: B \neq 0$
2. $\alpha = 0.05$
3. Critical region: reject $H_0$ if
$t > t_{0.025,5} = 2.571$ or $t < -t_{0.025,5} = -2.571$
4. Computation of test statistic
$t = b / S_b$
where $S_b = 0.946$ is the standard error of b.
Thus, $t = 6.641 / 0.946 = 7.02$
5. Since $t = 7.02 > 2.571$, we reject the null hypothesis. That is, the sample evidence
indicates that the peak power load tends to increase as a
day’s temperature increases.
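The test statistic in step 4 can be reproduced from the raw data. The sketch below assumes the usual standard error of the slope, $S_b = S_{yx} / \sqrt{\sum (X - \bar{X})^2}$ with $S_{yx}^2 = \sum (Y - \hat{Y})^2 / (n - 2)$; the function name `slope_t_statistic` is hypothetical, and the critical value 2.571 is the table value quoted in step 3.

```python
import math

x = [95, 82, 90, 81, 99, 100, 93]        # maximum temperature (X)
y = [214, 152, 156, 129, 254, 266, 210]  # peak power load (Y)


def slope_t_statistic(xs, ys):
    """Return (b, s_b, t) for testing H0: B = 0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    b = sxy / sxx                      # least squares slope
    a = mean_y - b * mean_x            # least squares intercept
    # Sum of squared residuals about the fitted line.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))
    s_yx = math.sqrt(sse / (n - 2))    # standard error of estimate
    s_b = s_yx / math.sqrt(sxx)        # standard error of the slope
    return b, s_b, b / s_b


b, s_b, t = slope_t_statistic(x, y)
T_CRITICAL = 2.571                     # t_{0.025, 5} from the t table
reject_h0 = abs(t) > T_CRITICAL
print(f"b = {b:.3f}, S_b = {s_b:.3f}, t = {t:.2f}, reject H0: {reject_h0}")
```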
(iv)
1. $H_0: \rho = 0$
$H_1: \rho \neq 0$
2. $\alpha = 0.05$
3. Critical region: reject $H_0$ if
$t > t_{0.025,5} = 2.571$ or $t < -t_{0.025,5} = -2.571$
4. Computation of test statistic
Solution