ECN 652 Handout 9 Student


ECN 652: Quantitative Economics

Correlation and Regression Analysis
Handout # 9
Dr. T. Vinayagathasan
Department of Economics & Statistics
Faculty of Arts
University of Peradeniya

Dr. T. Vinayagathasan, Dept. of Econ. & Statistics, Univ. of Peradeniya 1


Correlation Analysis



Correlation
• Correlation is a statistical technique that is used to measure and
describe a relationship between two variables (X and Y).
• That is, finding the relationship between two quantitative
variables without being able to infer causal relationships.
• 3 Characteristics:
1. The direction of the relationship
2. The Form of the Relationship
3. The Degree of the Relationship



1. The direction of the relationship
i. Positive correlation ( +)
ii. Negative Correlation (-)
iii. No correlation (0)
iv. Non-linear Correlation
• A scatter plot can be used to determine the direction of
correlation between two variables.

Example: [scatter plot with x-axis from 2 to 6 and y-axis from −4 to 2]
Direction of the Relationships
[Four scatter plots:]
• Negative linear correlation: as x increases, y tends to decrease.
• Positive linear correlation: as x increases, y tends to increase.
• No correlation.
• Nonlinear correlation.
2. The Form of the Relationship
• Relationships tend to have a linear form: a line can be drawn
through the middle of the data points in each figure.
• The most common use of regression is to measure straight-line
relationships.
• This is not always the case.
• Example:

  Weight (Kg)   67   69   85   83   74   81   97   92   114   85
  SBP (mmHg)   120  125  140  160  130  180  150  140  200  130


2. The Form of the Relationship
[Scatter plot: weight (kg) on the horizontal axis from 60 to 120, SBP (mmHg) on the vertical axis from 80 to 220]

Scatter diagram of weight and systolic blood pressure
3. The Degree of the Relationship

• Measures how well the data fit the specific form being
considered.
• The degree of relationship is measured by the absolute
numerical value of the correlation (0 to 1.00).
– A perfect correlation is always identified by a correlation of
1.00 and indicates a perfect fit.
– A correlation value of 0 indicates no fit or relationship at all.


Correlation Coefficient
• Correlation coefficient: a statistic that shows the degree of
relation between two variables.
• Correlation ≠ Causation
• In order to infer causality: manipulate the independent
variable and observe the effect on the dependent variable.


Simple Correlation coefficient (r)
• It is also called Pearson's correlation or product moment
correlation coefficient.
• It measures the nature and strength between two
variables of the quantitative type.
• The sign of r denotes the nature of the association, while the
value of r denotes the strength of the association.
• If the sign is positive (+), the relation is direct: an increase in
one variable is associated with an increase in the other, and a
decrease in one variable is associated with a decrease in the
other.
• If the sign is negative (−), the relation is inverse or indirect: an
increase in one variable is associated with a decrease in the
other.
Simple Correlation Coefficient (r)..
• The value of r ranges between −1 and +1 (−1 ≤ r ≤ 1).
• The value of r denotes the strength of the association, as
illustrated by the following diagram:

  strong    intermediate    weak  |  weak    intermediate    strong
  -1 ------ -0.75 ------ -0.25 ------ 0 ------ 0.25 ------ 0.75 ------ 1
            indirect              |              direct
  perfect indirect correlation  |  no relation  |  perfect direct correlation



Variance vs Covariance
• First, a note on your sample:
– If you assume that your sample is representative of the
general population (random-effects model), use the degrees
of freedom (n − 1) in your calculations of variance or
covariance.
– But if you simply want to describe your current sample
(fixed-effects model), substitute n for the degrees of
freedom.
• Variance: gives information on the variability of a single variable:

  Sx² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / n
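As a quick check, the variance formula above can be sketched in a few lines of Python; the data here are made up purely for illustration:

```python
# A minimal sketch of the population-variance formula (fixed-effects case: divide by n).
x = [0, 2, 3, 4, 6]                                 # hypothetical data
n = len(x)
x_bar = sum(x) / n                                  # sample mean
var_x = sum((xi - x_bar) ** 2 for xi in x) / n      # Sx^2
```

Substituting n − 1 for n in the last line gives the random-effects (sample) variance instead.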


Variance vs Covariance
Covariance:
• Gives information on the degree to which two variables vary
together.
• Note how similar the covariance is to variance: the equation
simply multiplies x’s error scores by y’s error scores as opposed
to squaring x’s error scores.
  cov(X, Y) = SXY = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / n

– When X ↑ and Y ↑ : cov(X, Y) is positive.
– When X ↑ and Y ↓ : cov(X, Y) is negative.
– When there is no consistent relationship: cov(X, Y) = 0.


Variance vs Covariance
Covariance: Example
  Xᵢ   Yᵢ   Xᵢ − X̄   Yᵢ − Ȳ   (Xᵢ − X̄)(Yᵢ − Ȳ)
  0    3    −3        0         0
  2    2    −1       −1         1
  3    4     0        1         0
  4    0     1       −3        −3
  6    6     3        3         9
  X̄ = 3   Ȳ = 3               Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) = 7

  SXY = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1) = 7 / 4 = 1.75 → what does this number tell us?
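The worked example above can be reproduced directly in Python (note the n − 1 divisor, matching the table's 7/4 = 1.75):

```python
# Reproduces the covariance example: 5 pairs, sum of cross-products = 7, S_XY = 1.75.
X = [0, 2, 3, 4, 6]
Y = [3, 2, 4, 0, 6]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n               # both means equal 3
cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))  # = 7
cov_xy = cross / (n - 1)                            # 7 / 4 = 1.75
```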
Variance vs Covariance

Problem with Covariance


• The value obtained by covariance is dependent on the size
of the data’s standard deviations: if large, the value will be
greater than if small… even if the relationship between x
and y is exactly the same in the large versus small standard
deviation datasets.



Example of how covariance value relies on variance
High variance data                        Low variance data
Subject   x     y    x error × y error    x    y    x error × y error
1         101   100  2500                 54   53   9
2         81    80   900                  53   52   4
3         61    60   100                  52   51   1
4         51    50   0                    51   50   0
5         41    40   100                  50   49   1
6         21    20   900                  49   48   4
7         1     0    2500                 48   47   9
Mean      51    50                        51   50
Sum of x error × y error: 7000            Sum of x error × y error: 28
Covariance: 1166.67                       Covariance: 4.67


How to compute the simple correlation coefficient (r)/
How do we calculate the Pearson correlation?
Definitional formula (deviation method):

  rXY = [Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / n] / [√(Σᵢ₌₁ⁿ (Xᵢ − X̄)² / n) × √(Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² / n)]
      = cov(X, Y) / (SX SY)

where SX is the standard deviation of X and SY is the standard deviation of Y.

Computational formula:

  rXY = [Σᵢ₌₁ⁿ XᵢYᵢ − (Σᵢ₌₁ⁿ Xᵢ)(Σᵢ₌₁ⁿ Yᵢ)/n] / [√(Σᵢ₌₁ⁿ Xᵢ² − (Σᵢ₌₁ⁿ Xᵢ)²/n) × √(Σᵢ₌₁ⁿ Yᵢ² − (Σᵢ₌₁ⁿ Yᵢ)²/n)]
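The computational formula can be sketched as a small Python function. The usage line reuses the data from the earlier covariance example, purely for illustration:

```python
import math

def pearson_r(X, Y):
    # Computational formula: subtract (sum X)(sum Y)/n from the cross-product sum,
    # and the squared sums over n from the sums of squares.
    n = len(X)
    sx, sy = sum(X), sum(Y)
    sxy = sum(x * y for x, y in zip(X, Y))
    sxx = sum(x * x for x in X)
    syy = sum(y * y for y in Y)
    num = sxy - sx * sy / n
    den = math.sqrt(sxx - sx ** 2 / n) * math.sqrt(syy - sy ** 2 / n)
    return num / den

r = pearson_r([0, 2, 3, 4, 6], [3, 2, 4, 0, 6])
```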


Example:
• A sample of 6 children was selected; data about their age in
years and weight in kilograms were recorded as shown in the
following table. It is required to find the correlation between
age and weight.

  Serial No   Age (years)   Weight (Kg)
  1           7             12
  2           6             8
  3           8             12
  4           5             10
  5           6             11
  6           9             13

These two variables are of the quantitative type: one variable
(age) is called the independent variable and denoted X, and the
other (weight) is called the dependent variable and denoted Y.
To find the relation between age and weight, compute the
simple correlation coefficient using the following formula:
• Example:

  No      Age (X)   Weight (Y)   XY      X²      Y²
  1       7         12
  2       6         8
  3       8         12
  4       5         10
  5       6         11
  6       9         13
  Total   ΣX = 41   ΣY = 66      ΣXY =   ΣX² =   ΣY² =


Example:

  rXY = [ΣXᵢYᵢ − (ΣXᵢ)(ΣYᵢ)/n] / [√(ΣXᵢ² − (ΣXᵢ)²/n) × √(ΣYᵢ² − (ΣYᵢ)²/n)]

  rXY =

  r = 0.759
  strong direct correlation


Example: Relationship between Anxiety and Test Scores
Anxiety (X) Test score (Y) X2 Y2 XY
10 2
8 3
2 9
1 7
5 6
6 5
∑X = 32 ∑Y = 32 ∑X2 = ∑Y2 = ∑XY=



Calculating the Correlation Coefficient

  r = [(6)(129) − (32)(32)] / [√(6(230) − 32²) × √(6(204) − 32²)]
    = (774 − 1024) / √((356)(200))
    ≈ −0.94

Or:

  rXY = (129 − (32 × 32)/6) / [√(230 − 32²/6) × √(204 − 32²/6)]
      = −41.6667 / (7.702813 × 5.773503)
      = −0.93691

  r ≈ −0.94
  Indirect strong correlation
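The computation above can be verified in a few lines, using the same anxiety/test-score data:

```python
import math

# Verifies the anxiety/test-score example with the computational formula.
X = [10, 8, 2, 1, 5, 6]   # anxiety
Y = [2, 3, 9, 7, 6, 5]    # test score
n = len(X)
num = n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)   # 774 - 1024 = -250
den = math.sqrt(n * sum(x * x for x in X) - sum(X) ** 2) \
    * math.sqrt(n * sum(y * y for y in Y) - sum(Y) ** 2)       # sqrt(356) * sqrt(200)
r = num / den                                                  # ≈ -0.94
```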


Pearson's r continued…
Standardized Formula

  cov(x, y) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

  rxy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / [(n − 1) sx sy] = Σᵢ₌₁ⁿ zxᵢ × zyᵢ / (n − 1)
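A sketch of the standardized formula at work: convert each variable to z-scores, then average the products with an n − 1 divisor. The data are hypothetical (reusing the earlier covariance example's values):

```python
import math

# r as the average product of z-scores, with n - 1 in the denominator.
x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))   # sample std of x
sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))   # sample std of y
zx = [(xi - mx) / sx for xi in x]
zy = [(yi - my) / sy for yi in y]
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
```

This gives the same value as the computational formula on the same data, as it must.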


Using and Interpreting r
Uses of r
• Prediction
• Validity
• Reliability
• Theory Verification
Interpretation of r
• r2 measures the proportion of variability in one variable
that can be determined from the relationship with the
other variable.
• A correlation of r = .80 means that r2 = .64 or 64% of the
variability in Y scores can be predicted from the
relationship with X.

• "Correlation does not mean Causation"
Outliers
• An individual with X and/or Y values that are substantially
different (larger or smaller) from the values obtained for
the other individuals in the data set.
• An outlier can dramatically influence the value obtained
for the correlation.
• Always look at scatter plots to determine if there are
outliers.



Testing a Population Correlation Coefficient
• Once the sample correlation coefficient r has been
calculated, we need to determine whether there is enough
evidence to decide that the population correlation
coefficient ρ is significant at a specified level of significance.
• One way to determine this is to use the table of critical values
given below.
• If |r| is greater than the critical value, there is enough
evidence to decide that the correlation coefficient ρ is
significant.

  n    α = 0.05   α = 0.01
  4    0.950      0.990
  5    0.878      0.959
  6    0.811      0.917
  7    0.754      0.875

For a sample of size n = 6, ρ is significant at the 5%
significance level if |r| > 0.811.
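The table-lookup rule can be sketched as a small helper; CRITICAL_R below holds only the α = 0.05 column excerpted above:

```python
# Excerpt of the alpha = 0.05 critical values from the table above, keyed by n.
CRITICAL_R = {4: 0.950, 5: 0.878, 6: 0.811, 7: 0.754}

def is_significant(r, n):
    # rho is judged significant when |r| exceeds the tabled critical value.
    return abs(r) > CRITICAL_R[n]
```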


Testing a Population Correlation Coefficient
Finding the Correlation Coefficient ρ

In Words / In Symbols
1. Determine the number of pairs of data in the sample. — Determine n.
2. Specify the level of significance. — Identify α.
3. Find the critical value. — Use the table given above.
4. Decide if the correlation is significant. — If |r| > critical value,
   the correlation is significant; otherwise, there is not enough
   evidence to support that the correlation is significant.
5. Interpret the decision in the context of the original claim.


Testing a Population Correlation Coefficient
Example: The following data represents the number of hours
12 different students watched television during the weekend
and the scores of each student who took a test the following
Monday.
The correlation coefficient r ≈ −0.831.

  Hours, x       0   1   2   3   3   5   5   5   6   7   7   10
  Test score, y  96  85  82  74  95  68  76  84  58  65  75  50

Is the correlation coefficient significant at α = 0.01?


Testing a Population Correlation Coefficient
Example continued:
r ≈ −0.831, n = 12, α = 0.01

Table 1: Critical values
  n     α = 0.05   α = 0.01
  4     0.950      0.990
  5     0.878      0.959
  6     0.811      0.917
  10    0.632      0.765
  11    0.602      0.735
  12    0.576      0.708
  13    0.553      0.684

Because |r| = 0.831 > 0.708, the population correlation is significant:
there is enough evidence at the 1% level of significance to conclude
that there is a significant linear correlation between the number of
hours of television watched during the weekend and the scores of
each student who took a test the following Monday.
Hypothesis Testing for ρ
• A hypothesis test can also be used to determine whether
the sample correlation coefficient r provides enough
evidence to conclude that the population correlation
coefficient ρ is significant at a specified level of significance.
• A hypothesis test can be one-tailed or two-tailed.

  Left-tailed test:   H0: ρ ≥ 0 (no significant negative correlation)
                      Ha: ρ < 0 (significant negative correlation)

  Right-tailed test:  H0: ρ ≤ 0 (no significant positive correlation)
                      Ha: ρ > 0 (significant positive correlation)

  Two-tailed test:    H0: ρ = 0 (no significant correlation)
                      Ha: ρ ≠ 0 (significant correlation)


Hypothesis Testing for ρ
The t-Test for the Correlation Coefficient
• A t-test can be used to test whether the correlation
between two variables is significant. The test statistic is r,
and the standardized test statistic

  t = r / σr = r / √[(1 − r²) / (n − 2)]

follows a t-distribution with n − 2 degrees of freedom.
• In this text, only two-tailed hypothesis tests for ρ are
considered.
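The standardized test statistic can be written as a one-line helper. The usage line anticipates the worked TV-hours example later in the handout (r ≈ −0.831, n = 12):

```python
import math

def t_stat(r, n):
    # t = r / sqrt((1 - r^2) / (n - 2)), with n - 2 degrees of freedom.
    return r / math.sqrt((1 - r ** 2) / (n - 2))

t = t_stat(-0.831, 12)   # ≈ -4.72
```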


Hypothesis Testing for ρ
Using the t-Test for the Correlation Coefficient ρ
In Words / In Symbols
1. State the null and alternative hypotheses. — State H0 and Ha.
2. Specify the level of significance. — Identify α.
3. Identify the degrees of freedom. — d.f. = n − 2
4. Determine the critical value(s) and rejection region(s). — Use
   Table 5 in Appendix B.


Hypothesis Testing for ρ
Using the t-Test for the Correlation Coefficient ρ
In Words / In Symbols
5. Find the standardized test statistic. — t = r / √[(1 − r²) / (n − 2)]
6. Make a decision to reject or fail to reject the null hypothesis. —
   If t is in the rejection region, reject H0; otherwise, fail to
   reject H0.
7. Interpret the decision in the context of the original claim.


Hypothesis Testing for ρ
Example: The following data represent the number of hours
12 different students watched television during the weekend
and the scores of each student who took a test the following
Monday.

The correlation coefficient r ≈ −0.831.

  Hours, x       0   1   2   3   3   5   5   5   6   7   7   10
  Test score, y  96  85  82  74  95  68  76  84  58  65  75  50

Test the significance of this correlation coefficient at α = 0.01.


Hypothesis Testing for ρ
Example continued:
H0: ρ = 0 (no correlation); Ha: ρ ≠ 0 (significant correlation)
The level of significance is α = 0.01.
Degrees of freedom are d.f. = 12 − 2 = 10.
The critical values are t0 = −3.169 and t0 = 3.169.
The standardized test statistic is

  t = r / √[(1 − r²) / (n − 2)] = −0.831 / √[(1 − (−0.831)²) / (12 − 2)] ≈ −4.72.

The test statistic falls in the rejection region (t < −3.169), so H0
is rejected.
At the 1% level of significance, there is enough evidence to conclude that
there is a significant linear correlation between the number of hours of TV
watched over the weekend and the test scores on Monday morning.
Correlation and Causation
• The fact that two variables are strongly correlated does
not in itself imply a cause-and-effect relationship between
the variables.
• If there is a significant correlation between two variables,
you should consider the following possibilities.
1. Is there a direct cause-and-effect relationship between the
variables? : Does x cause y?
2. Is there a reverse cause-and-effect relationship between the
variables?: Does y cause x?
3. Is it possible that the relationship between the variables can
be caused by a third variable or by a combination of several
other variables?
4. Is it possible that the relationship between two variables
may be a coincidence?
Other Correlation Coefficients
• Spearman rank correlation (𝑟s )
– Two ranked (ordinal) variables
• Point-biserial r
– Pearson r between dichotomous and continuous
variable
• Phi Coefficient
– Pearson r between two dichotomous variables



Spearman Rank Correlation Coefficient (rs)
• It is a non-parametric measure of correlation.
• This procedure makes use of the two sets of ranks that
may be assigned to the sample values of X and Y.
• Used for non-linear relationships
• Ordinal (ranked) Data
• Can be used as an alternative to the Pearson
• Measure of consistency
• Spearman Rank correlation coefficient could be computed
in the following cases:
 Both variables are quantitative.
 Both variables are qualitative ordinal.
 One variable is quantitative and the other is qualitative
ordinal.
rs Estimation Procedure:
1. Rank the values of X from 1 to n, where n is the number of
pairs of values of X and Y in the sample.
2. Rank the values of Y from 1 to n.
3. Compute the value of dᵢ for each pair of observations by
subtracting the rank of Yᵢ from the rank of Xᵢ.
4. Square each dᵢ and compute Σdᵢ², the sum of the
squared values.
5. Apply the following formula:

  rs = 1 − 6 Σdᵢ² / [n(n² − 1)]

where dᵢ = rank of Xᵢ − rank of Yᵢ.
The value of rs denotes the magnitude and nature of the association,
with the same interpretation as the simple r.
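The five steps above can be sketched as follows. The rank_with_ties helper is an assumed implementation of the averaging-of-ties convention used later in the handout; note that when ties are present, the 6Σd² formula is only an approximation:

```python
def rank_with_ties(values):
    # Average-rank method: tied values share the mean of the positions they occupy.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # 0-based positions i..j -> average 1-based rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rs(X, Y):
    # Steps 1-5: rank both variables, take d_i = rank(X_i) - rank(Y_i), apply the formula.
    rx, ry = rank_with_ties(X), rank_with_ties(Y)
    n = len(X)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```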
Example
• In a study of the relationship between level of education and
income, the following data were obtained. Find the
relationship between them and comment.

  Sample   Level of education (X)   Income (Y)
  A        Preparatory              25
  B        Primary                  10
  C        University               8
  D        Secondary                10
  E        Secondary                15
  F        Illiterate               50
  G        University               60


Answer:
       (X)          (Y)   Rank (X)   Rank (Y)   dᵢ      dᵢ²
  A    Preparatory  25    5          3           2      4
  B    Primary      10    6          5.5         0.5    0.25
  C    University   8     1.5        7          −5.5    30.25
  D    Secondary    10    3.5        5.5        −2      4
  E    Secondary    15    3.5        4          −0.5    0.25
  F    Illiterate   50    7          2           5      25
  G    University   60    1.5        1           0.5    0.25
                                                Σdᵢ² = 64

  rs = 1 − 6 Σdᵢ² / [n(n² − 1)] = 1 − (6 × 64) / [7(49 − 1)] = 1 − 384/336 ≈ −0.14

• There is an indirect weak correlation between level of education
and income.
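The final step of the answer can be checked numerically; the dᵢ² values come straight from the table above:

```python
# Sum the d_i^2 column and apply the Spearman formula with n = 7.
d_squared = [4, 0.25, 30.25, 4, 0.25, 25, 0.25]
n = 7
rs = 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))   # 1 - 384/336 ≈ -0.14
```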
Hypothesis Testing of Spearman Rank Correlation
– The hypotheses are:
  H0: ρs = 0
  H1: ρs ≠ 0
– The test statistic is rs = cov(a, b) / (sa sb), where a and b are
the ranks of the data.
– For a large sample (n > 30), rs is approximately normally
distributed:

  z = rs √(n − 1)
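The large-sample approximation is a one-liner; the usage values below are hypothetical, since the upcoming example has only n = 20 (≤ 30):

```python
import math

def z_stat(rs, n):
    # Normal approximation for the Spearman statistic, valid for n > 30.
    return rs * math.sqrt(n - 1)

z = z_stat(0.3, 50)   # hypothetical rs and n
```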


Example:
– A production manager wants to examine the relationship
between aptitude test scores given prior to hiring and
performance ratings three months after starting work.
– A random sample of 20 production workers was selected. The
test scores as well as the performance ratings were recorded.

  Employee   Aptitude test   Performance rating
  1          59              3
  2          47              2
  3          58              4
  4          66              3
  5          77              2
  .          .               .

Aptitude test scores range from 0 to 100; performance ratings
range from 1 to 5.
Solution
• The problem objective is to analyze the relationship between
two variables.
• The hypotheses are: H0: ρs = 0
H1: ρs ≠ 0
• The test statistic is rs, and the rejection region is |rs| > rcritical
(taken from the Spearman rank correlation table).

  Employee   Aptitude test   Rank (a)   Performance rating   Rank (b)
  1          59              9          3                    10.5
  2          47              3          2                    3.5
  3          58              8          4                    17
  4          66              14         3                    10.5
  5          77              20         2                    3.5
  .          .               .          .                    .

Ties are broken by averaging the ranks.
Solving by hand
• Rank each variable separately.
• Calculate sa = 5.92, sb = 5.50, cov(a, b) = 12.34.
• Thus rs = cov(a, b) / (sa sb) = 0.379.
• The critical value for α = 0.05 and n = 20 is 0.450.

Conclusion:
• Do not reject the null hypothesis. At the 5% significance
level there is insufficient evidence to infer that the two
variables are related to one another.
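The by-hand numbers can be checked directly:

```python
# r_s = cov(a, b) / (s_a * s_b), compared against the tabled critical value.
s_a, s_b, cov_ab = 5.92, 5.50, 12.34
rs = cov_ab / (s_a * s_b)           # ≈ 0.379
critical = 0.450                    # alpha = 0.05, n = 20
reject_h0 = abs(rs) > critical      # False -> do not reject H0
```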
