7d_Correlation With % Correlation
John Arudo
Defining Correlation
http://www.youtube.com/watch?v=ahp7QhbB8G4
Correlation Coefficient
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
Strong relationships Weak relationships
Linear Correlation
No relationship
Positive Correlation
[Figure: scatterplot; y-axis: "live births" (0 to 200), x-axis: "DPT Immunization Percentage" (20 to 100)]
Suppose we wished to graph the relationship between foot length
and height of 20 subjects.
[Figure: scatterplot axes; y-axis: Height (58 to 74), x-axis: Foot Length (4 to 14)]
1. Find 12 inches on the x-axis.
2. Find 70 inches on the y-axis.
3. Locate the intersection of 12 and 70.
4. Place a dot at the intersection of 12 and 70.
[Figure: scatterplot axes; y-axis: Height (58 to 74), x-axis: Foot Length (4 to 14)]
5. Find 8 inches on the x-axis.
6. Find 62 inches on the y-axis.
7. Locate the intersection of 8 and 62.
8. Place a dot at the intersection of 8 and 62.
9. Continue to plot points for each pair of scores.
Notice how the scores cluster to form a pattern.
The more closely they cluster to a line that is drawn through them,
the stronger the linear relationship between the two
variables is (in this case foot length and height).
If the points on the scatterplot have an upward movement
from left to right, we say the relationship between the
variables is positive.
If the points on the scatterplot have a downward movement
from left to right, we say the relationship between the
variables is negative.
A positive relationship means that high scores on one variable
are associated with high scores on the other variable.
A negative relationship means that high scores on one variable
are associated with low scores on the other variable.
Not only do relationships have direction (positive and negative), they
also have strength (from 0.00 to 1.00 and from 0.00 to –1.00).
[Figure: scale of correlation strength, from r = 0.00 to r = 1.00 in steps of 0.10]
A set of scores with r = –0.60 has the same strength as a set of
scores with r = 0.60 because both sets cluster similarly.
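The split between direction and strength can be sketched in a few lines of Python. This is only an illustration: the function name `describe` is hypothetical and not part of the slides. The sign of r gives the direction and its absolute value gives the strength.

```python
def describe(r):
    """Split a correlation coefficient into (direction, strength)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    # Strength ignores the sign: -0.60 and 0.60 are equally strong.
    return direction, abs(r)


print(describe(-0.60))
print(describe(0.60))
```

Both calls report the same strength; only the direction differs.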
For this procedure, we use Pearson’s r. This statistical procedure
can only be used when BOTH variables are measured on a
continuous scale and you wish to measure a linear relationship.
Linear Relationship: Pearson r
Curvilinear Relationship: NO Pearson r
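A small sketch of why Pearson's r is not suitable for a curvilinear relationship. The data are made up for illustration (y is exactly x squared), and the helper name `pearson_r` is an assumption, not something from the slides. Even though y is completely determined by x, r comes out as zero because the relationship is not linear.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative (made-up) curvilinear data: a perfect U shape.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

r_curve = pearson_r(xs, ys)
print(r_curve)
```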
Correlation
Measure of linear relationship
between two continuous random
variables
It measures the closeness of the
association
The table in the next slide shows the
body weight and plasma volume of
eight healthy men
Correlation - rho
Correlation measures how close the
observations are to the straight line
that best describes their linear
relationship by
– Calculating the Pearson product
moment correlation coefficient
– This is called the correlation
coefficient (r for a sample; rho, ρ, for the population)
Explanation on scatter diagrams
r is always a number between –1 and +1
r = 0 if the variables are not linearly associated
It is positive if x and y tend to be high or
low together
The larger the absolute value of r, the closer the
association
The maximum value of 1 is obtained if the
relationship is perfect, i.e. the points lie exactly on a
straight line
r is negative if high values of y tend to
go with low values of x, and vice versa
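The two extreme cases can be checked with a short sketch. The data below are made up for illustration, and `pearson_r` is a hypothetical helper name. Points lying exactly on a rising straight line give r = +1, and points lying exactly on a falling line give r = –1.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4]
rising = [2, 4, 6, 8]    # y = 2x, exactly on a rising line
falling = [8, 6, 4, 2]   # y = 10 - 2x, exactly on a falling line

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
print(r_rising, r_falling)
```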
Plasma volume and body weight in
8 healthy men
Subject   Body wt (kg)   Plasma vol (l)
1         58.0           2.75
2         70.0           2.86
3         74.0           3.37
4         63.5           2.76
5         62.0           2.62
6         70.5           3.49
7         71.0           3.05
8         66.0           3.12
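As a rough check, the correlation for the eight men in the table can be computed directly from the deviations about the means. The variable and function names are illustrative only. For these data r comes out at about 0.76.

```python
import math

# Body weight (kg) and plasma volume (litres) of the eight men in the table.
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r_eight = pearson_r(weight, plasma)
print(round(r_eight, 2))
```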
Scatter diagram of the body wt and
plasma volume relationship
[Figure: y-axis: plasma volume (litres), 2.25 to 3.25; x-axis: body weight (kg), 55 to 75]
Plasma volume and body weight in
5 healthy men
subject Body wt (kg) Plasma vol (l)
1 58.0 2.7
2 70.0 2.9
3 74.0 3.4
4 63.5 2.8
5 62.0 2.6
Correlation
The scatter diagram shows that
high plasma volume tends to be
associated with high body weight,
and vice versa
This association is measured by the
correlation coefficient, r:
\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \]
where x denotes weight, y denotes plasma volume, and \bar{x} and \bar{y} are the
corresponding means
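Formula for r
\[ r = \frac{\sum [(X - M_X)(Y - M_Y)]}{\sqrt{(SS_X)(SS_Y)}} \]
where M_X and M_Y are the means of X and Y, and SS_X and SS_Y are the sums of squared deviations from those means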
Correlation Coefficient:
\[ r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{[N \sum X^2 - (\sum X)^2][N \sum Y^2 - (\sum Y)^2]}} \]
where
N = Number of values or elements
X = First Score
Y = Second Score
ΣXY = Sum of the products of First and
Second Scores
ΣX = Sum of First Scores
ΣY = Sum of Second Scores
ΣX² = Sum of squared First Scores
ΣY² = Sum of squared Second Scores
N    X (kg)   Y (plasma, l)   XY       X²      Y²
1    58       2.7             156.6    3364    7.29
2    70       2.9             203      4900    8.41
3    74       3.4             251.6    5476    11.56
4    63       2.8             176.4    3969    7.84
5    62       2.6             161.2    3844    6.76
ΣX=327   ΣY=14.4   ΣXY=948.8   ΣX²=21553   ΣY²=41.86
(Subject 4's body weight, 63.5 kg, is entered as 63 in this worked table.)
Correlation(r) = 35.2/Sqrt((836)(1.94))
Correlation(r) = 35.2/40.27
Correlation(r) = 0.87
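The arithmetic of the worked table can be retraced with a minimal sketch of the computational formula. Variable names are illustrative, and the weights are entered exactly as in the worked table (63 for subject 4). The intermediate values 35.2, 836 and 1.94 and the final r of about 0.87 should all be reproduced.

```python
import math

# X (kg) and Y (plasma volume) as entered in the worked table.
x = [58, 70, 74, 63, 62]
y = [2.7, 2.9, 3.4, 2.8, 2.6]
n = len(x)

sum_x = sum(x)                                # 327
sum_y = sum(y)                                # 14.4
sum_xy = sum(a * b for a, b in zip(x, y))     # 948.8
sum_x2 = sum(a * a for a in x)                # 21553
sum_y2 = sum(b * b for b in y)                # 41.86

numerator = n * sum_xy - sum_x * sum_y        # about 35.2
x_term = n * sum_x2 - sum_x ** 2              # 836
y_term = n * sum_y2 - sum_y ** 2              # about 1.94

r = numerator / math.sqrt(x_term * y_term)
print(round(r, 2))
```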
Plasma volume and body weight in
5 healthy men
subject Body wt Plasma vol
(kg) (l)
1 58.0 2.7
2 70.0 2.9
3 74.0 3.4
4 63.5 2.8
5 62.0 2.6
Correlation(r) = 0.87
H0: There is no correlation between the weight of the broiler and its price
HA: There is a correlation between the weight of the broiler and its price
x y
60 3.1
61 3.6
62 3.8
63 4
65 4.1
Example row (x = 63, y = 4): xy = 63 * 4 = 252, x^2 = 63 * 63 = 3969, y^2 = 4 * 4 = 16
r = ((5)*(1159.7)-(311)*(18.6))/sqrt([(5)*(19359)-(311)^2]*[(5)*(69.82)-(18.6)^2])
= 13.9/sqrt(74*3.14)
= 13.9/sqrt(232.36)
= 13.9/15.24336
r = 0.9119
Significance of the Correlation
Coefficient
Test for the significance of relationships
between two CONTINUOUS variables
We introduced Pearson correlation as a
measure of the STRENGTH of a
relationship between two variables
But any relationship should be assessed for its
SIGNIFICANCE as well as its strength.
Significance of the Correlation
Coefficient
Factors in relationships between two variables
– The strength of the relationship:
is indicated by the correlation coefficient: r
but is actually measured by the coefficient of
determination: r²
– The significance of the relationship
is expressed in probability levels: p (e.g., significant at p = .05)
This tells how unlikely it is that a correlation
coefficient as large as the observed r would occur if
there were no relationship in the population
– NOTE! NOTE! NOTE! The smaller the p-level, the more
significant the relationship
– BUT! BUT! BUT! The larger the correlation, the
stronger the relationship
Coefficient of determination (r²)
The coefficient of determination is a statistical
measurement that examines how much of the variation in
one variable can be explained by the variation in
a second variable, when predicting the outcome
of a given event.
r² assesses how strong the linear relationship is
between two variables, and is heavily relied on by
researchers when conducting trend analysis.
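As a brief illustration, r² can be obtained by squaring r. The sketch below assumes the rounded value r = 0.87 reported in the worked plasma volume example. The reading as "share of variation explained" is the usual one for a simple linear relationship.

```python
# r as reported (rounded) in the worked plasma volume example.
r = 0.87

r_squared = r ** 2
percent_explained = 100 * r_squared

# Roughly three quarters of the variation in plasma volume is
# accounted for by its linear relationship with body weight.
print(round(r_squared, 2), round(percent_explained))
```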
Coefficient of determination
Research Question: If a woman
becomes pregnant on a certain day,
what is the likelihood that she would
deliver her baby on a particular date
in the future?
In this scenario, this metric aims to
calculate the correlation between
two related events: conception and
birth.
Back to our Example on Correlation: The fundamental
question: is the difference between what you observe
and what you expect given the assumption of the
population large enough to be significant -- to reject
the assumption?
The greater the difference -- the more the sample
statistic deviates from the population parameter --
the more significant it is
That is, the less likely (small probability values) such a result
would be if the population assumption were true.
The Limitations of Correlation
Three Possible Causal Explanations for a Correlation
Pearson correlation coefficient
– r
– Linear relationship
\[ r = \frac{\sum [(X - M_X)(Y - M_Y)]}{\sqrt{(SS_X)(SS_Y)}} \]
Correlation Hypothesis Testing
Step 1. Identify the population,
distribution, and assumptions
Step 2. State the null and research
hypotheses.
Step 3. Determine the characteristics of
the comparison distribution.
Step 4. Determine the critical values.
Step 5. Calculate the test statistic.
Step 6. Make a decision.
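The six steps can be sketched on the broiler example. Several things here are assumptions rather than content of the slides: x is taken as weight and y as price, the test is two-tailed at the .05 level, the test statistic is the commonly used t = r·sqrt(n – 2)/sqrt(1 – r²) with n – 2 degrees of freedom, and the critical value 3.182 is read from a standard t table for 3 degrees of freedom.

```python
import math

# Steps 1-2: broiler data; H0: no correlation, HA: there is a correlation.
x = [60, 61, 62, 63, 65]          # assumed to be weight
y = [3.1, 3.6, 3.8, 4.0, 4.1]     # assumed to be price
n = len(x)

# Step 3: comparison distribution is t with n - 2 degrees of freedom.
df = n - 2

# Step 4: two-tailed critical value for df = 3 at the .05 level
# (assumption: taken from a standard t table).
T_CRITICAL = 3.182

# Step 5: compute r with the computational formula, then convert to t.
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Step 6: reject H0 if |t| exceeds the critical value.
reject_null = abs(t_stat) > T_CRITICAL
print(round(r, 4), round(t_stat, 2), reject_null)
```

Under these assumptions the t statistic exceeds the critical value, so the null hypothesis of no correlation would be rejected.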
Reliability
A reliable measure is one that is
consistent.
One particular type of reliability is test–
retest reliability.
Correlation is used by psychometricians
to help professional sports teams
assess the reliability of athletic
performance, such as how fast a
pitcher can throw a baseball.
Validity