
7d_Correlation With % Correlation

The document explains correlation, defining it as the co-variation between two variables, which can be quantified using the correlation coefficient that ranges from -1.00 to 1.00. It discusses positive and negative correlations, linear relationships, and the use of scatterplots to visually represent these relationships. Additionally, it provides examples and calculations for correlation coefficients, emphasizing the significance of the strength and direction of relationships between variables.

Uploaded by

Michael Ogello

Correlation

John Arudo
Defining Correlation

Co-variation or co-relation between two variables
These variables change together
Usually scale (interval or ratio) variables

http://www.youtube.com/watch?v=ahp7QhbB8G4
Correlation Coefficient

A statistic that quantifies a relation between two variables
Can be either positive or negative
Falls between -1.00 and 1.00
The value of the number (not the sign) indicates the strength of the relation
Linear Correlation
[Figure: scatterplots of Y against X contrasting linear relationships with curvilinear relationships]
[Figure: scatterplots of Y against X contrasting strong relationships with weak relationships]
[Figure: scatterplot of Y against X showing no relationship]
Slides from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Positive Correlation

Association between variables such that high scores on one variable tend to go with high scores on the other variable
A direct relation between the variables
Negative Correlation
Association between variables such that high scores on one variable tend to go with low scores on the other variable
An inverse relation between the variables
A Perfect Positive Correlation
A Perfect Negative Correlation
Relationship between two continuous variables
A correlation is a statistical measure of the relationship between two variables.
The measure is best used for variables that demonstrate a linear relationship with each other.
The fit of the data can be visually represented in a scatterplot.
Under five infant mortality per 1,000 live births and percentage immunized against DPT for 16 countries
[Figure: scatterplot. x-axis: DPT Immunization Percentage (20 to 100); y-axis: 5 year mortality rate per 1,000 live births (0 to 250)]
Suppose we wished to graph the relationship between foot length and height of 20 subjects.

In order to create the graph, which is called a scatterplot or scattergram, we need the foot length and height for each of our subjects.

[Figure: empty scatterplot grid. x-axis: Foot Length (4 to 14); y-axis: Height (58 to 74)]
Assume our first subject had a 12 inch foot and was 70 inches tall.

1. Find 12 inches on the x-axis.
2. Find 70 inches on the y-axis.
3. Locate the intersection of 12 and 70.
4. Place a dot at the intersection of 12 and 70.

[Figure: scatterplot with the first point plotted at (12, 70). x-axis: Foot Length; y-axis: Height]
Assume that our second subject had an 8 inch foot and was 62 inches tall.

5. Find 8 inches on the x-axis.
6. Find 62 inches on the y-axis.
7. Locate the intersection of 8 and 62.
8. Place a dot at the intersection of 8 and 62.
9. Continue to plot points for each pair of scores.

[Figure: scatterplot with the second point plotted at (8, 62). x-axis: Foot Length; y-axis: Height]
Notice how the scores cluster to form a pattern.

The more closely they cluster to a line that is drawn through them, the stronger the linear relationship between the two variables is (in this case foot length and height).

[Figure: completed scatterplot of height against foot length with a line drawn through the points]
If the points on the scatterplot have an upward movement from left to right, we say the relationship between the variables is positive.

[Figure: scatterplot with points rising from left to right]

If the points on the scatterplot have a downward movement from left to right, we say the relationship between the variables is negative.

[Figure: scatterplot with points falling from left to right]
A positive relationship means that high scores on one variable are associated with high scores on the other variable.

It also indicates that low scores on one variable are associated with low scores on the other variable.

[Figure: scatterplot showing a positive relationship]

A negative relationship means that high scores on one variable are associated with low scores on the other variable.

It also indicates that low scores on one variable are associated with high scores on the other variable.

[Figure: scatterplot showing a negative relationship]
Not only do relationships have direction (positive and negative), they also have strength (from 0.00 to 1.00 and from 0.00 to –1.00).

The more closely the points cluster toward a straight line, the stronger the relationship is.

[Figure: sequence of scatterplots for r = 0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90 and 1.00]
A set of scores with r= –0.60 has the same strength as a set of
scores with r= 0.60 because both sets cluster similarly.
For this procedure, we use Pearson’s r. This statistical procedure
can only be used when BOTH variables are measured on a
continuous scale and you wish to measure a linear relationship.

[Diagram: Pearson r applies to a linear relationship; NO for a curvilinear relationship]
Correlation
Measure of linear relationship between two continuous random variables
It measures the closeness of the association
The table in the next slide shows the body weight and plasma volume of eight healthy men
Correlation - rho
Correlation measures how close the observations are to the straight line that best describes their linear relationship by
– Calculating the Pearson product moment correlation coefficient
– This is called the correlation coefficient (r in a sample; ρ, rho, in the population)
Explanation on scatter diagrams
r is always a number between -1 and +1
r = 0 if the variables are not linearly associated
It is positive if x and y tend to be high or low together
The larger the absolute value of r, the closer the association
The maximum value of 1 is obtained if the relationship is perfect, i.e. the points lie exactly on a straight line
r is negative if high values of y tend to go with low values of x, and vice versa
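The properties listed above can be checked numerically. The sketch below is an illustration added here, not part of the original slides: `pearson_r` is a helper name chosen for this example, and the data are invented so that every point falls exactly on a rising or a falling straight line.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for this illustration only.
x = [1, 2, 3, 4, 5]
rising = [2 * v + 1 for v in x]    # every point on a rising line
falling = [-2 * v + 1 for v in x]  # every point on a falling line

r_up = pearson_r(x, rising)
r_down = pearson_r(x, falling)
print(r_up, r_down)
```

The two results have the same magnitude and opposite signs, which matches the idea that the sign gives the direction and the absolute value gives the strength.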
Plasma volume and body weight in 8 healthy men
Subject   Body wt (kg)   Plasma vol (l)
1         58.0           2.75
2         70.0           2.86
3         74.0           3.37
4         63.5           2.76
5         62.0           2.62
6         70.5           3.49
7         71.0           3.05
8         66.0           3.12
Scatter diagram of the body wt and plasma volume relationship
[Figure: scatterplot. x-axis: x, body weight (kg), 55 to 75; y-axis: y, plasma volume (litres), 2.25 to 3.25]
Plasma volume and body weight in 5 healthy men
Subject   Body wt (kg)   Plasma vol (l)
1         58.0           2.7
2         70.0           2.9
3         74.0           3.4
4         63.5           2.8
5         62.0           2.6
Correlation
The scatter diagram shows that high plasma volume tends to be associated with increasing weight and vice versa
This association is measured by the correlation coefficient, r:
$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$
where x denotes weight, y denotes plasma volume, and $\bar{x}$ and $\bar{y}$ are the corresponding means
Formula for r

$r = \frac{\sum [(X - M_X)(Y - M_Y)]}{\sqrt{(SS_X)(SS_Y)}}$
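As a rough sketch of how this deviation-score formula can be evaluated, the block below applies it to the eight-subject table of body weight and plasma volume given earlier. The function and variable names (`pearson_r_deviation`, `ss_x`, `ss_y`) are chosen for this illustration and do not come from the slides.

```python
import math

def pearson_r_deviation(xs, ys):
    """r = sum[(X - M_X)(Y - M_Y)] / sqrt(SS_X * SS_Y)."""
    n = len(xs)
    m_x = sum(xs) / n  # M_X, the mean of X
    m_y = sum(ys) / n  # M_Y, the mean of Y
    cross = sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys))
    ss_x = sum((x - m_x) ** 2 for x in xs)  # SS_X, sum of squared deviations of X
    ss_y = sum((y - m_y) ** 2 for y in ys)  # SS_Y, sum of squared deviations of Y
    return cross / math.sqrt(ss_x * ss_y)

# Body weight (kg) and plasma volume (l) for the eight healthy men in the table.
body_wt = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma_vol = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

r_eight = pearson_r_deviation(body_wt, plasma_vol)
print(round(r_eight, 2))
```

Working from deviations about the means keeps the intermediate numbers small, which is one reason this form is often preferred over the raw-sums form when computing by machine.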
Correlation Co-efficient:
Correlation(r) = [NΣXY - (ΣX)(ΣY)] / Sqrt([NΣX² - (ΣX)²][NΣY² - (ΣY)²])
where
N = Number of values or elements
X = First Score
Y = Second Score
ΣXY = Sum of the products of First and Second Scores
ΣX = Sum of First Scores
ΣY = Sum of Second Scores
ΣX² = Sum of squared First Scores
ΣY² = Sum of squared Second Scores
N    X (Kg)   Y (Plasma)   XY      X²      Y²
1    58       2.7          156.6   3364    7.29
2    70       2.9          203     4900    8.41
3    74       3.4          251.6   5476    11.56
4    63       2.8          176.4   3969    7.84
5    62       2.6          161.2   3844    6.76
ΣX = 327   ΣY = 14.4   ΣXY = 948.8   ΣX² = 21553   ΣY² = 41.86
Note: subject 4's body weight (63.5 kg) is entered as 63 in this table.

Correlation(r) = [NΣXY - (ΣX)(ΣY)] / Sqrt([NΣX² - (ΣX)²][NΣY² - (ΣY)²])

Correlation(r) = [5*948.8 - (327*14.4)] / Sqrt[(5*21553 - 106929)(5*41.86 - 207.36)]

Correlation(r) = [4744 - 4708.8] / Sqrt[(107765 - 106929)(209.3 - 207.36)]

Correlation(r) = 35.2 / Sqrt[(836)(1.94)]

Correlation(r) = 35.2/40.27

Correlation(r) = 0.87
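The hand calculation for the five subjects can be reproduced with a short script. This is a minimal sketch of the raw-sums formula, assuming the same values used in the calculation table (subject 4's weight entered as 63); the name `pearson_r_raw_sums` is chosen for this illustration.

```python
import math

def pearson_r_raw_sums(xs, ys):
    """r = [N*SumXY - SumX*SumY] / sqrt([N*SumX2 - SumX^2][N*SumY2 - SumY^2])."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Values as entered in the calculation table (63.5 kg appears there as 63).
weight = [58, 70, 74, 63, 62]
plasma = [2.7, 2.9, 3.4, 2.8, 2.6]

r_five = pearson_r_raw_sums(weight, plasma)
print(round(r_five, 2))  # 0.87
```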
Correlation(r) = 0.99
H0: There is no correlation between the weight of the broiler and its price
HA: There is a correlation between the weight of the broiler and its price

t Test OR Student t Test

p ≤ 0.05
t = r × SQRT((n - 2)/(1 - r²))
t = 0.99 × SQRT((7 - 2)/(1.0 - 0.99²))
t = 0.99 × SQRT(5/0.02)
t = 0.99 × SQRT(250)
t = 0.99 × 15.81
t = 15.65 (Observed t value)
df = n - 2
df = 7 - 2 = 5
Observed t value = 15.65
Expected t value in the table at p = 0.05 (two-tailed test) and df = 5: 2.571
Compare the observed t value with the expected t value from the t test table
15.65 > 2.571: p < 0.05 or, to be a bit more precise, p < 0.001
Interpretation: The correlation between weight and price of broiler is highly statistically significant and the correlation is very strong (r = 0.99)
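The t statistic used above can be sketched as a small function. The name `t_from_r` is an assumption made for this illustration. Note that the slide rounds 1 - 0.99² (which is 0.0199) to 0.02 before dividing, so the unrounded statistic comes out slightly larger than 15.65; the conclusion is unaffected.

```python
import math

def t_from_r(r, n):
    """t statistic for testing a sample correlation r from n pairs against zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Broiler example: r = 0.99 from n = 7 pairs.
t_unrounded = t_from_r(0.99, 7)

# Same calculation with the denominator rounded to 0.02, as on the slide.
t_slide = 0.99 * math.sqrt((7 - 2) / 0.02)

print(round(t_unrounded, 2), round(t_slide, 2))
```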
Correlation(r) = 0.87
H0: There is no correlation between the weight of an individual and the plasma volume
HA: There is a correlation between the weight of an individual and the plasma volume

t Test OR Student t Test

p ≤ 0.05
t = r × SQRT((n - 2)/(1 - r²))
t = 0.87 × SQRT((5 - 2)/(1.0 - 0.87²))
t = 0.87 × SQRT(3/0.24)
t = 0.87 × SQRT(12.5)
t = 0.87 × 3.53
t = 3.07 (Observed t value)
df = n - 2
df = 5 - 2 = 3

Observed t value = 3.07

Expected t value in the table at p = 0.05 (two-tailed test) and df = 3: 3.182
Compare the observed t value with the expected t value from the t test table
3.07 < 3.182: p > 0.05
Interpretation: The correlation between weight and blood plasma volume is not statistically significant (r = 0.87)
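The comparison of the observed t with the table value can also be sketched in code. The block below is an illustration under stated assumptions: the dictionary holds two-tailed critical values at p = 0.05 copied from a standard t table for df 1 to 5 only (the slides quote the df = 3 and df = 5 entries), and the function name is hypothetical.

```python
import math

# Two-tailed critical t values at p = 0.05 from a standard t table (df 1 to 5).
CRITICAL_T_05_TWO_TAILED = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def correlation_significant(r, n):
    """Return (t, df, significant) for a two-tailed test of r at p = 0.05."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r ** 2))
    return t, df, t > CRITICAL_T_05_TWO_TAILED[df]

# Plasma volume example: r = 0.87 from n = 5 subjects.
t_plasma, df_plasma, sig_plasma = correlation_significant(0.87, 5)
print(round(t_plasma, 2), df_plasma, sig_plasma)

# Broiler example: r = 0.99 from n = 7 pairs.
t_broiler, df_broiler, sig_broiler = correlation_significant(0.99, 7)
print(round(t_broiler, 2), df_broiler, sig_broiler)
```

The contrast between the two examples shows why strength and significance are reported separately: with only five subjects, even r = 0.87 does not clear the critical value.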
Correlation Co-efficient Example:
To find the Correlation of

x y
60 3.1
61 3.6
62 3.8
63 4
65 4.1

Step 1: Count the number of values.


N=5

Step 2: Find XY, X², Y²

See the table below

x     y     X*Y                 X*X               Y*Y
60    3.1   60 * 3.1 = 186      60 * 60 = 3600    3.1 * 3.1 = 9.61
61    3.6   61 * 3.6 = 219.6    61 * 61 = 3721    3.6 * 3.6 = 12.96
62    3.8   62 * 3.8 = 235.6    62 * 62 = 3844    3.8 * 3.8 = 14.44
63    4     63 * 4 = 252        63 * 63 = 3969    4 * 4 = 16
65    4.1   65 * 4.1 = 266.5    65 * 65 = 4225    4.1 * 4.1 = 16.81

Step 3: Find ΣX, ΣY, ΣXY, ΣX², ΣY².

ΣX = 311
ΣY = 18.6
ΣXY = 1159.7
ΣX² = 19359
ΣY² = 69.82

Step 4: Now, substitute in the formula given above.
Correlation(r) = [NΣXY - (ΣX)(ΣY)] / Sqrt([NΣX² - (ΣX)²][NΣY² - (ΣY)²])

= ((5)*(1159.7) - (311)*(18.6)) / sqrt([(5)*(19359) - (311)²]*[(5)*(69.82) - (18.6)²])

= (5798.5 - 5784.6) / sqrt([96795 - 96721]*[349.1 - 345.96])

= 13.9 / sqrt(74*3.14)

= 13.9 / sqrt(232.36)

= 13.9 / 15.24336

r = 0.9119
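Steps 1 to 4 can be followed line by line in a short script. This is only a sketch that mirrors the hand calculation; the variable names are chosen for this illustration.

```python
import math

xs = [60, 61, 62, 63, 65]
ys = [3.1, 3.6, 3.8, 4, 4.1]

# Step 1: count the number of values.
n = len(xs)

# Step 2: find XY, X squared and Y squared for each pair.
xy = [x * y for x, y in zip(xs, ys)]
x_sq = [x * x for x in xs]
y_sq = [y * y for y in ys]

# Step 3: find the five sums.
sum_x, sum_y = sum(xs), sum(ys)
sum_xy, sum_x_sq, sum_y_sq = sum(xy), sum(x_sq), sum(y_sq)

# Step 4: substitute into the formula.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x_sq - sum_x ** 2) * (n * sum_y_sq - sum_y ** 2))
r = numerator / denominator

print(round(r, 4))  # 0.9119
```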
Significance of the Correlation Coefficient
Test for the significance of relationships between two CONTINUOUS variables
We introduced Pearson correlation as a measure of the STRENGTH of a relationship between two variables
But any relationship should be assessed for its SIGNIFICANCE as well as its strength.
Factors in relationships between two variables
– The strength of the relationship: is indicated by the correlation coefficient, r, but is actually measured by the coefficient of determination, r²
– The significance of the relationship is expressed in probability levels: p (e.g., significant at p = .05)
This tells how unlikely a given correlation coefficient, r, would be if there were no relationship in the population
– NOTE! NOTE! NOTE! The smaller the p-level, the more significant the relationship
– BUT! BUT! BUT! The larger the correlation, the stronger the relationship
Coefficient of determination (r²)
The coefficient of determination is a statistical measurement that examines how differences in one variable can be explained by the difference in a second variable, when predicting the outcome of a given event.
r² assesses how strong the linear relationship is between two variables, and is heavily relied on by researchers when conducting trend analysis.
Coefficient of determination
Research Question: If a woman becomes pregnant on a certain day, what is the likelihood that she would deliver her baby on a particular date in the future?
In this scenario, this metric aims to calculate the correlation between two related events: conception and birth.
Back to our Example on Correlation: The fundamental question: is the difference between what you observe and what you expect, given the assumption about the population, large enough to be significant -- to reject the assumption?
The greater the difference -- the more the sample statistic deviates from the population parameter -- the more significant it is
That is, the less likely (small probability values) that the population assumption is true.

The r² value is 0.83 (the square of the correlation coefficient, where r = 0.9119), indicating that 83.2% of the variation in one variable may be explained by the other
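The step from r to r² is a single squaring, shown here as a tiny sketch; the function name is an assumption made for this illustration.

```python
def coefficient_of_determination(r):
    """Square of the correlation coefficient."""
    return r ** 2

r = 0.9119
r_squared = coefficient_of_determination(r)
percent_explained = 100 * r_squared

print(round(r_squared, 4), round(percent_explained, 1))  # 0.8316 83.2
```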
The classical model makes some assumptions about the population parameter:
– Population parameters are expressed as Greek letters, while corresponding sample statistics are expressed in lower-case Roman letters:
r = correlation between two variables in the sample
ρ (rho) = correlation between the same two variables in the population
– A common assumption is that there is NO relationship between X and Y in the population: ρ = 0.0
– Under this common null hypothesis in correlational analysis, r is expected to be 0.0
Testing for the significance of the correlation coefficient, r
When the test is against the null hypothesis: r_xy = 0.0
– What is the likelihood of drawing a sample with r_xy ≠ 0.0?
– The sampling distribution of r is approximately normal (but bounded at -1.0 and +1.0) when N is large, and follows a t distribution when N is small.
– The simplest formula for computing the appropriate t value to test significance of a correlation coefficient employs the t distribution:

$t = r \sqrt{\frac{N - 2}{1 - r^2}}$

The degrees of freedom for entering the t-distribution is N - 2
Student t test
Example: Suppose you observe that r = .50 between literacy rate and political stability in 5 nations
– Is this relationship "strong"?
Coefficient of determination = r-squared = .25
This means the X variable explains 25% of the variability in the Y variable
– Is the relationship "significant"?
– That remains to be determined using the formula above
r = .50 and N = 5; set the level of significance (assume .05)
– determine one- or two-tailed test (aim for one-tailed)
– $t = r \sqrt{\frac{N - 2}{1 - r^2}} = 0.50 \sqrt{\frac{3}{0.75}} = 0.50 \times 2 = 1$
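The example can be carried through to a decision with a few lines of code. This is a sketch under stated assumptions: the two critical values are the usual t-table entries for df = 3 at the .05 level and are not quoted on this slide (the two-tailed value 3.182 does appear in an earlier worked example).

```python
import math

r, n = 0.50, 5
df = n - 2
t = r * math.sqrt(df / (1 - r ** 2))

# Critical t values for df = 3 at the .05 level, from a standard t table.
ONE_TAILED_CRITICAL = 2.353
TWO_TAILED_CRITICAL = 3.182

significant_one_tailed = t > ONE_TAILED_CRITICAL
significant_two_tailed = t > TWO_TAILED_CRITICAL

print(t, significant_one_tailed, significant_two_tailed)  # 1.0 False False
```

With only five nations, a correlation of .50 falls well short of either critical value, so the relationship would not be judged significant at the .05 level.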
Alternative ways of testing significance of r against the null hypothesis
– Look up the values in a table
– Read them off the SPSS output: check to see whether SPSS is making a one-tailed test or a two-tailed test
Misleading Correlations

Something to think about
– There is a 0.91 correlation between ice cream consumption and drowning deaths.
Does eating ice cream cause drowning?
Does grief cause us to eat more ice cream?
Correlation
Correlation is NOT causation
- e.g., armspan and height
The Limitations of Correlation

Correlation is not causation.
– Invisible third variables

Three Possible Causal Explanations for a Correlation
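A small simulation can make the "invisible third variable" idea concrete. Everything in the sketch below is hypothetical: the data are simulated with a fixed seed, temperature is assumed to drive both ice cream sales and drownings, and the partial correlation formula used at the end is a standard technique that the slides do not cover.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated data: temperature influences both variables; they do not influence each other.
temperature = [rng.uniform(10, 35) for _ in range(500)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 2) for t in temperature]

r_xy = pearson_r(ice_cream, drownings)
r_xz = pearson_r(ice_cream, temperature)
r_yz = pearson_r(drownings, temperature)

# Partial correlation of ice cream and drownings with temperature held constant.
r_partial = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(round(r_xy, 2), round(r_partial, 2))
```

In this simulation the raw correlation is strong even though neither variable affects the other, and it largely disappears once temperature is held constant.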
Pearson correlation coefficient
– r
– Linear relationship

$r = \frac{\sum [(X - M_X)(Y - M_Y)]}{\sqrt{(SS_X)(SS_Y)}}$
Correlation Hypothesis Testing
Step 1. Identify the population, distribution, and assumptions.
Step 2. State the null and research hypotheses.
Step 3. Determine the characteristics of the comparison distribution.
Step 4. Determine the critical values.
Step 5. Calculate the test statistic.
Step 6. Make a decision.
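The six steps can be sketched end to end in one function. This is an illustration under stated assumptions: the function name and return format are invented here, the critical value must be supplied by the caller from a t table, and the data are the x and y values from the worked example earlier in the deck.

```python
import math

def correlation_t_test(xs, ys, critical_t):
    """Two-tailed test of H0: no correlation, against a caller-supplied critical t."""
    # Steps 1 and 2: paired continuous scores; H0 is no correlation in the population.
    n = len(xs)
    # Step 3: the comparison distribution is t with n - 2 degrees of freedom.
    df = n - 2
    # Step 5: calculate r, then the test statistic.
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt(df / (1 - r ** 2))
    # Step 6: decide by comparing |t| with the critical value from Step 4.
    decision = "reject H0" if abs(t) > critical_t else "fail to reject H0"
    return {"r": r, "t": t, "df": df, "decision": decision}

# Step 4: two-tailed critical value at p = 0.05 for df = 3, from a standard t table.
result = correlation_t_test([60, 61, 62, 63, 65], [3.1, 3.6, 3.8, 4, 4.1], critical_t=3.182)
print(result)
```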
Reliability
A reliable measure is one that is consistent.
One particular type of reliability is test–retest reliability.
Correlation is used by psychometricians to help professional sports teams assess the reliability of athletic performance, such as how fast a pitcher can throw a baseball.
Validity

A valid measure is one that measures what it was designed or intended to measure.
Correlation is used to calculate validity, often by correlating a new measure with existing measures known to assess the variable of interest.
