Correlation and Regression
Correlation coefficient
The degree of association is measured by a correlation coefficient, denoted by r. It is sometimes
called Pearson's correlation coefficient after its originator and is a measure of linear association.
If a curved line is needed to express the relationship, other and more complicated measures of
the correlation must be used.
The words "independent" and "dependent" could puzzle the beginner because it is sometimes
not clear what is dependent on what. This confusion is a triumph of common sense over
misleading terminology, because often each variable is dependent on some third variable,
which may or may not be mentioned. It is reasonable, for instance, to think of the height of
children as dependent on age rather than the converse but consider a positive correlation
between mean tar yield and nicotine yield of certain brands of cigarette.¹ The nicotine liberated
is unlikely to have its origin in the tar: both vary in parallel with some other factor or factors in
the composition of the cigarettes. The yield of the one does not seem to be "dependent" on the
other in the sense that, on average, the height of a child depends on his age. In such cases it
often does not matter which scale is put on which axis of the scatter diagram. However, if the
intention is to make inferences about one variable from the other, the observations from which
the inferences are to be made are usually put on the baseline. As a further example, a plot of
monthly deaths from heart disease against monthly sales of ice cream would show a negative
association. However, it is hardly likely that eating ice cream protects from heart disease! It is
simply that the mortality rate from heart disease is inversely related - and ice cream
consumption positively related - to a third factor, namely environmental temperature.
Figure 11.2 Scatter diagram of relation in 15 children between height and pulmonary
anatomical dead space.
The calculation of the correlation coefficient is as follows, with x representing the values of
the independent variable (in this case height) and y representing the values of the dependent
variable (in this case anatomical dead space). The formula to be used is:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$
Calculator procedure
Find the mean and standard deviation of x, and likewise of y, as described earlier.
This gives us the denominator of the formula. (Remember to exit from "Stat" mode.)
For the numerator, multiply each value of x by the corresponding value of y, add these products
together and store them:
110 × 44 = Min
116 × 31 = M+
etc.
r = 5426.6/6412.0609 = 0.846.
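The arithmetic above is also easy to reproduce in software. Below is a minimal Python sketch of the same formula; the two data lists are placeholders standing in for the 15 pairs of table 11.1, which is not reproduced here.

import math

def pearson_r(x, y):
    """Pearson correlation: the sum of products of deviations divided by
    the square root of the product of the two sums of squared deviations."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Placeholder data -- substitute the height (cm) and anatomical dead
# space (ml) columns of table 11.1.
heights = [110, 116, 124, 129, 131]    # hypothetical subset
dead_spaces = [44, 31, 43, 45, 56]     # hypothetical subset
print(round(pearson_r(heights, dead_spaces), 3))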
The correlation coefficient of 0.846 indicates a strong positive correlation between size of
pulmonary anatomical dead space and height of child. But in interpreting correlation it is
important to remember that correlation is not causation. There may or may not be a causative
connection between the two correlated variables. Moreover, if there is a connection it may be
indirect.
A part of the variation in one of the variables (as measured by its variance) can be thought of
as being due to its relationship with the other variable and another part as due to undetermined
(often "random") causes. The part due to the dependence of one variable on the other is
measured by r², the square of the correlation coefficient. For these data r² = 0.846² = 0.716, so
we can say that 72% of the variation between children in size of the anatomical dead space is
accounted for by the height of the child. If we
wish to label the strength of the association, for absolute values of r, 0-0.19 is regarded as very
weak, 0.2-0.39 as weak, 0.40-0.59 as moderate, 0.6-0.79 as strong and 0.8-1 as very strong
correlation, but these are rather arbitrary limits, and the context of the results should be
considered.
Significance test
To test whether the association is merely apparent, and might have arisen by chance, use the
t test in the following calculation:

$$t = r\sqrt{\frac{n-2}{1-r^2}} \qquad (11.1)$$

For example, the correlation coefficient for these data was 0.846 and the number of pairs of
observations was 15. Applying equation 11.1, we have:

$$t = 0.846\sqrt{\frac{15-2}{1-0.846^2}} = 5.72$$

with 13 degrees of freedom, so the association is highly significant (P < 0.001).
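The same calculation can be scripted. Below is a minimal Python sketch assuming only the summary values already quoted in the text (r = 0.846, n = 15); scipy is used to convert t into a P value.

import math
from scipy import stats

r, n = 0.846, 15
t = r * math.sqrt((n - 2) / (1 - r ** 2))  # equation 11.1
p = 2 * stats.t.sf(abs(t), df=n - 2)       # two-sided P value, 13 df
print(t, p)                                # t is about 5.72, P < 0.001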
Spearman rank correlation
When the observations do not satisfy the assumptions behind Pearson's coefficient, a rank
correlation coefficient can be calculated instead:

$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where d is the difference in the ranks of the two variables for a given individual. Thus we can
derive table 11.2 from the data in table 11.1.
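This formula is simple to program. The sketch below assumes there are no tied ranks (ties need average ranks, which scipy.stats.spearmanr handles automatically).

def spearman_rs(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2)/(n*(n^2 - 1)).
    Valid only when there are no tied values in x or in y."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))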
The regression equation
The relationship can be represented by a simple equation called the regression equation. In this
context "regression" (the term is a historical anomaly) simply means that the average value of
y is a "function" of x, that is, it changes with x.
The regression equation representing how much y changes with any given change of x can be
used to construct a regression line on a scatter diagram, and in the simplest case this is assumed
to be a straight line. The direction in which the line slopes depends on whether the correlation
is positive or negative. When the two sets of observations increase or decrease together
(positive) the line slopes upwards from left to right; when one set decreases as the other
increases the line slopes downwards from left to right. As the line must be straight, it will
probably pass through few, if any, of the dots. Given that the association is well described by
a straight line we have to define two features of the line if we are to place it correctly on the
diagram. The first of these is its distance above the baseline; the second is its slope. They are
expressed in the following regression equation:

$$y = \alpha + \beta x$$

With this equation we can find a series of values of y, the dependent variable, that correspond
to each of a series of values of x, the independent variable. The parameters α and β have to be
estimated from the data. The parameter α signifies the distance above the baseline at which the
regression line cuts the vertical (y) axis; that is, the value of y when x = 0. The parameter β
(the regression coefficient)
signifies the amount by which change in x must be multiplied to give the corresponding average
change in y, or the amount y changes for a unit increase in x. In this way it represents the degree
to which the line slopes upwards or downwards.
The regression equation is often more useful than the correlation coefficient. It enables us to
predict y from x and gives us a better summary of the relationship between the two variables.
If, for a particular value of x, $x_i$, the regression equation predicts a value of $y_{\text{fit}}$, the prediction
error is $y_i - y_{\text{fit}}$. It can easily be shown that any straight line passing through the mean values $\bar{x}$
and $\bar{y}$ will give a total prediction error $\sum (y_i - y_{\text{fit}})$ of zero because the positive and negative
terms exactly cancel. To remove the negative signs we square the differences and choose the
regression equation to minimise the sum of squares of the prediction errors, $\sum (y_i - y_{\text{fit}})^2$. We
denote the sample estimates of α and β by a and b. It can be shown that the one straight line
that minimises $\sum (y_i - y_{\text{fit}})^2$, the least squares estimate, is given by

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \qquad (11.2)$$

and

$$a = \bar{y} - b\bar{x}$$

which is of use because we have calculated all the components of equation (11.2) in the
calculation of the correlation coefficient.
The calculation of the correlation coefficient on the data in table 11.2 gave the following:

$$\sum (x - \bar{x})(y - \bar{y}) = 5426.6, \qquad \sum (x - \bar{x})^2 = 5251.6$$

Applying these figures to the formulae for the regression coefficients, we have:

$$b = \frac{5426.6}{5251.6} = 1.033 \text{ ml/cm}$$

with the intercept obtained as $a = \bar{y} - b\bar{x}$ from the means of the two columns of the table.
This means that, on average, for every increase in height of 1 cm the increase in anatomical
dead space is 1.033 ml over the range of measurements made.
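The least squares estimates are straightforward to compute directly. A minimal Python sketch follows; the two data lists are again placeholders standing in for the height and dead space columns of table 11.1.

def least_squares(x, y):
    """Least squares line y = a + b*x:
    b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), a = ybar - b*xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Placeholder data -- substitute the 15 pairs from table 11.1.
heights = [110, 116, 124, 129, 131]      # hypothetical subset, cm
dead_spaces = [44, 31, 43, 45, 56]       # hypothetical subset, ml
a, b = least_squares(heights, dead_spaces)
print(a, b)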
The line representing the equation is shown superimposed on the scatter diagram of the data in
figure 11.2. The way to draw the line is to take three values of x, one on the left side of the
scatter diagram, one in the middle and one on the right, and substitute each in the equation
y = a + bx to obtain the corresponding fitted values of y.
Although two points are enough to define the line, three are better as a check. Having put them
on a scatter diagram, we simply draw the line through them.
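This check can be scripted as below, where a and b stand for the estimates obtained above (the intercept shown is a hypothetical value, since it is not quoted in this extract).

a, b = -80.0, 1.033          # b from the text; a is hypothetical
for x in (110, 140, 170):    # left, middle and right of the scatter diagram
    print(x, a + b * x)      # three points through which to rule the line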
Figure 11.3 Regression line drawn on scatter diagram relating height and pulmonary anatomical
dead space in 15 children.
The standard error of the slope is given by

$$SE(b) = \frac{s_{\text{res}}}{\sqrt{\sum (x - \bar{x})^2}} \qquad (11.3)$$

where $s_{\text{res}}$ is the residual standard deviation about the line,

$$s_{\text{res}} = \sqrt{\frac{\sum (y - y_{\text{fit}})^2}{n - 2}}$$

We already have to hand all of the terms in this expression: the residual sum of squares can be
found as $\sum (y - \bar{y})^2 - b^2 \sum (x - \bar{x})^2$, giving $s_{\text{res}}$ = 13.08445, and the denominator of (11.3) is
$\sqrt{5251.6}$ = 72.4680. Thus SE(b) = 13.08445/72.4680 = 0.18055.
We can test whether the slope is significantly different from zero by:

$$t = \frac{b}{SE(b)} = \frac{1.033}{0.18055} = 5.72$$

Again, this has n - 2 = 15 - 2 = 13 degrees of freedom. The assumptions governing this test are:
1. That the prediction errors are approximately Normally distributed. Note this does not mean that
the x or y variables have to be Normally distributed.
2. That the relationship between the two variables is linear.
3. That the scatter of points about the line is approximately constant - we would not wish the
variability of the dependent variable to be growing as the independent variable increases. If
this is the case try taking logarithms of both the x and y variables.
Note that the test of significance for the slope gives exactly the same value of P as the test of
significance for the correlation coefficient. Although the two tests are derived differently, they
are algebraically equivalent, which makes intuitive sense.
A 95% confidence interval for the slope is given by $b \pm t_{0.05} \times SE(b)$, where the t statistic has
13 degrees of freedom and is equal to 2.160. The interval therefore runs from
1.033 - 2.160 × 0.18055 to 1.033 + 2.160 × 0.18055, that is, from 0.643 to 1.423.
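These inference steps can also be scripted. The sketch below assumes the summary values already derived in the text (b = 1.033, SE(b) = 0.18055, n = 15) and uses scipy for the t distribution.

from scipy import stats

b, se_b, n = 1.033, 0.18055, 15
df = n - 2
t = b / se_b                                 # test of zero slope, about 5.72
p = 2 * stats.t.sf(abs(t), df=df)            # two-sided P value
t_crit = stats.t.ppf(0.975, df=df)           # about 2.160 for 13 df
ci = (b - t_crit * se_b, b + t_crit * se_b)  # 95% CI: about 0.643 to 1.423
print(t, p, ci)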
Regression lines give us useful information about the data they are collected from. They show
how one variable changes on average with another, and they can be used to find out what one
variable is likely to be when we know the other - provided that we ask this question within the
limits of the scatter diagram. To project the line at either end - to extrapolate - is always risky
because the relationship between x and y may change or some kind of cut off point may exist.
For instance, a regression line might be drawn relating the chronological age of some children
to their bone age, and it might be a straight line between, say, the ages of 5 and 10 years, but
to project it up to the age of 30 would clearly lead to error. Computer packages will often
produce the intercept from a regression equation, with no warning that it may be totally
meaningless. Consider a regression of blood pressure against age in middle aged men. The
regression coefficient is often positive, indicating that blood pressure increases with age. The
intercept is often close to zero, but it would be wrong to conclude that this is a reliable estimate
of the blood pressure in newly born male infants!
Common questions
If two variables are correlated are they causally related?
It is a common error to confuse correlation and causation. All that correlation shows is that the
two variables are associated. There may be a third variable, a confounding variable that is
related to both of them. For example, monthly deaths by drowning and monthly sales of ice-
cream are positively correlated, but no-one would say the relationship was causal!
How do I test the assumptions underlying linear regression?
Firstly, always look at the scatter plot and ask, is it linear? Having obtained the regression
equation, calculate the residuals $e_i = y_i - y_{\text{fit}}$. A histogram of the $e_i$ will reveal departures from
Normality, and a plot of $e_i$ versus $y_{\text{fit}}$ will reveal whether the residuals increase in size as $y_{\text{fit}}$
increases.
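The two diagnostic plots can be produced with matplotlib. This is a sketch only; the data and coefficients below are hypothetical stand-ins for the observed values and the estimates a and b obtained earlier.

import matplotlib.pyplot as plt

x = [110, 116, 124, 129, 131]   # hypothetical observed data
y = [44, 31, 43, 45, 56]
a, b = -80.0, 1.033             # hypothetical intercept; slope from the text

fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(residuals)             # histogram: departures from Normality
ax1.set_xlabel("residual")
ax2.scatter(fitted, residuals)  # fanning out suggests non-constant scatter
ax2.set_xlabel("fitted value")
ax2.set_ylabel("residual")
plt.show()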
References
1. Russell MAH, Cole PY, Idle MS, Adams L. Carbon monoxide yields of cigarettes and their
relation to nicotine yield and type of filter. BMJ 1975; 3:713.
2. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of
clinical measurement. Lancet 1986; i:307-10.
3. Brown RA, Swanson-Beck J. Medical Statistics on Personal Computers, 2nd edn. London:
BMJ Publishing Group, 1993.
4. Armitage P, Berry G. Statistical Methods in Medical Research, 3rd edn. Oxford: Blackwell
Scientific Publications, 1994:312-41.