Handout 07 - Correlation and Regression
Independent variable
Interval/ratio with lots of points Nominal, ordinal, or interval/ratio with few points
Compare groups using summary statistics (mean, median, s.d., etc.)
The data display a consistent, positive relationship between unpaid overtime and absenteeism that
is very strong (Somers’ d = 0.69).
The general principle behind finding the line-of-best-fit to a linear relationship is to make the gaps
(residuals) between the line and the individual data points as small as possible.
This principle is formally known as Ordinary Least Squares (OLS): minimize the sum of the
squared residuals.
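To make the least-squares criterion concrete, here is a minimal Python sketch. The five data points are hypothetical (they are not the handout's data), and the function name `sum_squared_residuals` is only illustrative. It compares the sum of squared residuals for the OLS line through these points with the sums for two nearby lines.

```python
# Hypothetical data, invented for illustration only.
overtime_hours = [1, 2, 3, 4, 5]   # X: weekly unpaid overtime
days_lost = [2, 4, 5, 4, 5]        # Y: annual days lost


def sum_squared_residuals(a, b, xs, ys):
    """Sum of the squared gaps between each observed y and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# For these five points the OLS line is Y = 2.2 + 0.6X.
ols = sum_squared_residuals(2.2, 0.6, overtime_hours, days_lost)
# Two other candidate lines: one shifted down, one slightly steeper.
shifted = sum_squared_residuals(2.0, 0.6, overtime_hours, days_lost)
steeper = sum_squared_residuals(2.2, 0.7, overtime_hours, days_lost)

print(round(ols, 2), round(shifted, 2), round(steeper, 2))
```

The OLS line gives the smallest of the three totals; any other straight line through these points would give a larger sum.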
There are two features of the OLS regression line that we use to determine it for any given set of
points:
1. The OLS regression line will pass through the point defined by the means of the two variables.
2. The slope of the regression line (the regression coefficient, b) is always given by the following formula:

$$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
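The slope formula can be sketched in a few lines of Python. The data below are hypothetical and the function name `ols_slope` is an illustrative choice, not something the handout defines.

```python
def ols_slope(xs, ys):
    """Regression coefficient b: sum of cross-products of the deviations
    from the means, divided by the sum of squared deviations of X."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical data, invented for illustration only.
overtime_hours = [1, 2, 3, 4, 5]   # X
days_lost = [2, 4, 5, 4, 5]        # Y

b = ols_slope(overtime_hours, days_lost)
print(b)
```

For these points the cross-products sum to 6 and the squared deviations of X sum to 10, so the slope works out to 0.6.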
Determining the line-of-best-fit
The OLS regression will pass through the point
that is the mean for the two variables
        Weekly unpaid overtime    Annual days lost
Mean    4.3                       4.8

$$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = 0.8$$
Determining the line-of-best-fit (cont.)
Once we know the slope of the regression line
and one of the points through which it passes,
we can then extend the line back to the Y-axis
to the point at which X = 0
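In symbols, this step can be written as follows, using a for the Y-intercept (a label assumed here; the handout does not name it). Because the line has slope b and passes through the point of means, moving back from the mean of X to X = 0 lowers the predicted value by b times the mean of X:

$$a = \bar{Y} - b\,\bar{X}, \qquad \hat{Y} = a + bX$$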
From this line I can make predictions about the number of days lost from levels of unpaid overtime.
If I receive information that a particular employee works 5 hours of unpaid overtime per week, I will
predict:
– Annual days of absenteeism = 1.5 + 0.8(5)
– Annual days of absenteeism = 1.5 + 4
– Annual days of absenteeism = 5.5 days
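The same prediction can be sketched as a small Python function. The intercept (1.5) and slope (0.8) are taken from the worked example above; the function and parameter names are illustrative assumptions.

```python
def predict_days_lost(overtime_hours, intercept=1.5, slope=0.8):
    """Predicted annual days of absenteeism for a given number of
    weekly unpaid overtime hours, using the line Y = 1.5 + 0.8X."""
    return intercept + slope * overtime_hours


print(predict_days_lost(5))  # 5.5, matching the worked example
```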
The effect of units of measurement
The value of b is extremely useful because we can predict changes in one variable based on the
value of another variable, expressed in their ‘natural’ units.
The regression coefficient is also limited by this. For example, if I measured absenteeism in days
lost per week rather than per year, every value of absenteeism, and with it the coefficient b, is divided by 52.
The coefficient now seems much smaller, but really the relationship has not changed.
Because of this, we do not use the value of b to say that the relationship is weak, moderate, or
strong.
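A short Python sketch of this effect, using hypothetical data (not the handout's) and an illustrative `ols_slope` helper: the same observations are expressed per year and per week, and the slope shrinks by a factor of 52 even though the points are otherwise identical.

```python
def ols_slope(xs, ys):
    """OLS regression coefficient b for predicting ys from xs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical data, invented for illustration only.
overtime_hours = [1, 2, 3, 4, 5]
days_lost_per_year = [2, 4, 5, 4, 5]
# The same absenteeism, re-expressed as days lost per week.
days_lost_per_week = [y / 52 for y in days_lost_per_year]

b_year = ols_slope(overtime_hours, days_lost_per_year)
b_week = ols_slope(overtime_hours, days_lost_per_week)

# b_week equals b_year / 52 (up to floating-point rounding).
print(b_year, b_week, b_year / 52)
```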
1. It is positive when the relationship it measures is positive, and negative when the relationship it
measures is negative.
3. The extreme values of -1 and +1 indicate perfect association, and 0 indicates no linear association.
4. The square of r, called the coefficient of determination, r², can be interpreted as a PRE measure
of association that measures how closely the data points fall near the regression line, and
therefore the confidence we can have in our predictions based on the OLS regression line: it
measures how much variance in the dependent variable is explained by the regression line.
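The properties above can be checked with a minimal Python sketch. It assumes that r is Pearson's product-moment correlation coefficient (this section does not include its definition), and it uses hypothetical data and illustrative function names. It computes r, then computes the proportional reduction in error directly, to show that the two routes to r² agree.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient (assumed to be the handout's r)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def proportional_reduction_in_error(xs, ys):
    """PRE = (E1 - E2) / E1, where E1 is the squared error when every case
    is predicted by the mean of Y, and E2 is the squared error when every
    case is predicted by the OLS line."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    e1 = sum((y - y_bar) ** 2 for y in ys)
    e2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (e1 - e2) / e1


# Hypothetical data, invented for illustration only.
overtime_hours = [1, 2, 3, 4, 5]
days_lost = [2, 4, 5, 4, 5]

r = pearson_r(overtime_hours, days_lost)
pre = proportional_reduction_in_error(overtime_hours, days_lost)
print(round(r, 3), round(r ** 2, 3), round(pre, 3))
```

Unlike b, r does not change when a variable is re-expressed in different units: dividing every value of `days_lost` by 52 leaves `pearson_r` unchanged.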
Using SPSS
Command: Analyze/Regression/Linear
2. Prediction based on the OLS line is unreliable where the error terms are heteroscedastic (i.e. the
error terms ‘fan out’ as we move along the line).
3. Stability: the OLS line can be misused if the relationship is extrapolated beyond the range of values
used in determining the equation.
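One way to guard against the extrapolation problem is to refuse to predict outside the observed range of X. The sketch below is only an illustration: the line Y = 1.5 + 0.8X comes from the earlier worked example, but the observed range of 0 to 10 hours is a hypothetical assumption, as are the function and parameter names.

```python
def predict_within_range(x, x_min, x_max, intercept=1.5, slope=0.8):
    """Predict Y from the OLS line, but only for X values inside the
    range of X that was used to fit the line."""
    if not x_min <= x <= x_max:
        raise ValueError(
            f"X = {x} lies outside the observed range [{x_min}, {x_max}]; "
            "the fitted line may not hold there."
        )
    return intercept + slope * x


# Hypothetical observed range: 0 to 10 hours of weekly unpaid overtime.
print(predict_within_range(5, x_min=0, x_max=10))

try:
    predict_within_range(40, x_min=0, x_max=10)
except ValueError as err:
    print("Refused:", err)
```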
R² = 0.92
Considerations when using multivariate analysis
Can handle dichotomous independent variables