Handout 07 - Correlation and Regression

Bivariate relationships: Correlation and regression
Types of data and choice of bivariate methods
• Dependent variable interval/ratio with lots of points:
– Independent variable interval/ratio with lots of points: scatterplot; regression analysis
– Independent variable nominal, ordinal, or interval/ratio with few points: compare groups using summary statistics (mean, median, s.d., etc.); calculate eta to measure strength; maybe also collapse the dependent variable into groups and construct a crosstab/stacked bar chart
• Dependent variable nominal, ordinal, or interval/ratio with few points:
– Independent variable interval/ratio with lots of points: collapse the independent variable into groups and construct a crosstab; use a measure of association appropriate to the model and levels of measurement
– Independent variable nominal, ordinal, or interval/ratio with few points: crosstab or stacked bar chart; use a measure of association appropriate to the model and levels of measurement
Analyzing a bivariate relationship: Review
• Research question: Does unpaid overtime increase worker absenteeism?
• Data are gathered from 90 employees, who are asked:
– how many hours a week do you normally spend doing unpaid overtime? (independent variable)
– how many days in the last year were you absent for sickness reasons? (dependent variable)
• Responses are grouped and crosstabulated
• The data display a consistent, positive relationship between unpaid overtime and absenteeism that is very strong (Somers’ d = 0.69)
                      Weekly unpaid overtime
Annual days lost      2 hours or less   More than 2-4 hours   More than 4-6 hours   More than 6 hours   Total
1-2 days              64%               4%                    0%                    0%                  21%
More than 2-4 days    29%               46%                   43%                   5%                  30%
More than 4-6 days    4%                38%                   21%                   14%                 19%
More than 6 days      4%                12%                   36%                   82%                 30%
Total (n)             28                26                    14                    22                  90
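As a rough check on the reported Somers’ d, the sketch below computes it from cell counts. The counts are an assumption: the handout gives only column percentages and column totals, so each count here is the percentage multiplied by its column total and rounded. The function name somers_d is illustrative, not taken from any package.

```python
# Cell counts reconstructed from the column percentages and totals
# (an assumption; the handout reports percentages only).
# Rows: annual days lost, low to high. Columns: weekly unpaid overtime, low to high.
counts = [
    [18, 1, 0, 0],
    [8, 12, 6, 1],
    [1, 10, 3, 3],
    [1, 3, 5, 18],
]


def somers_d(table):
    """Somers' d treating the row variable as dependent on the column variable."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair this cell with every cell in a higher row.
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:    # higher on both variables: concordant
                        concordant += table[i][j] * table[k][m]
                    elif m < j:  # higher on one, lower on the other: discordant
                        discordant += table[i][j] * table[k][m]
    col_totals = [sum(row[j] for row in table) for j in range(n_cols)]
    n = sum(col_totals)
    total_pairs = n * (n - 1) // 2
    # Pairs tied on the independent variable are excluded from the denominator.
    tied_on_x = sum(c * (c - 1) // 2 for c in col_totals)
    return (concordant - discordant) / (total_pairs - tied_on_x)


d = somers_d(counts)
print(round(d, 2))  # 0.69
```

With these reconstructed counts the result agrees with the value quoted above, which suggests the rounding assumption is reasonable.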
Scatterplots for two interval/ratio scales
• In the previous example, to make these distributions fit into a crosstab, we recoded the interval/ratio scales into broader categories
• With two interval/ratio scales with many values, however, we do not need to collapse each down into broader categories and crosstabulate them
• This is an unnecessary loss of information; there is an alternative form of data description for this particular combination of scales
• These data can be displayed in a scatterplot with:
– the independent variable across the horizontal X-axis, and
– the dependent variable along the vertical Y-axis
• A scatterplot allows each unique value on an interval/ratio scale to ‘stand alone’ rather than forcing them into broad categories within a crosstab
• The axes of the scatterplot also capture the quantity of the variables measured, e.g. the difference between 2 and 4 hours is captured
Interpreting a scatterplot: Direction and pattern
• If a change in a particular direction on the scale for one variable is associated with a change in the same direction on the scale of the other variable, we have a positive relationship: if high scores for one variable are associated with high scores for the other variable, and low scores are associated with low scores, the direction of the relationship is positive
• This scatterplot describes a positive relationship: as overtime increases (decreases), absenteeism increases (decreases)
• If a change in a particular direction on the scale for one variable is associated with a change in the opposite direction on the scale of the other variable, we have a negative relationship
• We can also observe that the coordinates in the scatterplot all seem to fall around an imaginary straight line: we describe this as a positive, linear correlation. This is analogous to the concept of consistency when looking at the pattern of a relationship in a crosstab
• “From the scatterplot we can say that there is a positive, linear relationship between weekly unpaid overtime and absenteeism as measured by annual days lost.”
• It is also possible to observe coordinates that do not fall around a straight line: these are non-linear relationships
Describing a relationship: The line-of-best-fit
• The relationship can be summarized quantitatively in the form of an equation that expresses the ‘best’ line that fits the coordinates
• The general principle behind finding the line-of-best-fit to a linear relationship is to make the gaps (residuals) between the line and the individual points as small as possible
• This principle in formal terms is known as Ordinary Least Squares (OLS): minimize the sum of the squared residuals
• There are two features of the OLS regression line that we use to determine it for any given set of points:
1. The OLS regression line will pass through the point that is the mean for each of the two variables.
2. The slope of the regression line (the regression coefficient) will always equal the following formula:
\[ b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \]
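The two features above are enough to compute the line directly. The following is a minimal sketch: the function name ols_fit and the five data points are hypothetical and are not the handout's 90 employees. The intercept is obtained from feature 1, since a line through the point of means must satisfy mean(Y) = a + b × mean(X).

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the OLS line for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    # Feature 1: the line passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical illustration data (weekly overtime hours, annual days lost).
hours = [1, 2, 4, 5, 8]
days = [2, 3, 5, 5, 8]
a, b = ols_fit(hours, days)
print(f"Y = {a:.2f} + {b:.2f}X")
```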
Determining the line-of-best-fit
• The OLS regression line will pass through the point that is the mean for the two variables:

        Weekly unpaid overtime   Annual days lost
Mean    4.3                      4.8

• The slope of the regression line (the regression coefficient) will always equal the following formula:
\[ b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = 0.8 \]
• In practical terms this means that for every 1 hour increase in unpaid overtime per week we expect the number of annual days of absenteeism to go up by 0.8
Determining the line-of-best-fit (cont.)
• Once we know the slope of the regression line and one of the points through which it passes, we can then extend the line back to the Y-axis, to the point at which X = 0
• This gives us the constant for the regression equation:
a = 1.5
• In practical terms this means that if unpaid overtime fell to 0, we predict that absenteeism will average 1.5 days per year
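Extending the line back to the Y-axis can also be done algebraically. Because the line passes through the point of means, the constant satisfies the relation below. The numerical check uses the rounded values quoted in this handout, so it only approximates the reported constant; the value of 1.5 presumably comes from unrounded means and slope.

\[ a = \bar{Y} - b\bar{X} \]
\[ a \approx 4.8 - 0.8(4.3) = 4.8 - 3.44 = 1.36 \]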
OLS Regression equation
• We can put all these aspects of the OLS regression line together to express it as an equation:
– Annual days of absenteeism = 1.5 + 0.8(Weekly hours of unpaid overtime)
– Y = 1.5 + 0.8X
• From this I can make predictions about the number of days lost from levels of unpaid overtime
• If I receive information that a particular employee works 5 hours unpaid overtime per week I will predict:
– Annual days of absenteeism = 1.5 + 0.8(5)
– Annual days of absenteeism = 1.5 + 4
– Annual days of absenteeism = 5.5 days
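The same prediction can be written as a small function. This is only a sketch of the equation above; the function and constant names are illustrative.

```python
INTERCEPT = 1.5  # constant a from the regression equation
SLOPE = 0.8      # regression coefficient b


def predict_days_absent(weekly_overtime_hours):
    """Predicted annual days of absenteeism for a given level of unpaid overtime."""
    return INTERCEPT + SLOPE * weekly_overtime_hours


print(predict_days_absent(5))  # 5.5
```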
The effect of units of measurement
• The value of b is extremely useful because we can predict changes in one variable based on the value of another variable, expressed in their ‘natural’ units.
• The regression coefficient is also limited by this. For example, if I measured absenteeism in days lost per week rather than per year, every value of Y, and with it both the constant and the coefficient, is divided by 52 (b falls from 0.8 to about 0.015)
• The coefficient now seems much smaller, but really the relationship has not changed
• Because of this, we do not use the value of b to say that the relationship is weak, moderate, or strong.
• We use a standardized measure called Pearson’s correlation coefficient, r, to assess this.
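The point about units can be illustrated with a short sketch. The data are hypothetical, and slope_and_r is an illustrative helper rather than a library function. Dividing Y by 52 shrinks the slope by the same factor but leaves Pearson's r unchanged.

```python
from math import sqrt


def slope_and_r(xs, ys):
    """Return (OLS slope, Pearson's r) for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sum_xx, sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical illustration data.
hours = [1, 2, 4, 5, 8]
days_per_year = [2, 3, 5, 5, 8]
days_per_week = [y / 52 for y in days_per_year]  # same information, different unit

b_year, r_year = slope_and_r(hours, days_per_year)
b_week, r_week = slope_and_r(hours, days_per_week)
print(f"slope per year: {b_year:.4f}, slope per week: {b_week:.4f}")
print(f"r per year: {r_year:.4f}, r per week: {r_week:.4f}")
```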

Interpreting a scatterplot: Strength
• The strength of any observed relationship relates to:
– the slope of the imaginary line running through the coordinates (i.e. its ‘steepness’),
– how close the coordinates are to the line
• Unlike a crosstab, it is difficult to assess the strength of the relationship in a scatterplot just with the naked eye.
• This is because the apparent slope of the line and the gaps between the coordinates can be altered just by changing the relative size of the axes of the scatterplot and by changing the order of magnitude of the scale
The strength of a linear relationship
Pearson’s r has the following properties:
1. It is positive when the relationship it measures is positive, and negative when the relationship it measures is negative.
2. It ranges between –1 and +1, regardless of the units of measurement.
3. The extreme values of –1 and +1 indicate perfect association, and 0 indicates no linear association.
4. The square of r, called the coefficient of determination, r², can be interpreted as a PRE (proportional reduction in error) measure of association. It measures how closely the data points fall near the regression line, and therefore the confidence we can have in our predictions based on the OLS regression line: it measures how much of the variance in the dependent variable is explained by the regression line.
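The PRE reading of r² can be sketched as follows, using hypothetical data and illustrative function names. E1 is the squared error when every Y is predicted by the mean of Y; E2 is the squared error when Y is predicted from the OLS line; the proportional reduction (E1 − E2) / E1 equals the square of Pearson's r.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired observations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


def pre_from_regression(xs, ys):
    """Proportional reduction in error from using the OLS line instead of the mean."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    e1 = sum((y - mean_y) ** 2 for y in ys)  # errors predicting with the mean
    e2 = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))  # residuals
    return (e1 - e2) / e1


# Hypothetical illustration data.
hours = [1, 2, 4, 5, 8]
days = [2, 3, 5, 5, 8]
r = pearson_r(hours, days)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}, PRE = {pre_from_regression(hours, days):.3f}")
```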
Using SPSS
Command: Analyze/Regression/Linear
Annual days of absenteeism = 1.5 + 0.8(Weekly hours of unpaid overtime)
R² = 0.60
Reporting results
There is a linear and positive relationship between unpaid overtime and absenteeism as measured by annual working days lost due to sickness. This relationship can be expressed as follows:
Annual days of absenteeism = 1.5 + 0.8(Weekly hours of unpaid overtime)
Thus for every 1 hour increase in weekly unpaid overtime, absenteeism per employee increases by 0.8 days a year. This relationship is also very strong, and should allow us to make reasonably accurate predictions about the impact that unpaid overtime will have on absenteeism (r² = 0.60).
Use and misuse of OLS regression
1. Linearity: the OLS regression line is the line-of-best-fit for linear relationships only.
2. Homoscedasticity: prediction based on the OLS line is unreliable where the error terms are heteroscedastic (i.e. the error terms ‘fan out’ as we move along the line).
3. Stability: the OLS line can be misused if the relationship is extrapolated beyond the range of values used in determining the equation.
4. Outliers: there should be no outliers on the dependent variable, independent variable, or combination of variables.
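One way to guard against the extrapolation problem in point 3 is to refuse predictions outside the observed range of X. This is only a sketch: the function name is illustrative, and the range of 0 to 10 hours used in the example is a made-up assumption, since the handout does not report the range of the overtime data.

```python
def predict_within_range(x, intercept, slope, x_min, x_max):
    """Predict Y from the OLS line, but only inside the observed range of X."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"X = {x} lies outside the observed range [{x_min}, {x_max}]; "
            "the fitted line may not hold there"
        )
    return intercept + slope * x


# Assumed observed range of 0 to 10 weekly overtime hours (hypothetical).
print(predict_within_range(5, 1.5, 0.8, x_min=0, x_max=10))
```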

Extending the analysis: Multiple regression
• Regression analysis can be extended to show the extent to which a combination of independent variables, jointly and separately, explain the values of a dependent variable
• What determines house prices?

Selling price   House size   Age          Land size
(P10,000)       (squares)    (in years)   (meters squared)
260             20           5            420
240             15           12           640
245             20           9            600
210             13           15           590
230             18           9            700
242             14           7            720
295             28           1            624
235             16           12           590
287             24           2            710
252             20           5            630
270             23           5            700
275             25           5            710

Price (P10,000) = 221 + 2.6(house size) - 2.96(age) + 0.004(land size)
R² = 0.92
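The house price table can be fitted without SPSS. The sketch below solves the OLS normal equations in plain Python using Gaussian elimination; this is one of several ways to compute the fit and is not necessarily how SPSS does it. The function names are illustrative. The printed coefficients and R² can be compared with the equation reported above.

```python
# (selling price, house size, age, land size) from the table above.
houses = [
    (260, 20, 5, 420), (240, 15, 12, 640), (245, 20, 9, 600), (210, 13, 15, 590),
    (230, 18, 9, 700), (242, 14, 7, 720), (295, 28, 1, 624), (235, 16, 12, 590),
    (287, 24, 2, 710), (252, 20, 5, 630), (270, 23, 5, 700), (275, 25, 5, 710),
]


def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):  # back-substitution
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution


def fit_ols(rows):
    """Return ([constant, b1, b2, ...], R squared); first column of rows is Y."""
    ys = [row[0] for row in rows]
    xs = [[1.0] + list(row[1:]) for row in rows]  # leading 1.0 gives the constant
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    coefs = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, x)) for x in xs]
    mean_y = sum(ys) / len(ys)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return coefs, 1 - sse / sst


coefs, r2 = fit_ols(houses)
print("constant, size, age, land:", [round(c, 3) for c in coefs])
print("R squared:", round(r2, 2))
```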
Considerations when using multivariate analysis
• Can handle dichotomous independent variables
• Beware of ‘fishing expeditions’ using stepwise regression
• The independent variables must not be strongly correlated with each other (multicollinearity)
• How much explained variance is enough?
• Lots more assumptions
