A6 Regression Challenge ANSWERS
Hedgerows are the main nesting habitat of the grey partridge (Perdix perdix). A survey was carried out to
establish whether the abundance of hedgerows in agricultural land had an effect on the abundance of grey
partridge. From an area of agricultural land covering several farms, twelve plots were selected which had
land uses as similar as possible but differed in the density of hedgerows (km hedgerow per km2). Plots were
deliberately selected to cover a wide range of hedgerow densities. The total hedgerow lengths and exact
plot areas were measured using large-scale maps. The density of partridges was established by visiting
all fields in a study plot once immediately after dawn and once just before dusk, when partridges are feeding
and therefore most likely to be seen. Counts of birds observed were made on each visit and the dawn and
dusk data were averaged to give a value for partridge abundance for each study plot.
The data are stored in a CSV file PARTRIDG.CSV. Take note: this is a different data set from the one used
in the ‘Regression diagnostics’ chapter. The density of hedgerows (km per km2) is in the Hedgerow variable
and the density of partridges (no. per km2) is in the Partridge variable.
First steps
As always, start by loading the necessary packages. You’ll want the tidyverse, and ggfortify if you want
to make your diagnostic plots with autoplot.
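For example:
# Load the packages used in this sheet
library(tidyverse)  # data import, wrangling and plotting
library(ggfortify)  # autoplot method for model diagnostic plots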
Start by importing the data and having a quick look at it. How many variables are there? Is the format
appropriate for a regression (i.e. are the dependent and independent variables on an interval or ratio
scale)?
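One way to import the data, assuming PARTRIDG.CSV is saved in your working directory, is:
partridge <- read_csv("PARTRIDG.CSV")
This is what you should see when you take a quick look...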
glimpse(partridge)
## Rows: 12
## Columns: 2
## $ Hedgerow <dbl> 8.4, 9.1, 14.0, 21.9, 27.2, 32.5, 30.1, 15.9, 27.1, 34.8,...
## $ Partridge <dbl> 1.3, 8.2, 1.2, 12.7, 21.9, 19.4, 26.4, 8.4, 7.1, 25.2, 0....
Remember that regression also requires that the residuals are independent (i.e. that the value of one residual
does not depend on the value of other residuals) and that there is negligible measurement error in the values
of x. These two assumptions should have been considered earlier, as part of the experimental design.
Next make a quick plot of the data. Remember that the independent and dependent variables should go
on the x and y axes respectively. This plot allows for crude checks of whether there is a linear relationship
between the two variables and whether the variance is constant (i.e. that the scatter in y neither increases
nor decreases substantially with increasing values of x). Construct this picture.
ggplot(partridge, aes(x = Hedgerow, y = Partridge)) +
  geom_point()
[Figure: scatterplot of Partridge density (no. per km2) against Hedgerow density (km per km2)]
Here, there does appear to be a linear relationship and there is no obvious violation of the constant variance
assumption. Can you guess the intercept and slope? Pay close attention to the x- and y-axis values when
trying to estimate the intercept!
We will come back to these assumptions using regression diagnostics, after fitting the model.
Fit the model using the lm function, which takes a formula and a data frame containing the variables in the
formula as its two arguments. Remember that in the formula the dependent and independent variables go
on the left and right side of the ~ respectively.
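Putting that together (the object name partridge_model matches the calls to anova and summary below):
# Fit a simple linear regression of partridge density on hedgerow density
partridge_model <- lm(Partridge ~ Hedgerow, data = partridge)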
Check the model diagnostics before looking at the statistical output. This is important: there is no point
worrying about p-values until you’re convinced the assumptions of a model have been met. Produce the
output below, review the assumptions, and make sure you are satisfied that they are met. You might want
to review the slides/video/reading to remember which assumption is evaluated by which panel.
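With ggfortify loaded, one way to produce these diagnostic panels is:
# Standard lm diagnostic plots via ggfortify (base R alternative: plot(partridge_model))
autoplot(partridge_model)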
[Figure: the four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]
Remember that regression is fairly robust to small violations of these assumptions. If there are large
violations, transforming the data may help, e.g. using logs, square roots (if count data), arcsine square
roots (if percentages or proportions), or squares. After fitting the model to the transformed data, the model
diagnostics must be repeated to check whether or not the assumptions are now met. We will review
transformations next week.
Statistical output
If, and only if, the model assumptions have been met is it time to look at the model output. An F-test can
be used to determine whether the slope of the fitted model is significantly different to zero. This is carried
out using the anova function, which takes the name of the fitted model as its only argument. Produce this
table:
anova(partridge_model)
The table summarises the different parts of the F-test calculations: Df – degrees of freedom, Sum Sq – the
sum of squares, Mean Sq – the mean square, F value – the F-statistic, Pr(>F) – the p-value. The F-statistic
is the key term; larger values of the F-statistic indicate a stronger relationship between the independent and
dependent variables. A p-value of less than 0.05 indicates the result is statistically significant. Also make a
note of the degrees of freedom, as these should be included in the results.
In this example, the slope is significantly different to zero, which suggests that there is a significant effect of
hedgerow density on partridge density.
Next use the summary function to extract some more information about the fitted model. In particular,
summary gives the precise estimates of the intercept and slope from the model fit.
As with the anova function, it takes the name of the fitted model as its only argument. Produce this table:
summary(partridge_model)
##
## Call:
## lm(formula = Partridge ~ Hedgerow, data = partridge)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4153 -2.9011 0.1224 4.0648 6.3186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.6650 3.7905 -1.495 0.165903
## Hedgerow 0.8554 0.1661 5.150 0.000432 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.163 on 10 degrees of freedom
## Multiple R-squared: 0.7262, Adjusted R-squared: 0.6988
## F-statistic: 26.52 on 1 and 10 DF, p-value: 0.0004315
The first couple of lines here give the formula for the model. The next few lines give some properties of
the residuals, which can be ignored. Then comes the coefficients table. The first ((Intercept)) and second
(Hedgerow) rows relate to the intercept and slope of the fitted line, respectively. The columns (Estimate,
Std. Error, t value, and Pr(>|t|)) show the estimate of each coefficient, the standard error associated
with each coefficient, the corresponding t-statistic, and the p-value, respectively. The estimated intercept
and slope coefficients are the useful part here. They allow us to make predictions, i.e. for a given value of
hedgerow density, we can predict how many partridges we expect to find.
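For instance, at a hedgerow density of 20 km per km2 (an arbitrary illustrative value), the fitted line
predicts −5.665 + 0.8554 × 20 ≈ 11.4 partridges per km2:
# Predicted partridge density at an illustrative hedgerow density of 20
predict(partridge_model, newdata = data.frame(Hedgerow = 20))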
The Multiple R-squared value is also useful. This shows the proportion of variance in the dependent
variable that is explained by the independent variable. As it is a proportion it is always between 0 and 1,
with values of 0 or 1 indicating that none or all of the variation has been explained by the model, respectively.
The value of 0.73 here is reasonably high, but there is still some unexplained variation.
Be careful with the p-values in the summary output. Sometimes these are useful, sometimes they are not. In
this example, the summary output indicates that the slope, but not the intercept, is significantly different
to zero. In the case of a simple linear regression the p-value for the slope term produced by summary is the
same as that produced by the anova function above. The same equivalence is not seen with other kinds of
models (e.g. one way Analysis of Variance). In general, you should play it safe and use anova (not summary!)
to assess significance of model terms.
Presenting the results
There is a significant positive relationship between the density of hedgerows and partridges
(y = −5.7 + 0.86x; F = 26.5, d.f. = 1,10, p < 0.001).
The degrees of freedom (d.f.) should be reported as the slope degrees of freedom first, then the error degrees
of freedom.
Often the results are easier to interpret as a figure. See if you can review and implement the workflow from
the reading and video to produce the figure below (you made a template for exactly this while following the
video!).
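The first two steps of that workflow are not reproduced above. A minimal sketch of them, assuming the
partridge_model object fitted earlier, might be:
# Step 1: PREDICTION GRID: new x values spanning the observed hedgerow densities
newX <- data.frame(Hedgerow = seq(min(partridge$Hedgerow),
                                  max(partridge$Hedgerow),
                                  length.out = 100))
# Step 2: PREDICT: fitted values and a 95% confidence interval at each new x
newY <- predict(partridge_model, newdata = newX, interval = "confidence")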
# Step 3: HOUSEKEEPING: collect and rename things to make the plotting easier
addThese <- data.frame(newX, newY) %>%
  rename(Partridge = fit)
Now we are ready to use the predictions (addThese) AND the raw data together. Note how we first plot the
raw data, then overlay the predicted line and then overlay the confidence band.
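A sketch of that layering, using the addThese data frame built above, might be:
# Raw data first, then the fitted line, then the 95% confidence band
ggplot(partridge, aes(x = Hedgerow, y = Partridge)) +
  geom_point() +
  geom_line(data = addThese) +
  geom_ribbon(data = addThese, aes(ymin = lwr, ymax = upr), alpha = 0.3)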
[Figure: scatterplot of Partridge density against Hedgerow density with the fitted regression line and 95% confidence band overlaid]