
S4 Correlation (3)

The document discusses various statistical concepts including correlation, sampling techniques, and linear regression. It provides examples of how to interpret data, the importance of distinguishing between correlation and causation, and the use of regression lines for predictions. Additionally, it emphasizes the significance of context in analyzing relationships between variables and the potential pitfalls of extrapolation.

Uploaded by

polykaurgill

Correlation

RUGBY HIGH SCHOOL - MATHEMATICS DEPARTMENT 1


Starter

1. The table shows the length of time, to the nearest ms, that it takes a group of students to respond to a stimulus.

(a) Give two reasons why a histogram is appropriate for displaying this data.
Continuous data; classes are different widths
The bar to represent the students who took between 4 ms and 6 ms to respond is of width 1 cm and height 2.5 cm.
(b) Calculate the width and height of the bar representing students who took between 7.5 ms and 8.5 ms to respond.
Width = 0.5 cm, height = 4 cm
(c) Use linear interpolation to estimate the median time taken to respond.
$\frac{25 - 12}{15} \times 1 + 6 = 6.866\ldots$
(d) Calculate an estimate for the standard deviation.
$\sum ft = 330.25$, $\sum ft^2 = 2291.1875$, $\sigma = \sqrt{\frac{2291.1875}{50} - \left(\frac{330.25}{50}\right)^2} = 1.4824\ldots$
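The arithmetic in parts (c) and (d) can be checked with a short sketch. The inputs are read off the working shown in the answers, not from the table itself, so the meaning given to each number (50 students, a median class starting at 6 with width 1 and frequency 15, cumulative frequency 12 below it) is an assumption. The function names are illustrative only.

```python
from math import sqrt


def interpolated_median(lower, width, freq, cum_before, n):
    """Estimate the median of grouped data by linear interpolation."""
    # Position n/2 lies (n/2 - cum_before) values into the median class.
    return lower + (n / 2 - cum_before) / freq * width


def grouped_sd(sum_ft, sum_ft2, n):
    """Estimate the standard deviation from the grouped-data summations."""
    mean = sum_ft / n
    return sqrt(sum_ft2 / n - mean ** 2)


# Values assumed from the working shown in the answers to (c) and (d).
median = interpolated_median(lower=6, width=1, freq=15, cum_before=12, n=50)
sigma = grouped_sd(sum_ft=330.25, sum_ft2=2291.1875, n=50)

print(f"median estimate = {median:.3f}")
print(f"standard deviation estimate = {sigma:.4f}")
```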
2. Simon is researching the eating habits of the students in his school year.
He asks the first five people he sees on Monday morning.
(a) Write down the sampling technique Simon is using. Opportunity sampling
(b) State one advantage of this technique. Quick (and easy) to carry out.
(c) Suggest two improvements that Simon can make to the sampling technique.
Increase the number of people he asks.
Vary the time of day he asks people.



Correlation



Bivariate data is data which has pairs of values for two variables.
We are often interested in how one variable changes (usually y) as the other variable (usually x) changes. In this case we refer to y and x as the response variable and the explanatory variable, respectively.
An explanatory (or independent) variable is one that is set independently of the other variable. It is plotted along the x-axis.
A response (or dependent) variable is one whose values are determined by the values of the independent variable. It is plotted along the y-axis.
Explanatory Variable | Response Variable
Time for which a chemical reaction is allowed to proceed | Weight of chemical compound produced
Weight of chemical compound required | Time taken to produce this weight
An interval of time | Number of cars passing during this interval
Number of cars passing a junction | Time taken for these cars to pass

Your turn …

For each situation below, explain which quantity would be the explanatory variable, and which would be the
response variable.
1. The time spent practising the piano each week. Explanatory
The number of mistakes made in a test at the end of the week. Response
2. The age of a second hand car. Explanatory
The value of the second hand car. Response
3. The growth rate of a plant in an experiment. Response
The amount of sunlight falling on a plant in an experiment. Explanatory



You should already be familiar with the idea of scatter graphs and types of correlation.

Correlation describes the nature of the linear relationship between two variables.

[Figure: three scatter diagrams showing negative correlation, no correlation and positive correlation]


Two variables have a causal relationship if a change in one variable causes a change in the other. Just because two variables show correlation it does not necessarily mean that they have a causal relationship.

When two variables are correlated, you need to consider the context of the question and use your common sense to determine whether they have a causal relationship.

You should only use correlation to describe data that shows a linear relationship. Variables with no linear correlation could still show a non-linear relationship.



"Correlation does not imply causation" is a phrase used to emphasise that a
correlation between two variables does not necessarily imply that one causes the other.
Example 1

The faster windmills rotate, the more wind there is.


Therefore wind is caused by the rotation of windmills. (Or, simply put: windmills, as their name
indicates, are machines used to produce wind.)

In this example, the correlation between windmill activity and wind velocity does not
imply that wind is caused by windmills. It is rather the other way around, as suggested
by the fact that wind doesn’t need windmills to exist, while windmills need wind to
rotate. Wind can be observed in places where there are no windmills or non-rotating
windmills—and there are good reasons to believe that wind existed before the
invention of windmills!
Example 2

Since the 1950s, both the atmospheric CO2 level and obesity levels have increased sharply.
Hence, atmospheric CO2 causes obesity.
A more plausible explanation is a third factor: richer populations tend to eat more food and consume more energy.



Your turn …
Page 61 Ex 4A
1. Leanne is a keen follower of Straston Town Football Club. She collects
data on the crowd attendance, in thousands, at seven home games and
records the total number of goals scored in each match. She displays
this information in the scatter graph.
(a) Describe the correlation shown in the diagram. Positive (correlation)
Leanne says that there is a causal relationship between the size of the
crowd and the total number of goals scored.
(b) Comment on Leanne’s claim.
Unlikely to be causal, as you don’t know before the game how many goals will be scored. There could be a
third variable that influences the crowd size and number of goals scored, such as recent team
performance.
2. Nadine is investigating whether there is a linear
relationship between Daily Mean Pressure, p hPa,
and Daily Mean Air Temperature, t °C, in Beijing
using the 2015 data from the large data set.
Nadine chooses to use all of the data for Beijing
from 2015 and draws the scatter diagram.
(a) Explain, in context, what Nadine can infer
about the relationship between p and t using
the information shown. As pressure increases, temperature decreases.
(b) Using your knowledge of the large data set, explain why it is not meaningful to look for a linear
relationship between Daily Mean Wind Speed (Beaufort Conversion) and Daily Mean Air
Temperature in Beijing in 2015.
Daily mean wind speed (Beaufort) is a qualitative variable.



Discuss each of the following.

1. Ice cream sales and the number of shark attacks on swimmers are positively correlated. Can I
conclude that a rise in ice cream sales is going to cause more shark attacks?

2. Children with bigger feet spell better. So, a better ability to spell is caused by big feet.

3. The more firefighters fighting a fire, the bigger the fire is going to be. Therefore, firefighters cause fires.

4. People are taller today than 500 years ago. Health and diet have improved over the last 500 years.
So, better health and diet have caused people to become taller.

5. As the number of pirates has decreased, global warming has increased. So, global warming is caused
by a lack of pirates.

1. Of course ice cream does not cause shark attacks! Ice cream sales and shark attacks both increase
during warm weather. So, the two variables are positively correlated but there is no causal relationship
between the two!

2. A child’s shoe size and their ability to spell are both related to the child’s age. Children with bigger
feet spell better because they are older, their greater age bringing about bigger feet and, not quite so
certainly, better spelling. Thus the two variables are positively correlated but there is no causal
relationship.

3. No! It’s the other way round. Fire causes firefighters.

4. It makes sense. It’s also why people live longer. We still need proof though!

5. No! We don’t need more pirates! These are completely unrelated and are a coincidence.



Exam Style Question

Jerry is studying visibility for Camborne using the large data set for June 1987.

Jerry drew the following scatter diagram, Figure 2, and calculated some statistics using the June 1987 data
for Camborne from the large data set.

Jerry defines an outlier as a value that is more than 1.5 times the interquartile range above Q3 or more
than 1.5 times the interquartile range below Q1.
(a) Show that the point circled on the scatter diagram is an outlier for visibility.
(b) Interpret the correlation between the daily mean visibility and the daily maximum relative humidity.



Starter

1. The daily mean temperature, in °C, and the daily total sunshine, in hours, were recorded on 7 days in one month in 1987 in Camborne.
The scatter diagram shows this data.

(a) Describe the type of correlation shown by the scatter diagram.
Positive correlation
(b) Interpret the correlation in context.
As the daily mean temperature increases, the total daily sunshine increases.
(c) Using your knowledge of the large data set, state which month these data might have been
sampled from. Any one of June, July, August

2. Data for the daily mean windspeed (in knots) in Leuchars in July 1987 is taken from the large data
set.
3 4 5 5 5 5 5 5 5 5 6
6 6 7 7 7 8 8 8 8 9 9
9 9 10 11 11 12 15 16 19
(a) Calculate the median and the interquartile range.
Median = 7, LQ = 5, UQ = 9, IQR = 4
An outlier is defined as a value which lies either more than 1.5 x the interquartile range above the upper quartile or
more than 1.5 x the interquartile range below the lower quartile.
(b) Determine whether there are any outliers in the data.
16 & 19 are the only outliers
(c) Draw a box plot for this data.
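The quartile and outlier working for the Leuchars windspeed data can be sketched as below. The quartile rule used (find k × n / 4; if it is a whole number take the midpoint of that value and the next, otherwise round up) is an assumption about the intended method, and the function name is illustrative.

```python
def quartile(sorted_data, k):
    """Return the k-th quartile (k = 1, 2 or 3) of already sorted data.

    Assumed rule: if k*n/4 is a whole number, average that value and the
    next one; otherwise round the position up to the next whole number.
    """
    n = len(sorted_data)
    if (k * n) % 4 == 0:
        pos = k * n // 4
        return (sorted_data[pos - 1] + sorted_data[pos]) / 2
    pos = -(-(k * n) // 4)  # ceiling division
    return sorted_data[pos - 1]


# Daily mean windspeed (knots), Leuchars, July 1987, as listed in the question.
data = sorted([3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6,
               6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9,
               9, 9, 10, 11, 11, 12, 15, 16, 19])

lq, median, uq = quartile(data, 1), quartile(data, 2), quartile(data, 3)
iqr = uq - lq

# Outliers lie more than 1.5 x IQR beyond the quartiles.
outliers = [v for v in data if v > uq + 1.5 * iqr or v < lq - 1.5 * iqr]

print(median, lq, uq, iqr, outliers)
```

Note that 15 sits exactly on the upper boundary (9 + 1.5 × 4 = 15), so it is not counted as an outlier under the "more than" wording.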



Linear Regression



When a scatter graph shows correlation, you can draw a line of best fit. We can only fit a
straight line of best fit to a scatter graph if the points lie roughly in a straight line.
When we fit a straight line of best fit to a scatter graph it has the form
y = a + bx
where a is the point of intercept with the y-axis and b is the gradient of the line (i.e. the amount by
which y increases for an increase of 1 in x).

For each point on the scatter diagram we can express y in terms of x as y = a + bx + ε, where ε is the vertical
distance from the line of best fit.

The values ε1, ε2, etc. are known as residuals.

[Figure: scatter diagram with a line of best fit and the residuals ε1, ε2, ε3, ε4 marked as vertical distances from the line]

The line that minimises the sum of the squares of the residuals is called the least squares regression line.
The line y = a + bx is called the regression line of y on x.
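For reference, the least squares condition leads to closed-form values for the intercept and gradient. The block below is a sketch of the standard result, writing the line as y = a + bx with intercept a and gradient b; the summary statistics S_xx and S_xy are notation assumed here and are not defined in these slides.

```latex
S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}, \qquad
S_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}

b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\,\bar{x}
```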



Example 3

From the large data set, the daily maximum temperature and the daily total sunshine for
12 days in May in Heathrow in 2015 were recorded. The data was plotted on a scatter graph.

[Figure: scatter graph of Daily Total Sunshine against Daily Max Temp]

(a) Describe the correlation between daily maximum temperature and daily total sunshine.
There is a positive correlation.

The equation of the regression line of daily total sunshine on daily maximum temperature for
these 12 days is [equation missing from source]
(b) Give an interpretation of the value of the gradient of this regression line.
If the daily maximum temperature increases by 1 °C, the daily total sunshine increases by 0.897 hours.
Your interpretation must include the context and the numerical value of the gradient.
(c) Justify the use of a linear regression in this instance.
The points generally lie close to a straight line so a linear regression line is a suitable model.
A regression line is only a valid model when the data shows linear correlation.



Your turn …

A survey of 10 two-bedroom flats in a particular town is carried out. The value of each flat, in £1000s, and its
distance, in km, from the town centre are recorded.

A scatter diagram is drawn to show the data.

(a) Give a reason to support the use of linear regression to model this data.
The data exhibits (strong) linear correlation OR the points lie close to a straight line.
The equation of the regression line of value on distance is found to be [equation missing from source]
(b) Interpret the values of the intercept and the gradient in the regression equation.
Intercept: the value, in £1000s, when the flat is right in the centre of the town.
Gradient: the value decreases by £10,000 for every 1 km from the town centre.
The point representing one of the flats has been circled on the graph.
(c) State, with a reason, whether you think this point is likely to represent a recording error or
a valid data value.
It is likely to be valid, as the value is not unrealistic. It could be that the
flat is in a more expensive area than the others sampled, or it could
have significantly bigger rooms.



A regression line can be used to estimate the value of the dependent variable for any value of
the independent variable.

Interpolation is when you estimate the value of a dependent variable within the range of
observed data values.

Extrapolation is when you estimate a value outside the range of observed data values.
Extrapolated values can be unreliable and should be viewed with caution.



Example 4

The data in the table refer to a chain of shops. The figures reported are the number of sales staff (x) and the
average daily takings in thousands of pounds (y).
x | 17 39 32 17 25 43 25 32 48 10 48 42 36 30 19
y | 7 17 10 5 7 15 11 13 19 3 17 15 14 12 8

The least squares regression line of y on x is found to be y = -0.324 + 0.384x
Note: This regression line should only be used to predict values of y for values of x.
(a) Use the equation to estimate the average daily takings of a shop with 21 staff.
(b) The company proposes opening a new superstore with a staff of 85. Would it be sensible to use the
regression model to work out the potential takings for the new store? Explain your answer.
(c) Interpret what the values -0.324 and 0.384 tell you in the context of the question.
(a) y = -0.324 + 0.384 × 21 = 7.74 (£ thousand) = £7740
(b) No. 85 staff is a lot larger than any of the observed values so extrapolation is not very reliable.
(c) -0.324 is the value where the line crosses the y-axis, i.e. when x = 0.
In the context of the question this means that when there are no sales staff the shop can expect to
make a loss of 0.324 thousand pounds (or £324).
0.384 is the gradient of the regression line, i.e. when x increases by 1 the y value increases
by 0.384.
In the context of the question this means that for each additional member of staff the
takings increase by 0.384 thousand pounds (or £384).
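The coefficients -0.324 and 0.384 quoted in Example 4 can be reproduced from the table with a minimal sketch. It takes the first row of the table as the number of sales staff and the second as the takings, and uses the standard least squares formulas (gradient = Sxy / Sxx, intercept = mean of y minus gradient × mean of x). The function and variable names are illustrative, not part of the original slides.

```python
def least_squares_line(xs, ys):
    """Return (intercept, gradient) of the least squares regression line of y on x."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    gradient = sxy / sxx
    intercept = sum(ys) / n - gradient * sum(xs) / n
    return intercept, gradient


# Example 4 data: number of sales staff and average daily takings (£1000s).
staff = [17, 39, 32, 17, 25, 43, 25, 32, 48, 10, 48, 42, 36, 30, 19]
takings = [7, 17, 10, 5, 7, 15, 11, 13, 19, 3, 17, 15, 14, 12, 8]

a, b = least_squares_line(staff, takings)
estimate_21 = a + b * 21  # part (a): a shop with 21 staff

# Part (b): 85 staff lies outside the observed range, so this would be extrapolation.
in_observed_range = min(staff) <= 85 <= max(staff)

print(f"intercept = {a:.3f}, gradient = {b:.3f}")
print(f"estimated takings for 21 staff = {estimate_21:.2f} (£ thousand)")
print(f"85 staff within observed range: {in_observed_range}")
```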



Your turn …
Page 65 Ex 4B
1. The daily mean temperatures, t °C, and the daily mean
pressures, p hPa, are recorded for a random sample of 10 days
in Beijing in June 2015. The data is displayed in a scatter diagram.
The equation of the regression line of p on t is calculated to
be [equation missing from source]
(a) Comment on the validity of using the regression line to
estimate the daily mean pressure when the daily mean
temperature is:
(i) [value missing] (ii) [value missing]
Invalid: outside the range of the data
Valid: inside the range of the data
(b) Explain why it would not be appropriate to use the regression line
to predict the daily mean temperature when the daily mean pressure
is 1009 hPa.
This is the regression line for p on t. To predict temperature
from pressure you would need the regression line for t on p.
(c) With reference to correlation, comment on how accurately the regression model reflects the data.
The data shows some degree of linear correlation, which suggests that the linear regression model will be
a reasonable model for the data.
2. The relationship between two variables p and t is modelled by the regression line with equation
p = 22 – 1.1 t
The model is based on observations of the independent variable, t, between 1 and 10.
(a) Describe the correlation between p and t implied by this model. Negative
Given that p is measured in centimetres and t is measured in days,
(b) state the units of the gradient of the regression line. cm/day
Using the model,
(c) calculate the change in p over a 3-day period. 3 × (–1.1) = –3.3 (i.e. a decrease of 3.3 cm)
Tisam uses this model to estimate the value of p when t = 19.
(d) Comment, giving a reason, on the reliability of this estimate.
19 is outside the range of the data (1-10), i.e. involves extrapolation, so unreliable.
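The reasoning in question 2 can be sketched in code: the model p = 22 – 1.1t is evaluated directly, and a simple range check flags whether a requested value of t is an interpolation or an extrapolation. The function names and the constants T_MIN and T_MAX are illustrative; only the equation and the observed range of t (1 to 10) come from the question.

```python
T_MIN, T_MAX = 1, 10  # observed range of the independent variable t (days)


def predict_p(t):
    """Evaluate the regression model p = 22 - 1.1t (p in cm, t in days)."""
    return 22 - 1.1 * t


def is_extrapolation(t):
    """True when t lies outside the range of the observed data."""
    return not (T_MIN <= t <= T_MAX)


# Part (c): the change in p over any 3-day period is 3 x gradient.
change_over_3_days = predict_p(4) - predict_p(1)

# Part (d): t = 19 is outside 1-10, so the estimate relies on extrapolation.
print(f"change over 3 days = {change_over_3_days:.1f} cm")
print(f"t = 19 is extrapolation: {is_extrapolation(19)}")
print(f"t = 5 is extrapolation: {is_extrapolation(5)}")
```

The check only says whether an estimate is inside the observed range; it does not measure how well the line fits the data within that range.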



Exam Style Question 1

A company is introducing a job evaluation scheme. Points (x) will be awarded to each job based on the
qualifications and skills needed and the level of responsibility. Pay (£y) will then be allocated to each job
according to the number of points awarded.

Before the scheme is introduced, a random sample of 8 employees was taken and the linear regression
equation of pay on points was y = 4.5x – 47

(a) Describe the correlation between points and pay.

(b) Give an interpretation of the gradient of this regression line.

(c) Explain why this model might not be appropriate for all jobs in the company.



Exam Style Question 2

To test the heating of tyre material, tyres are run on a test rig at chosen speeds under given
conditions of load, pressure and surrounding temperature. The following table gives values
of x, the test rig speed in miles per hour (mph), and y, the temperature in °C generated in the
shoulder of the tyre, for a particular tyre material.

x (mph) | 15 20 25 30 35 40 45 50
y (°C) | 53 55 63 65 78 83 91 101
(a) Draw a scatter diagram to represent these data.
(b) Give a reason to support the fitting of a regression line of the form y = a + bx
through these points.

The regression line for y on x is [equation missing from source]

(c) Give an interpretation for each of the values in the regression line.
(d) Use the equation of the line to estimate the temperature at 50 mph and explain
why this estimate differs from the value given in the table.
A tyre specialist wants to estimate the temperature of this tyre material at 12 mph
and 85 mph.
(e) Explain briefly whether or not you would recommend the specialist to use this
regression equation to obtain these estimates.



[Figure: scatter diagram of Temp (°C), 50 to 100, against Speed (mph), 15 to 55. Mark scheme: Scales & labels B1; Points B2, 1, 0]



(c)

(d)

(e)
