Statistics GCSE Revision Notes
Categorical – data that can be sorted into non-overlapping categories such as gender. Used for qualitative
data so that it can be more easily processed.
Ordinal (rank) – quantitative data that can be given an order or ranked on a rating scale, e.g. marks in an
exam.
Bivariate – Involves measuring 2 variables. Can be qualitative or quantitative, grouped or ungrouped. Usually
used with scatter diagrams where the two axes represent the two different variables. One variable is often
called the explanatory variable and the other the response variable.
Multivariate – Made up of more than 2 variables e.g. comparing height, weight, age and shoe size together.
Grouping Data
Grouping data using tables makes it easier to spot patterns in the data and quickly see how the data is distributed.
Discrete data can be grouped into classes that do not overlap e.g. 0-10, 11-15… (they do not have to have
equal class width). Use narrower classes where a lot of the data is close together and wider
classes where the data is more spread out.
Continuous data can be grouped using inequalities. The class intervals must not have gaps between them or
be overlapping so inequality symbols must be used with one of the symbols being < and the other ≤.
Pros:
o Makes the data easy to read and understand.
o Easy to spot patterns and compare data.
Cons:
o Loses accuracy of data as you no longer know exact data values.
o Calculations made from these will only be an estimate e.g. mean.
[Diagram: sources of data – questionnaires, interviews, databases, newspapers/magazines, websites]
Sampling Methods:
[Diagram: sampling methods, including systematic, opportunity, judgement and cluster sampling]
Random Sample – Every item/person in the population has an equal chance of being selected.
o Method:
Assign a number to every member in the population.
Mention the random sampling technique you are going to use e.g. a random number table
or a random number generator on a calculator.
Select the numbers chosen from your population.
Ignore any repeats and choose another number.
o Random Sampling Techniques:
Pick numbers/names out of a hat (only works for small samples)
Using a random number table
Using the random number generator function on a calculator or computer.
o Advantages:
Sample is representative as every member of the population has an equal chance of being
selected.
Unbiased
o Disadvantages:
Need a full list of population (not always easily obtainable)
Not always convenient as it can be expensive and time consuming.
Needs a large sample size
Stratified Sample – the size of each stratum (group) in the sample is in proportion to the size of that stratum in the
population. E.g. if group A accounts for 10% of the population, group A will also make up 10% of the
sample.
o Method:
Split the population into groups (usually done for you in the exam)
Use the formula $\text{stratified sample} = \frac{\text{strata}}{\text{total}} \times \text{sample size}$ to calculate the sample size
for each group. (Remember to check totals if you rounded numbers and adjust accordingly if
your total sample size after stratification is bigger/smaller than the sample size in the question.)
Use random sampling to select members from each strata/group.
o Advantages:
Sample is in proportion to population, so sample represents the population fairly.
Best used for populations with groups of unequal sizes.
o Disadvantages:
Time consuming
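The stratified sample formula above can be sketched in a few lines of Python. The year-group sizes and the sample size of 50 are invented for illustration and are not from the notes.

```python
# Hypothetical population: number of students in each year group (invented for illustration).
strata = {"Year 9": 180, "Year 10": 150, "Year 11": 170}
sample_size = 50

total = sum(strata.values())  # 500 students altogether

# (size of group / total population) x sample size, rounded to whole people
allocation = {name: round(size / total * sample_size) for name, size in strata.items()}

print(allocation)  # {'Year 9': 18, 'Year 10': 15, 'Year 11': 17}

# Check that the rounded values still add up to the sample size asked for.
print(sum(allocation.values()) == sample_size)  # True
```

If rounding made the total one too many or one too few, you would adjust one of the groups by hand, as the method says.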
Cluster Sampling – The population is divided into natural groups (clusters), some clusters are chosen at random
and every member of the chosen clusters is sampled. Useful for large populations e.g. when surveying lots of different
towns in a country.
o Advantages:
Economically efficient – less resources required.
Can be representative if lots of small clusters are sampled.
o Disadvantages:
Clusters may not be representative of the population and may lead to a biased sample.
High sampling error.
Quota Sampling – Population is grouped by characteristics and a fixed amount is sampled from every group.
o Method:
Group population by characteristics e.g. gender and age
Select quota (amount) for each group e.g. 30 men under 25, 40 women over 30 etc.
Obtain sample by finding members of each group until quota is reached.
o Advantages:
Quick to use.
Cheap.
Do not need sample frame or full list of the population.
o Disadvantages:
NOT RANDOM – biased as interviewer is choosing who will be in the sample so every
member of the population does not have an equal chance of being selected.
Opportunity Sampling – Using the people/items that are available at the time. E.g. interviewing the first 10
people you see on a Monday morning.
o Advantages:
Quick
Cheap
Easy
o Disadvantages:
NOT RANDOM. The sample has not been collected fairly so it may not represent the
population and every member of the population has not been given an equal chance to be
selected.
Judgement Sampling – When the researcher uses their own judgement to select a sample that they think will
represent the population. E.g. a teacher choosing students to interview about their opinion on a new after
school club.
o Advantages:
Easy
Quick
o Disadvantages:
NOT RANDOM.
Quality of sample depends on the person selecting the sample. The researcher may be
biased and unreliable in the sample they select.
Petersen Capture-Recapture - Used to estimate the size of large or moving populations where it would be
impossible to count the entire population. Your answer is only an ESTIMATE.
Method:
1. Take a sample of the population
2. Mark each item
3. Put the items back into the population and ensure they are thoroughly mixed
4. Take a second sample and count how many of your sample are marked
5. The proportion of marked items in your new sample should be the same as the proportion of
marked items from the population in your first sample.
Assumptions:
Population has not changed – no births/deaths
Probability of being caught is equally likely for all individuals.
Marks/tags not lost
Sample size is large enough and is representative of the population.
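Step 5 of the method says the proportion marked in the second sample should match the proportion of the whole population that was marked in the first sample. Rearranging that gives an estimate for the population size. The sketch below uses made-up numbers, and the function name is only illustrative.

```python
def estimate_population(first_sample, second_sample, marked_in_second):
    """Capture-recapture estimate.

    Assumes: marked_in_second / second_sample = first_sample / population,
    so population = first_sample * second_sample / marked_in_second.
    """
    return first_sample * second_sample / marked_in_second

# Hypothetical figures: 40 fish are caught and marked, then a second
# sample of 50 fish contains 8 marked fish.
estimate = estimate_population(40, 50, 8)
print(estimate)  # 250.0
```

Remember that this is only an estimate and depends on the assumptions listed above.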
Experiments – used when a researcher is interested in how changes in one variable affect another.
Variables:
o Explanatory (Independent) Variable – The variable that is changed.
o Response (dependent) variable – The variable that is measured.
o Extraneous Variables – Variables you are not interested in but that could affect the result of
your experiment.
Laboratory Experiments – Researcher has full control over variables. Conducted in a lab or similar
environment.
o Example - measuring reaction times of people of different ages.
Explanatory variable - age
Response variable - reaction time.
Extraneous variables - gender, health condition, fitness level etc.
o Advantages:
Easy to replicate – makes results more reliable.
Extraneous variables can be controlled so results are more likely to be valid as you
can be sure that other factors are not affecting your results.
o Disadvantages:
People may behave differently under test conditions than they would under real-life
conditions – could affect validity of results.
Field Experiments – Carried out in the everyday environment. The researcher has some control over the
variables. They set up the situation and control the explanatory variable but have less control over
extraneous variables.
o Example – Testing new methods of revision.
Explanatory variable – method of revision
Response variable – results in exam
Extraneous variables – amount of revision each pupil does, ability of pupils.
o Advantages:
More accurate – reflects real life behaviour.
o Disadvantages:
Cannot control extraneous variables.
Not as easy to replicate – less reliable than lab experiments.
Natural Experiments - Carried out in the everyday environment. Researcher has no/very little
control over the variables. Explanatory variables are not changed but instead researchers look at
something that already exists in the world and how it affects other things.
o Example – the effect of education on level of income
Explanatory variable – level of education
Response variable – income
Extraneous variables – IQ, other skills people may have, personal circumstances
o Advantages :
Reflects real life behaviour
o Disadvantages:
Low validity – extraneous variables are not controlled which may affect results
instead of explanatory variable.
Difficult to replicate.
Cannot control extraneous variables.
Simulation – A way to model random events using random numbers and previously collected data. These could be
used to help you predict what could actually happen in real life.
Easier and cheaper than actually collecting the data.
Steps:
1. Choose a suitable method for getting random numbers – dice, calculator, random number tables.
2. Assign numbers to the data.
3. Generate the random numbers.
4. Match the random numbers to your outcomes.
Example:
You sell milk, dark and white chocolates in a shop. P(milk) = 3/6, P(white) = 1/6, P(dark) = 2/6.
Simulate the choice of chocolates that the next 10 customers will buy.
We are not looking at theoretical probability for each chocolate otherwise we could just work out 3/6 of 10 and so
on. We are using these to assign numbers to generate random numbers from that will tell us which chocolate each
customer will choose. So, a bit more like experimental probability/relative frequency without the real-life situation.
1. Use a dice as there are 6 numbers in this scenario.
2. 3/6 of 6 is 3 so assign numbers 1, 2, 3 on the dice to milk chocolate. 1/6 of 6 is 1 so assign the next
number, 4, to white chocolate. Assign numbers 5 and 6 on the dice to dark chocolate.
3. Roll the dice 10 times to generate the random numbers and record the results. E.g. 3,3,4,5,1,5,1,3,5,2.
4. Match the numbers to the outcomes – M, M, W, D, M, D, M, M, D, M.
You now know for the next ten customers you need 6 milk chocolates, 1 white chocolate and 3 dark
chocolates.
Note that these results do not match with the probabilities in the question and they won’t always as this is
mimicking real life situations. Also remember that since this is a simulation these results are not necessarily
accurate. To get a more reliable simulation repeat the simulation lots of times.
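The chocolate simulation can also be run on a computer. This is a minimal sketch that uses Python's random module in place of a real dice; the names OUTCOMES and match_rolls are invented for illustration.

```python
import random

# Dice numbers assigned to each chocolate, as in the example:
# 1-3 milk (M), 4 white (W), 5-6 dark (D).
OUTCOMES = {1: "M", 2: "M", 3: "M", 4: "W", 5: "D", 6: "D"}

def match_rolls(rolls):
    """Step 4: match each random number to its outcome."""
    return [OUTCOMES[r] for r in rolls]

# The ten rolls recorded in the example.
example_rolls = [3, 3, 4, 5, 1, 5, 1, 3, 5, 2]
choices = match_rolls(example_rolls)
print(choices)
print({c: choices.count(c) for c in "MWD"})  # {'M': 6, 'W': 1, 'D': 3}

# Step 3 done by computer instead of a real dice (results change every run).
new_rolls = [random.randint(1, 6) for _ in range(10)]
print(match_rolls(new_rolls))
```

Running the last two lines many times and averaging the counts gives a more reliable picture than a single run.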
Questionnaires/Interviews:
A source of primary data
Questionnaire – A set of questions used to obtain data from the population/sample. Can be carried out via post,
email, phone or face to face. The person completing the questionnaire is called the respondent.
Questions can be open or closed.
Open questions: Allows any answer. However, the wide range of different answers makes it difficult to
analyse the data.
Closed questions: Has a fixed number of non-overlapping option boxes that only allow for specific answers
or opinion scales. This makes data easier to analyse.
Cleaning Data – fixing problems with the data. This could be done by:
Identifying and correcting/removing incorrect data values or outliers.
Removing units or symbols from the data,
Putting all the data in the same format e.g. m/cm, capital/lowercase, words/letters.
Deciding what to do about missing data.
o Use random selection to select 2 groups of people, control and experimental groups.
o Give the test group the treatment, control group no treatment
o Compare results from 2 groups to see how effective treatment is
Conditions must be exactly the same for both groups, only treatment must be different.
Matched pairs - 2 groups of equally matched (age/gender etc.) people used to test effect of a particular
factor. Everything in common except factor being studied.
The “pairs” don’t have to be different people; they could be the same individuals at different times. For
example:
The same study participants are measured before and after an intervention.
The same study participants are measured twice for two different interventions.
The purpose of matched samples is to get better statistics by controlling for the effects of other “unwanted”
variables. For example, if you are investigating the health effects of alcohol, you can control for age-related
health effects by matching age-similar participants.
Tables
Databases – Tables with a collection of data. They are a form of secondary data as the data is available
online and, in most cases, easily accessible.
These tables usually contain information from real-life statistics, and you will be asked in the exam to
extract and interpret information from them. These questions have multiple parts and many 1-mark
sub-questions. You need to be able to use these tables to identify values, calculate
totals/differences/percentages, describe trends and explain inconsistencies. One of the main
inconsistencies will be that the percentages do not add up to 100%; this is due to rounding errors
because individual percentages for columns/rows in the tables have been rounded.
As the data represents real-world statistics you may be
asked to explain reasons for trends. Think about the data
in terms of real-life rather than just an exam question.
What real-life situation may affect the data you have?
Pictograms
Uses pictures or symbols to represent a particular amount of data. Always has a key to show the amount
each symbol represents.
When drawing a pictogram, make sure that:
Each symbol is the same size
The symbols represent numbers that can be easily divided
to show different frequencies, e.g. for a symbol that
represents 4, you can draw a quarter of the symbol to
show a frequency of 1.
Spacings are the same in each row.
There is a key to show the frequency that each symbol
represents.
Bar Charts
Simple Bar Charts
o Bars are equal width
o Equal gaps between bars
o Frequency on y-axis
Interpreting Pie Charts – Remember pie charts show proportion and not numbers.
If pie chart B is larger than pie chart A then pie chart B has a greater frequency.
If both pie charts then have the same angle for a sector that means that sector has a greater
frequency in pie chart B even though the proportions are the same because it has a larger area.
Population Pyramids
Shows distribution of ages in a population, in numbers or proportion/percentages.
They are used to compare two sets of data, usually genders or two geographical areas.
When comparing the data look at the shape of the distribution.
If it looks like a pyramid with smaller bars at the top that means there is a higher proportion of
younger people in the population and fewer older people. This could be because of short life
expectancy (how long people live), high birth rates or high death rates.
If the diagram looks more or less straight that means there is a similar proportion of older and
younger people in the population which could be because of lower birth/death rates or that the
life expectancy is increasing.
An upside-down pyramid with larger bars at the top and smaller bars at the bottom shows that
the population has a larger proportion of older people compared to younger people. This could
be because of low birth/death rates, longer life expectancy or the location might be far from
the city or a coastal area where older people are retiring to.
Choropleth Maps (not Chloropeth)
Think colour by numbers.
They split a geographical area into different regions which are then shaded.
The darker the shading the higher the frequency for that area.
Each map has a key to show what the shading represents.
Interpreting:
The area of the map which is shaded darkest has the highest proportion/percentage.
Look at the key for the shading to read off percentages/numbers.
$\text{Frequency Density} = \frac{\text{Frequency}}{\text{Class Width}}$
Drawing Histograms:
1. Calculate class widths for each class interval
2. Calculate frequency density for each class interval using FD = F/CW formula.
3. Draw a suitable scale on y-axis labelled frequency density.
4. Draw bars using frequency density data. (Remember the bars have no gaps in between)
To compare histograms, they need to have the same class intervals and frequency density scales.
When comparing histograms, describe the shape of the distribution and what this shows.
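Steps 1 and 2 of drawing a histogram can be checked with a short calculation. The class intervals and frequencies below are invented for illustration.

```python
# Hypothetical grouped data: (lower bound, upper bound, frequency).
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 16), (40, 80, 8)]

# Step 1: class width = upper bound - lower bound.
# Step 2: frequency density = frequency / class width.
densities = [frequency / (upper - lower) for lower, upper, frequency in classes]

for (lower, upper, frequency), fd in zip(classes, densities):
    print(f"{lower}-{upper}: width {upper - lower}, frequency {frequency}, frequency density {fd}")
```

Notice that the class 20-40 has the highest frequency but not the tallest bar, because its class width is larger.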
Misleading Diagrams
Diagrams can be misleading because of their shape or because of axes and scales.
Averages
A measure of central tendency (represents the ‘centre’ of a set of data). Includes mode, median and mean.
Mode
The one that appears the most (remember the Mo in mode and Mo in most) – the most common value.
Modal Class – the class with the highest frequency (the frequency value is not the mode but the
column/row next to it).
Median
The middle value.
Discrete Data:
1. Put the numbers in order from smallest to largest.
2. The median is the $\frac{1}{2}(n + 1)$th value – this means find your total frequency, add 1 to it and then
divide by 2. The answer is not your median but the position your median will be in the list of data.
Example: If total frequency is 23, the median position will be ½ (23+1) = 12th number in the list after
being put in order.
3. Find the median position from the list of values. This is your median value.
If the median position is a decimal value such as 7.5 then you would find the 7th and 8th values in
the list, add them together and then divide by 2.
If data is in a frequency table, add the frequency values (like you would do with cumulative frequency)
until you reach a row that includes the median position in between it. The median is the category/class
that contains the $\frac{1}{2}(n + 1)$th value.
Grouped Data:
For grouped continuous data (has classes with the inequality symbols), the median is the ½nth value.
The median class is the class interval which contains the median position.
Sometimes you may be asked to work out an estimate for the median value rather than the median class.
For grouped data your median will always be an estimate as you do not know exact values.
Discrete Data:
1. Add all the values.
2. Divide by the number of values.
Formula for Mean: $\bar{x} = \frac{\sum x}{n}$
Where $\bar{x}$ = mean
$\sum$ = Greek letter sigma which is the symbol for ‘sum of’
$x$ = data values
$n$ = number of data values
So $\sum x$ means sum of all values
Formula: $\frac{\sum fx}{\sum f}$, f stands for frequency
Formula: $\frac{\sum (f \times \text{midpoint})}{\sum f}$
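As a rough illustration of the two formulas above, the sketch below works out the mean from a frequency table and an estimated mean from a grouped table. Both tables are invented for illustration.

```python
# Hypothetical frequency table: data value x -> frequency f.
table = {0: 3, 1: 7, 2: 6, 3: 4}

sum_fx = sum(x * f for x, f in table.items())  # 0 + 7 + 12 + 12 = 31
sum_f = sum(table.values())                    # 20
mean = sum_fx / sum_f
print(mean)  # 1.55

# Hypothetical grouped table: (lower bound, upper bound, frequency).
grouped = [(0, 10, 4), (10, 20, 10), (20, 30, 6)]

# Use the midpoint of each class in place of x.
sum_f_midpoint = sum(f * (lower + upper) / 2 for lower, upper, f in grouped)  # 20 + 150 + 150
estimated_mean = sum_f_midpoint / sum(f for _, _, f in grouped)
print(estimated_mean)  # 16.0
```

The grouped answer is only an estimate because the midpoints stand in for the unknown exact values.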
Weighted Mean
For data that has different number of values or weights in each group. It is used to combine different sets
of data where one set is more important (has more weighting) than another – example: Maths and English
for progress 8 scores has twice the weighting as other subjects or subjects where controlled assessment is
25% and exam is 75% of final mark.
$\text{Weighted Mean} = \frac{\sum (\text{weight} \times \text{value})}{\sum \text{weights}}$
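A small sketch of the weighted mean, using the 25% controlled assessment and 75% exam weighting mentioned above. The two marks (80 and 60) are made up for illustration.

```python
def weighted_mean(values, weights):
    """Sum of (weight x value) divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Hypothetical marks: 80 in the controlled assessment (weight 25)
# and 60 in the exam (weight 75).
final_mark = weighted_mean([80, 60], [25, 75])
print(final_mark)  # 65.0
```

The final mark is closer to the exam mark because the exam carries more weight.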
Geometric Mean
The nth root of the product of all n values. Useful for comparing things with different properties that
aren’t immediately comparable, for example when they are measured out of different totals or when values
depend on previous values. Useful for looking at average growth rates.
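To illustrate the geometric mean with growth rates, the sketch below assumes something grows by 25% in one year and 80% in the next, so the multipliers are 1.25 and 1.8. These figures are invented for illustration.

```python
from math import prod

def geometric_mean(values):
    """The nth root of the product of the n values."""
    return prod(values) ** (1 / len(values))

# Hypothetical yearly growth multipliers: +25% then +80%.
multipliers = [1.25, 1.8]
gm = geometric_mean(multipliers)
print(round(gm, 4))  # 1.5
```

An average multiplier of 1.5 means the average growth is 50% per year, since 1.5 × 1.5 gives the same overall growth as 1.25 × 1.8.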
Transforming Data
For large data values you may want to make the numbers smaller so that it saves you time working with
large numbers even on a calculator (it is easy to make mistakes typing in large numbers).
You can find the mean by taking away the same large number from all the values (so that you’re left with
smaller values), find the mean of these numbers and then add that number back on.
This works best with whole numbers.
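The idea of transforming data can be checked with a quick example. The values and the number subtracted (1000) are made up for illustration.

```python
# Hypothetical large data values.
values = [1003, 1007, 1012, 1018]
shift = 1000  # the same large number taken away from every value

smaller = [v - shift for v in values]          # [3, 7, 12, 18]
mean_of_smaller = sum(smaller) / len(smaller)  # 40 / 4 = 10.0
mean = mean_of_smaller + shift                 # add the number back on

print(mean)                                # 1010.0
print(mean == sum(values) / len(values))   # True
```

The last line confirms that the shortcut gives the same answer as working with the original large numbers.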
Median – If you add a value that is greater than the median, the median might increase.
If you add a value that is smaller than the median, the median might decrease.
If you remove a value that is greater than the median, the median might decrease.
If you remove a value that is smaller than the median, the median might increase.
If you add/remove one value that is greater and one that is smaller than the median, the
median stays the same.
Mean – If you add a value that is greater than the mean, the mean increases.
If you take away a value that is less than the mean, the mean increases.
If you add a value that is less than the mean, the mean decreases.
If you take away a value that is greater than the mean, the mean decreases.
If you replace a value in your data with another number that is greater/smaller than the
original, the mean will also change.
Advantages and Disadvantages of Each Average
Mode
o Advantages:
Easy to use
Always a value in the data
Unaffected by extreme values
Can be used with quantitative and qualitative data
o Disadvantages:
There may not be a mode or may be more than one mode.
Cannot be used to calculate measures of spread.
Not always representative of the data – can include extreme values and can be a misleading value far from the mean.
Median
o Advantages:
Easy to find when data is in order
Unaffected by outliers/extreme values
Best to use with skewed data
Can be used to calculate quartiles, IQR and skew.
o Disadvantages:
May not be a data value
Not always representative of the data.
Mean
o Advantages:
Uses all the data
Can be used to calculate standard deviation and skew.
o Disadvantages:
May not be a data value
Always affected by extreme values or outliers.
Can be distorted by open-ended classes.
Measures of Dispersion
Range
How spread out the data is.
The difference between the biggest and smallest values.
For data from tables the largest value is the biggest number from the first column and the smallest value is
the first number from the first column.
Lower Quartile (LQ) – The value ¼ of the way through the data. 25% of the data is less than the LQ.
Upper Quartile (UQ) – The value ¾ of the way through the data. 25% of the data is above the UQ.
Discrete Data
LQ = ¼ (n+1)th value
UQ = ¾ (n+1)th value
Grouped Data
LQ = ¼ nth value
UQ = ¾ nth value
1. Divide the percentiles you need by 100, then multiply that decimal by the total frequency. E.g. for
70th percentile with frequency 80, (70/100)*80 = 56th position.
2. Find the position on the y-axis of your graph and read across to find the corresponding x-values.
3. To find the interpercentile range, subtract the two percentiles that you calculated.
If calculating IPR from a table, carry out linear interpolation as you would when finding median from a
table.
Interdecile Range
The difference between 2 deciles – usually the difference between the first and ninth deciles.
Deciles – divides the data into 10 equal parts.
Discrete Data
Formulae: $\sigma = \sqrt{\frac{1}{n}\sum (x - \bar{x})^2}$ OR $\sigma = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2}$
where $\bar{x}$ = mean, $\sigma$ = standard deviation
Grouped
For grouped frequency tables, follow the same step as for frequency table but use the midpoint for x.
You may need to create an extra column to your table for the midpoint before carrying out the above
steps.
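Both versions of the standard deviation formula give the same answer. The sketch below checks this on a small made-up list of values; for a grouped table you would use the midpoints as x and include the frequencies.

```python
from math import sqrt

# Hypothetical data values.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n  # 40 / 8 = 5.0

# Version 1: square root of the mean of the squared differences from the mean.
sd_1 = sqrt(sum((x - mean) ** 2 for x in data) / n)

# Version 2: square root of (mean of the squares minus the square of the mean).
sd_2 = sqrt(sum(x ** 2 for x in data) / n - (sum(data) / n) ** 2)

print(sd_1, sd_2)  # 2.0 2.0
```

The second version is usually quicker by hand because you only need the totals of x and x².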
Box Plots
Divide the data into sections that each contain approximately 25% of the data in that set.
Represents important features of the data and gives a summary of the spread/skew of the data.
Box Plots include 5 pieces of information about the data:
1. Minimum Value – the lowest score, shown at the far left of the diagram
2. Lower Quartile (LQ) – 25% of data is below this
3. Median – Mark the middle of the data – 50% of the data is
above/below this value
4. Upper Quartile (UQ) – 25% of data is above this value/75% of
data is below it.
5. Maximum Value – The highest score, shown at the far right of
the diagram
Outliers can also be found using the mean and standard deviation – they are values more than 3 SD away
from the mean.
Outliers = values outside $\bar{x} \pm 3\sigma$
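A short sketch of the 3 standard deviation rule, using a made-up data set of 20 values. It assumes the standard deviation is found by dividing by n, which Python calls pstdev.

```python
from statistics import mean, pstdev

# Hypothetical data: mostly values of 9, 10 and 11, plus one value of 40.
data = [9, 10, 11] * 6 + [10, 40]

x_bar = mean(data)    # 11.5
sigma = pstdev(data)  # standard deviation, dividing by n

lower = x_bar - 3 * sigma
upper = x_bar + 3 * sigma

# Outliers are the values outside the interval from lower to upper.
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # [40]
```

With very small data sets no value can be more than 3 SD from the mean, so this rule is only useful when there is a reasonable amount of data.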
Interpreting box plots – Compare median for measure of average and range or IQR for measure of spread.
Remember to compare in context of the question for full marks.
Compare skewness of both box plots.
Skewness
Describes the shape of the distribution and tells you how the data is spread out.
If the data is skewed, it means most of the values are more on one side of the median.
Types of Skew:
Positive Skew – most values are at the beginning of the data set and values towards the end are
more spread out. The majority of the data is low and there are very few higher values. The tail of
the curve goes in the positive direction of the x-axis.
Mean > Median > Mode
Negative Skew - most values are at the end of the data set and values towards the beginning are
more spread out. The majority of the data is higher and there are very few low values. The tail of
the curve points towards the negative direction of the x-axis.
Mean < Median < Mode
Symmetrical – There is no skew. The data is evenly distributed on both sides of the median.
Mean = Median = Mode
Positive Value = positive skew. The larger the value, the larger the skew.
Negative Value = Negative Skew. The smaller the value, the stronger the skew.
Value of 0 = No Skew/Symmetrical.
Comparing Data Sets
Compare using a measure of average (mean/median/mode) and spread (range/IQR/SD) or skewness.
Always make reference to individual values and mention which data set is larger/smaller than the other
clearly.
Always interpret in context – link back to the scenario in the question and labels on axes.
Comparing Averages:
Mean/median/mode for data set A is larger than mean/median/mode data set B so on average data set A
is more … than data set B.
Comparing spread:
Range/IQR/SD for data set A is larger than that of data set B so the ‘results’ of data set A are more spread
out/less consistent than those of data set B.
Data A has a smaller range/IQR/SD than data set B which means the ‘results’ for ‘data set A’ are more
consistent.
Remember lower SD means values are closer to the mean and therefore similar.
Comparing Skew:
Box Plot for data set A is positively skewed so majority of ‘results’ were low with few higher ‘results’.
Box plot for data set A is negatively skewed so majority of ‘results’ were high with few lower ‘results’.
When comparing data make sure to pair the appropriate values of average and spread.
Scatter Diagrams
Used for bivariate data to show if there is a relationship between
two variables.
Explanatory variable (independent – the one that you are
changing) is plotted on the x-axis.
Response Variable (dependent – the one you are measuring) is
plotted on the y-axis.
Plot the points with crosses. Do not join them up.
Correlation
The relationship between two variables.
Causal Relationships
Causation – When one variable causes a change in another.
Correlation shows that there may be a link between two variables. Correlation does not imply causation.
Example:
Causal Relationship – increase in temperature = Increase in ice cream sales
Correlation only – Sales of chocolate and sales of clothes having a positive correlation.
Multiple Factors – In real life situations there are usually multiple factors interacting to cause variables to
change.
Example: A positive correlation between fat in liver and reaction time does not mean one causes the other.
There could be a third variable, such as amount of alcohol consumed, which both variables depend on.
Line of Best Fit (LOBF)
A straight line drawn through the middle of the points so the points are evenly scattered on either side of
the line.
Needs to be a straight line.
Needs to be close to as many points as possible.
Has to go through the mean point.
The closer the points are to the LOBF, the stronger the correlation.
Interpolation – When the LOBF is used to make predictions within the range of data given (you don’t need
to extend your LOBF).
Tends to be reliable provided the LOBF is correct.
Extrapolation – When the LOBF is used to predict values outside of the range of values given (you may
need to extend your LOBF for this).
Not always reliable as trends may change.
Values estimated from extrapolation are less reliable the further they are from the range of data.
Equation of LOBF
LOBF is also known as Regression Line.
𝑬𝒒𝒏 𝒐𝒇 𝑳𝑶𝑩𝑭: 𝒚 = 𝒂𝒙 + 𝒃
Gradient, a – The rate of increase of the response variable in relation to explanatory variable.
Y-intercept, b – The value of the response variable when explanatory variable is 0.
$\text{SRCC}, r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$
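The SRCC formula can be sketched as a short function, where d is the difference between the two ranks for each item and n is the number of items. The ranks below are invented for illustration and have no tied ranks.

```python
def spearman(ranks_a, ranks_b):
    """SRCC = 1 - 6 * (sum of d squared) / (n * (n squared - 1))."""
    n = len(ranks_a)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

# Hypothetical ranks given by two judges to five contestants.
judge_1 = [1, 2, 3, 4, 5]
judge_2 = [2, 1, 4, 3, 5]

r_s = spearman(judge_1, judge_2)  # sum of d squared = 4, so 1 - 24/120
print(round(r_s, 3))  # 0.8
```

A value of 0.8 suggests strong positive correlation: the two judges mostly agree.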
SRCC vs PMCC
Both SRCC and PMCC:
Measure the strength of correlation between 2 variables
Have values between -1 and 1
SRCC:
Tests for linear and non-linear correlation
Best used for data that can be ranked
PMCC:
Tests for linear correlation only
Can be used for data that can’t be ranked as well
If there is a non-linear positive relationship between 2 variables then the SRCC and PMCC will both be
positive but the SRCC will be closer to 1, or -1 for negative relationship.
Chapter 5 - Time Series
Trend Lines
Shows the general trend of the data.
When drawing trend lines ignore fluctuations and
follow the general pattern (is the data going
upwards or downwards)
Trend line may show a rising (upwards) trend,
falling (downwards) trend or level trend.
Moving Averages
An average worked out for a given number of successive observations
A good way to see trends in data with large variations – they smooth out fluctuations and make the trend
line more accurate.
Plot moving averages at the midpoint of the time interval. Do not join them up – use a LOBF.
For 4 point moving averages the midpoint would be halfway between the 2nd and 3rd values.
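A minimal sketch of how 4 point moving averages are worked out. The quarterly sales figures are made up for illustration.

```python
def moving_averages(values, points):
    """Mean of each run of `points` successive values."""
    return [
        sum(values[i:i + points]) / points
        for i in range(len(values) - points + 1)
    ]

# Hypothetical quarterly sales over two years.
sales = [20, 30, 50, 40, 24, 34, 54, 44]

averages = moving_averages(sales, 4)
print(averages)  # [35.0, 36.0, 37.0, 38.0, 39.0]
```

The original values jump up and down each quarter, but the moving averages rise steadily, which shows an upward trend.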
Seasonal Variations
Variations may be:
A general trend – shown by the trend line
Seasonal Variations – a pattern that repeats at a specific point every cycle.
Seasonal Variations in a time series follow a regular time period, like days of the week or seasons.
Consider real life scenarios that may cause these variations – you need to interpret these in context of the
question.
The Seasonal Variation at a point is how much the value varies from the trend.
Calculating the Seasonal Variation at a point: seasonal variation = actual value − trend value.
Predicting Values
The trend line and estimated mean seasonal variations can be used to predict future values.
Simple Probability
Probability is a measure of how likely an event is to
happen.
Probabilities can be written as fractions, decimals or
percentages.
An event is a specific thing that has a probability of happening. Example: rolling an even number.
The expected frequency of an event is the number of times you expect the event to happen. This does not
mean it will actually happen this many times.
Example: P(Heads on a coin) = ½ so if you flip a coin 10 times you would expect the coin to land on heads 5
times. This may not always happen in real life if you try this but is what should happen in theory.
Experimental Probability
In real-life situations the outcomes of all events aren’t equally likely so you have to use the results of previous
trials to predict future probabilities.
Trial – Each experiment that happens.
Risk
Probability of an event occurring for negative events.
Relative frequency can be used to predict bias and assess risk.
For Bias:
A fair coin should land on heads and tails approximately ½ the time each. If the coin is biased it will land on
one side more than the other. You can check this by increasing the number of trials and seeing if the
P(heads) is getting closer to the theoretical probability of ½.
Risk is when collected data is used to predict how likely a negative event is to happen e.g. a house being
flooded or the chance of an 18 year old having a car accident – mostly used by insurance companies to
decide how much to charge you.
2 types of risk:
1. Absolute Risk – how likely an event is to happen. This is just relative frequency.
2. Relative Risk – How much more likely an event is to happen for one group compared to another
group (e.g. comparing the probability of developing lung cancer for smokers and non-smokers).
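The notes do not give a formula for relative risk, so the sketch below assumes the usual one: divide the risk for one group by the risk for the other group. The numbers of people are invented for illustration.

```python
from fractions import Fraction

def absolute_risk(cases, group_size):
    """Absolute risk is the relative frequency of the event in the group."""
    return Fraction(cases, group_size)

# Hypothetical data: 30 out of 200 smokers and 10 out of 400 non-smokers
# develop the illness.
risk_smokers = absolute_risk(30, 200)      # 3/20
risk_non_smokers = absolute_risk(10, 400)  # 1/40

relative_risk = risk_smokers / risk_non_smokers
print(relative_risk)  # 6
```

A relative risk of 6 means that, in this made-up data, smokers are 6 times as likely to develop the illness as non-smokers.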
Venn Diagrams
Uses overlapping circles to represent all the outcomes of two or three events
happening.
Each region of a Venn diagram represents a different set of data.
The whole rectangle represents all the possible outcomes.
Venn diagrams can be used to work out probabilities.
𝑷(𝑨) + 𝑷(𝒏𝒐𝒕 𝑨) = 𝟏
𝑷(𝒏𝒐𝒕 𝑨) = 𝟏 − 𝑷(𝑨)
Addition Law
Also known as the General Addition Law.
Used for events that are not mutually exclusive – events that can happen together.
When two events can happen together and you want to find
the probability of either of them happening, you don’t want to
count the overlap twice – this is the intersection part of the Venn
diagram, P(A∩B). So P(A or B) = P(A) + P(B) − P(A and B).
𝑃(𝐴 ∩ 𝐵) = P(A and B). The intersection/overlap part of the Venn diagram.
𝑃(𝐴 ∪ 𝐵) = P(A or B). On a Venn diagram this is the union of A and B and includes everything in both
circles, including the intersection.
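The addition law, P(A or B) = P(A) + P(B) − P(A and B), can be checked with sets. The example assumes one roll of a fair dice, with events A and B chosen for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair dice
A = {2, 4, 6}                      # event A: an even number
B = {4, 5, 6}                      # event B: a number greater than 3

def prob(event):
    """Probability when every outcome is equally likely."""
    return Fraction(len(event), len(sample_space))

# Addition law: subtract the intersection so it is not counted twice.
p_a_or_b = prob(A) + prob(B) - prob(A & B)

print(p_a_or_b)     # 2/3
print(prob(A | B))  # 2/3
```

Counting the outcomes in the union directly (2, 4, 5 and 6) gives the same answer as the addition law.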
Independent Events
Unconnected Events. The outcome of one event does not affect the outcome of the other event.
Example: Flipping a coin and then rolling a dice. The coin landing on tails will not affect what number the
dice lands on.
For 3 independent events, A, B and C: 𝑷(𝑨 𝒂𝒏𝒅 𝑩 𝒂𝒏𝒅 𝑪) = 𝑷(𝑨) × 𝑷(𝑩) × 𝑷(𝑪)
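A short sketch, using the coin and dice example above, that lists every equally likely outcome and compares the counted probability with the product of the separate probabilities. The variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

coin = ["H", "T"]
dice = [1, 2, 3, 4, 5, 6]
sample_space = list(product(coin, dice))   # 12 equally likely (coin, dice) pairs

p_tails = Fraction(1, 2)
p_six = Fraction(1, 6)

# Count the single outcome (tails, 6) out of the 12 possible pairs.
p_both_counted = Fraction(sample_space.count(("T", 6)), len(sample_space))

print(p_both_counted, p_tails * p_six)  # 1/12 1/12
```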
Conditional Probability
Opposite of independent events.
When one event affects the chances of another event happening.
Example: If there are 2 green and 4 white balls in a bag and you take a white ball the first time and don’t
put it back, this changes the probability of taking a green or white ball the second time. P(white first
time) = 4/6, P(white second time) = 3/5, P(green second time) = 2/5. So the chances of selecting a specific
colour ball the second time depend on which colour was chosen the first time, as choosing white first time
increases the chances of selecting green the second time (from 2/6 to 2/5).
Notation:
𝑷(𝑩|𝑨) = 𝑷(𝑩 𝒈𝒊𝒗𝒆𝒏 𝒕𝒉𝒂𝒕 𝑨 𝒉𝒂𝒑𝒑𝒆𝒏𝒔). The event that happens first comes last in the bracket.
P(B|A) = P(A and B) ÷ P(A)
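A minimal sketch of conditional probability for the bag of 2 green and 4 white balls, found by listing every ordered pair of draws without replacement. It uses the standard definition P(B|A) = P(A and B) ÷ P(A); the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

balls = ["G", "G", "W", "W", "W", "W"]
# Every ordered way of drawing two different balls (identified by position).
draws = list(permutations(range(len(balls)), 2))

white_first = [d for d in draws if balls[d[0]] == "W"]
white_then_green = [d for d in white_first if balls[d[1]] == "G"]

p_white_first = Fraction(len(white_first), len(draws))
p_white_and_green = Fraction(len(white_then_green), len(draws))
p_green_given_white = p_white_and_green / p_white_first

print(p_white_first, p_green_given_white)  # 2/3 2/5
```

After a white ball is removed, 5 balls remain and 2 of them are green, which agrees with the 2/5 found here.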
Index Number = (Price ÷ Base Year Price) × 100
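A quick sketch of the index number calculation. The prices are hypothetical and the function name is only for illustration.

```python
def index_number(price, base_year_price):
    """Price expressed as a percentage of the base year price."""
    return price / base_year_price * 100

# Hypothetical prices in pence: 120p in the base year, 150p now.
print(index_number(150, 120))  # 125.0
```

An index number of 125 would mean the price is 25% higher than in the base year.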
Consumer Price Index (CPI) – Official measure of inflation used by the UK Government.
It is similar to RPI but does not include mortgage payments.
Pensions and benefits in the UK are updated each year in line with the CPI.
CPI is weighted to reflect the importance of different items in the average shopping basket. The weightings
change each year to reflect consumer spending.
Gross Domestic Product (GDP) – Value of goods and services produced in a country in a given amount of
time.
If the GDP falls in two (or more) successive quarters the economy is in recession.
Weighted Index Numbers – Takes into account proportions (similar to weighted mean).
Weightings reflect the importance of different items.
Chain Base Index Number = (Price ÷ Last Year’s Price) × 100
RPI and CPI are chain base index numbers and show how values change monthly or annually.
CPI is published monthly by ONS.
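A small sketch of chain base index numbers, where each year's price is compared with the price the year before. The yearly prices are made up for illustration.

```python
def chain_base_index(price, last_years_price):
    """This year's price as a percentage of last year's price."""
    return price / last_years_price * 100

# Hypothetical prices for three consecutive years.
prices = [200, 250, 375]

# Pair each year with the year before it: (200, 250), then (250, 375).
chain_base = [chain_base_index(now, before) for before, now in zip(prices, prices[1:])]

print(chain_base)  # [125.0, 150.0]
```

Here the price rose by 25% in the second year and by 50% in the third year, each measured against the previous year.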
Rates of Change
Crude Rates tell you how things change in every 1000 – usually births, deaths, marriages or
unemployment.
They need to be recorded to make plans for the future e.g. high birth rate means more schools will be
required.
Crude Rate – how many times a particular event occurs per 1000 of the population in a given time.
Crude Birth Rate – Number of births per 1000 of the population.
Crude Death Rate – Number of deaths per 1000 of the population.
Crude Rate = (Number of Births/Deaths ÷ Total Population) × 1000
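A minimal sketch of the crude rate calculation, using invented figures for one town in one year.

```python
def crude_rate(events, population):
    """Events (e.g. births or deaths) per 1000 of the population.

    Written as 1000 * events / population, which is the same as
    (events ÷ population) × 1000.
    """
    return 1000 * events / population

# Hypothetical town: 450 births in a year, population 30 000.
print(crude_rate(450, 30000))  # 15.0
```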
Crude rates can be misleading when used for comparing against another area which has a different
distribution of ages.
Standard Population – a hypothetical population of 1000 people used to represent the whole population,
taking into account the number of people in each age/gender/income group.
Standardised Rate – Allows you to compare the same age group in different populations by using the
standard population – allows for more realistic comparisons.
Standardised Rate = (Crude Rate ÷ 1000) × Standard Population
To find the standardised rate for the entire population, add up the rates for each group.
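A sketch of how the standardised rate could be built up group by group and then added. All figures are hypothetical: a town with many older residents and a standard population of 1000 split 500/400/100 across three age groups.

```python
def crude_rate(events, population):
    """Events per 1000 of the population."""
    return 1000 * events / population

def standardised_rate(crude, standard_population):
    """(Crude rate ÷ 1000) × standard population for one group."""
    return crude * standard_population / 1000

# Hypothetical data: (deaths, town population, standard population) per age group.
age_groups = {
    "0-39": (4, 2000, 500),
    "40-69": (30, 3000, 400),
    "70+": (250, 5000, 100),
}

total_standardised = sum(
    standardised_rate(crude_rate(deaths, pop), std_pop)
    for deaths, pop, std_pop in age_groups.values()
)
overall_crude = crude_rate(284, 10000)   # all deaths ÷ whole town population

print(total_standardised)  # 10.0
print(overall_crude)       # 28.4
```

In this made-up town the crude death rate (28.4) is much higher than the standardised rate (10.0) because half the town is aged 70+, which is why standardised rates give fairer comparisons between areas.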
Chapter 8 - Probability Distributions
A probability distribution is a list of all the possible outcomes together with their expected probabilities.
Example:
When flipping a fair coin, the probability distribution, for x outcomes, would be:
x       Heads   Tails
P(x)    ½       ½
Binomial Distributions
A type of probability distribution where there are only two possible outcomes.
Examples:
Event = flipping a coin, Outcomes = heads or tails
Event = Rolling a six on a dice, Outcomes = Success (if it lands on 6) or Failure (if it does not land on 6).
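An event can be modelled using the binomial distribution if all 4 of these conditions are met:
1. There is a fixed number of trials.
2. Each trial has only two possible outcomes (success or failure).
3. The trials are independent of each other.
4. The probability of success is the same for every trial.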
These conditions can also be used to explain if the binomial distribution is a suitable model – In an exam
question you would need to show if the event described in the question meets each of these conditions. If
all 4 are met then the binomial distribution can be used otherwise not.
Finding Probabilities using the Binomial Distribution: Use (p + q)ⁿ to find the probabilities.
1. Identify the 2 outcomes and their probabilities.
2. Expand (p + q)ⁿ where n is the number of trials. Leave p and q as letters for now.
3. To find the probability of x successes, find the term that has p to the power of x. E.g. for 3
successes find the term that has p³.
4. Substitute the values of p and q (the probabilities of success and failure) into that term and
calculate e.g. for 5 trials and 3 successes you would use the 10p³q² term. For the event rolling a six
on a dice p=1/6 and q=5/6. So your probability of landing on 6 exactly 3 times in 5 rolls would be
10 × (1/6)³ × (5/6)².
Finding the Probabilities/Coefficients:
1. Pascal’s Triangle – The coefficients of a binomial distribution follow the pattern of Pascal’s triangle.
It starts with 1 in row 0 and has 1s down both sides. The other numbers are found by adding the 2 numbers
directly above.
Memorise this triangle or how to work out the values so you don’t have to expand (p + q)ⁿ.
Example: For (p + q)⁴, the expansion would be: p⁴ + 4p³q + 6p²q² + 4pq³ + q⁴
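A short sketch of how the rows of Pascal's triangle can be generated by adding the two numbers directly above, as described. The function name is hypothetical.

```python
def pascal_row(n):
    """Return row n of Pascal's triangle (row 0 is [1])."""
    row = [1]
    for _ in range(n):
        # Each inner number is the sum of the two numbers above it; 1s go on both ends.
        row = [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]
    return row

print(pascal_row(4))  # [1, 4, 6, 4, 1]
print(pascal_row(5))  # [1, 5, 10, 10, 5, 1]
```

Row 4 gives the coefficients for 4 trials and row 5 gives those for 5 trials, including the 10 used for 3 successes in 5 trials.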
To find a range of probabilities, work out their individual probabilities and then add them up.
Example: For P(3 or more successes in 5 trials), work out the probabilities for 3, 4 and 5 successes using the
above methods and add up the answers.
For questions that ask for the probability of ‘at least 1 success’ work out the probability of 0 successes and
subtract the answer from 1 rather than working out all the individual probabilities.
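The calculations above can be sketched with exact fractions for the event of rolling a six (p = 1/6, q = 5/6) in 5 trials. The function name is an assumption for illustration; `math.comb` supplies the Pascal's triangle coefficient.

```python
from fractions import Fraction
from math import comb

def binomial_probability(n, x, p):
    """P(exactly x successes in n trials): coefficient × p^x × q^(n - x)."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

p_six = Fraction(1, 6)

exactly_three = binomial_probability(5, 3, p_six)                            # 10 p³ q²
three_or_more = sum(binomial_probability(5, x, p_six) for x in (3, 4, 5))    # add the terms
at_least_one = 1 - binomial_probability(5, 0, p_six)                         # 1 − P(0 successes)

print(exactly_three)  # 125/3888
print(three_or_more)  # 23/648
print(at_least_one)   # 4651/7776
```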
The mean (or expected value) of the binomial distribution B(n, p) is np.
Example: for B(6, ½) the mean is 6 × ½ = 3. This is the expected number of times the event would happen
in n trials. In the above example, if you flipped a fair coin 6 times you would expect it to land on tails an
average of 3 times.
Normal Distributions
Drawn as a smooth, bell-shaped curve.
It is a common model for real-life situations such as weights of apples or marks in an exam.
Most of the data is in the middle with similar values and there are fewer values at either end.
A larger standard deviation (SD) will result in a lower, wider curve and a smaller SD will give a curve with
a higher maximum height.
Notation: N(μ, σ²) where μ = mean and σ² = variance (the square of the standard deviation; σ = standard
deviation, SD).
The curve is symmetrical about the mean, so half the area lies on either side of it.
For 1 SD, 34% lies between μ and μ + σ and 34% between μ and μ − σ (about 68% in total).
For 2 SD, 47.5% lies between μ and μ + 2σ and 47.5% between μ and μ − 2σ (about 95% in total).
For 3 SD, 49.9% lies between μ and μ + 3σ and 49.9% between μ and μ − 3σ (about 99.8% in total).
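A quick check of these areas using Python's standard library `statistics.NormalDist`. The percentages quoted in the notes are rounded values for exam use, so the computed figures differ slightly (for example about 47.7% rather than 47.5% for 2 SD).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_mean_to(k):
    """Area under the curve between the mean and k standard deviations above it."""
    return standard_normal.cdf(k) - standard_normal.cdf(0)

for k in (1, 2, 3):
    one_side = area_mean_to(k) * 100
    print(f"{k} SD: {one_side:.1f}% on each side, {2 * one_side:.1f}% in total")
```

Because the curve is symmetrical, the area between the mean and k SD below it is the same as the area computed here.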
Standardised Scores
Used to compare 2 samples of data to see how far above or below the average individual values are.
Standardised scores tell you how many standard deviations away from the mean the data values are.
Standardised Score = (Score − Mean) ÷ Standard Deviation
It is useful to compare individuals’ performance in exams against the whole class for 2 different subjects.
Example:
English: Mean=60 SD=5 Mark=54
Maths: Mean=70 SD=8 Mark=65
SS (English) = (54-60)/5 = -1.2
SS (Maths) = (65-70)/8 = -0.625
This person did better in Maths in terms of actual mark and also compared to the rest of the class, because
their standardised score is higher (closer to the mean).
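The example above can be reproduced with a small function (the function name is illustrative):

```python
def standardised_score(score, mean, sd):
    """Number of standard deviations the score lies above (+) or below (-) the mean."""
    return (score - mean) / sd

english = standardised_score(54, mean=60, sd=5)
maths = standardised_score(65, mean=70, sd=8)

print(english)  # -1.2
print(maths)    # -0.625
print("Better relative to the class:", "Maths" if maths > english else "English")
```

Both scores are negative, so both marks are below the class mean, but the Maths mark is fewer standard deviations below it.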
Quality Assurance
Involves checking samples to make sure products are all of the same quality and standard.
It is about ensuring samples selected for checking quality are as close as possible to the target value so that
products are all of a similar quality.
How it works:
1. Regular samples taken (the sampling technique used by manufacturers will vary)
2. Sample mean, median and range calculated.
3. These are plotted on control charts to see how far they are from the value you’d expect them to be
if the manufacturing process was working correctly.
Control Chart – A time series chart used for quality assurance. It has 5 lines:
Target Value – this is the middle line – you want the sample values that you plot to be close to this line.
Upper and Lower Warning Lines (Inner 2 lines) – These are 2 SD above and below the target value. About 95%
of values lie within 2 SD, so only about 5% of sample averages or ranges should fall outside these lines. If a
sample average/range plotted is above/below a warning line, another sample is taken and checked to see if
there is a problem, and production is stopped if there is.
Upper and Lower Action Limits (Outer 2 lines) – These are 3 SD above and below the target value. Almost all
of the sample averages/ranges should fall within these lines. If a sample average/range is outside these
lines, production is stopped immediately and the machinery is reset.
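A minimal sketch of how a plotted sample mean could be checked against the warning and action lines. The target value, standard deviation and function name are all hypothetical.

```python
def check_sample(sample_mean, target, sd):
    """Classify a sample mean using lines at 2 SD (warning) and 3 SD (action) from the target."""
    distance = abs(sample_mean - target)
    if distance > 3 * sd:
        return "action: stop production and reset machinery"
    if distance > 2 * sd:
        return "warning: take another sample"
    return "ok"

# Hypothetical process: target 500 g with a standard deviation of 2 g,
# so warning lines are at 496 g and 504 g, action lines at 494 g and 506 g.
for mean in (501, 504.5, 507):
    print(mean, check_sample(mean, target=500, sd=2))
```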