2. Descriptive Statistics
MPA
Stephan Dietrich
s.dietrich@maastrichtuniversity.nl
14 Sept Tutorial Data visualisation in Stata
15 Sept Optional practice Voluntary Stata exercise (video solution published on Monday morning)
Week 3
18 Sept Lecture Constructing research frameworks, hypotheses and variables
19 Sept Tutorial Using research frameworks
20 Sept Lecture Working with quantitative data
21 Sept Tutorial Summary statistics
22 Sept Optional practice Voluntary Stata exercise (video solution published on Monday morning)
Week 4
25 Sept Lecture Research methods, sampling and quality
Week 5
2 Oct Lecture Estimating population means
6 Oct Optional practice Voluntary Stata exercise (video solution published on Monday morning)
Week 6
9 Oct Lecture Hypothesis testing
Recap
Lecture 3: Descriptive statistics
[Diagram: Descriptive Statistics, split into Central Tendency and Spread]
Mean
• The most well-known indicator to describe data is the (arithmetic) mean
• We add up values of a variable and divide the sum by the number of observations
$\bar{x}$ = mean value of $x$:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
Example:
Respondent   1   2   3   4   5
Variable     2   1   2   4   6
$$\bar{x} = \frac{2 + 1 + 2 + 4 + 6}{5} = 3$$
[Figure: value of the variable for each respondent]
Application: all observations measured in same unit
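The calculation above can be reproduced in a few lines. This is a minimal sketch in Python (the course itself works in Stata); the name `values` is made up for illustration.

```python
from statistics import mean

# Values reported by respondents 1 to 5 in the example above
values = [2, 1, 2, 4, 6]

# Add up the values and divide the sum by the number of observations
manual_mean = sum(values) / len(values)

# The standard library gives the same result
library_mean = mean(values)

print(manual_mean, library_mean)
```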
Median
• The median is the middle value when the data is ordered
• At least half of the values are greater than or equal to the median, and at least half are less than or equal to it
• The median is less sensitive to outliers than the mean
Original order:
Respondent   1   2   3   4   5
Value        2   1   2   4   6
Ordered by value:
Respondent   2   1   3   4   5
Value        1   2   2   4   6
Respondent 3 is in the middle (2 respondents reported smaller or equal values and 2 reported larger values) → median value is 2
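As a rough illustration of the ordering step and of why the median is less sensitive to outliers, here is a short Python sketch. The data are the five values from the example; the extreme value 600 is a made-up number used only to show the effect.

```python
from statistics import mean, median

# Respondent -> reported value, as in the example above
values = {1: 2, 2: 1, 3: 2, 4: 4, 5: 6}

# Order the observations by value
ordered = sorted(values.values())

# With an odd number of observations the median is the middle element
manual_median = ordered[len(ordered) // 2]

# Replace the largest value with an extreme one (hypothetical):
# the mean moves a lot, the median does not move at all
with_outlier = ordered[:-1] + [600]

print(manual_median, median(with_outlier), mean(with_outlier))
```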
Percentiles
• The median is the middle value
• We can also select other points in the distribution than the middle value
• A percentile indicates the value below which a given percentage of the observations in our data falls
Example:
Respondent 1 3 2 4 5
Variable NL NL SK IT DK
• These indicators describe central tendency, but not the distribution of values
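To make the idea of a percentile concrete, the sketch below uses the simple "nearest-rank" definition in Python. This is only one of several definitions in use; statistical software may interpolate between observations and return slightly different numbers. The function name and the income values are hypothetical.

```python
import math

def percentile_nearest_rank(data, p):
    """Smallest observed value such that at least p percent of the
    observations are less than or equal to it (nearest-rank definition)."""
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(data)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the ordered data
    return ordered[rank - 1]

incomes = [10, 7, 5, 8, 25, 12, 9, 11, 6, 14]  # hypothetical values

p25 = percentile_nearest_rank(incomes, 25)
p50 = percentile_nearest_rank(incomes, 50)
p90 = percentile_nearest_rank(incomes, 90)
print(p25, p50, p90)
```

With an even number of observations the nearest-rank 50th percentile (9 here) differs from the conventional median, which averages the two middle values (9.5 here).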
Deviation from the mean
• The mean does not indicate how spread out the values are
• The same mean can come from very different distributions
[Figure: two panels, Sample 1 and Sample 2, each with the mean marked at 5]
Same mean, but values in sample 1 are much more dispersed than values in sample 2!
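The point can be checked numerically. The two samples below are hypothetical (the values behind the figure are not given); both have a mean of 5 but very different deviations from it.

```python
from statistics import mean

# Hypothetical samples, both with a mean of 5
sample_1 = [1, 3, 5, 7, 9]
sample_2 = [4, 5, 5, 5, 6]

def deviations_from_mean(data):
    """Distance of each observation from the sample mean."""
    m = mean(data)
    return [x - m for x in data]

dev_1 = deviations_from_mean(sample_1)
dev_2 = deviations_from_mean(sample_2)

# Raw deviations always sum to zero, so they cannot summarise spread
# on their own; squaring them first avoids the cancellation.
print(dev_1, sum(dev_1))
print(dev_2, sum(dev_2))
```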
Deviation from the mean
• Otherwise, the distribution is skewed
Example
Suppose we ask 5 people for their income:
[Figure: bar chart of income for respondents P1 to P5, with the mean marked]
$$\text{Variance} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$$
Respondent               1    2    3    4    5
Response                10    7    5    8   25
$\bar{x}$               11   11   11   11   11
$(x_i - \bar{x})^2$      1   16   36    9  196
Variance: 258/5 = 51.6
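The table can be reproduced step by step. The sketch below is in Python rather than Stata and follows the formula on this slide, which divides by N; many packages report the sample variance instead, which divides by N − 1 and would give 64.5 for these data.

```python
from statistics import pvariance

incomes = [10, 7, 5, 8, 25]  # responses of respondents 1 to 5

n = len(incomes)
x_bar = sum(incomes) / n                                   # mean: 11
squared_deviations = [(x - x_bar) ** 2 for x in incomes]   # 1, 16, 36, 9, 196
variance = sum(squared_deviations) / n                     # 258 / 5

# pvariance is the standard-library version of the same 1/N formula
print(x_bar, squared_deviations, variance, pvariance(incomes))
```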
Standard Deviation
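• The variance is scale variant and is expressed in squared units
→ The variance changes if we measure income in another currency
• The standard deviation is the square root of the variance, so it is expressed in the same units as the variable (it still changes with the unit of measurement, but only proportionally)
• It is the "standard" way of measuring what is a normal deviation from the mean, and what is a large deviation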
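How the two measures react to a change of unit can be seen with the income example. In this Python sketch the conversion factor of 100 (euros to cents) is an assumption chosen for illustration; any other exchange rate would behave the same way.

```python
from math import sqrt
from statistics import pstdev, pvariance

incomes = [10, 7, 5, 8, 25]   # income example from the previous slide
rate = 100                    # hypothetical conversion factor (e.g. euros to cents)
incomes_converted = [x * rate for x in incomes]

var_original = pvariance(incomes)             # 51.6, in squared units
sd_original = sqrt(var_original)              # about 7.18, in the original units

var_converted = pvariance(incomes_converted)  # grows by rate ** 2
sd_converted = pstdev(incomes_converted)      # grows by rate

print(var_original, sd_original)
print(var_converted, sd_converted)
```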
Standard Deviation – Outlier Classification
Outlier:
• Data organization includes data cleaning
• Large outliers can bias our analysis
How can we identify outliers?
1. Rule out implausible responses
→ e.g. 40 working hours per day
2. Trimming
→ e.g. remove 1% of observations at the top and bottom of a variable
3. Use rule-of-thumb outlier classification
→ e.g. outlier if an observation is more than 3 standard deviations away from the mean
[Figure: bar chart of income for respondents P1 to P5, with the mean marked]
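The rule of thumb in step 3 could be sketched as follows. This is a Python illustration, not the course's Stata workflow; the function name `flag_outliers` and the 20 income values are hypothetical, and the standard deviation uses the 1/N formula from the variance slide.

```python
from statistics import mean, pstdev

def flag_outliers(data, threshold=3.0):
    """Return the observations lying more than `threshold` standard
    deviations away from the mean."""
    m = mean(data)
    sd = pstdev(data)
    if sd == 0:
        return []  # all values identical, nothing can be flagged
    return [x for x in data if abs(x - m) > threshold * sd]

# Hypothetical incomes: 19 ordinary values and one extreme value
incomes = [10, 7, 5, 8, 9, 11, 6, 12, 8, 10,
           9, 7, 11, 8, 10, 9, 6, 12, 10, 250]
outliers = flag_outliers(incomes)

# The five-person example from the slides
small_sample = flag_outliers([10, 7, 5, 8, 25])

print(outliers, small_sample)
```

In the five-person example the income of 25 lies only about 1.9 standard deviations above the mean, so this rule does not flag it. In very small samples a single extreme value inflates the standard deviation so much that the 3-standard-deviation threshold is hard to reach.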
Box Plots
• Visualizes statistical indicators (median, 25th and 75th percentiles)
* Box plot of age (d11) for each life satisfaction category (d70)
graph box d11, over(d70) intensity(25) ytitle("AGE") title("Age by Life Satisfaction Category, 2019")
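The indicators that the box summarises can also be computed directly. The Python sketch below uses made-up ages (not the course data set) and the "inclusive" quartile method of the standard library; other software may use a slightly different quartile definition.

```python
from statistics import quantiles

ages = [19, 23, 25, 31, 38, 42, 47, 55, 68]  # hypothetical ages

# Three cut points that split the ordered data into four equal parts:
# 25th percentile (bottom of the box), median (line inside the box),
# 75th percentile (top of the box)
q1, q2, q3 = quantiles(ages, n=4, method="inclusive")

# Height of the box: the interquartile range
iqr = q3 - q1

print(q1, q2, q3, iqr)
```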
Example Box Plot
Covariance
• Covariance describes the extent to which two variables move in the same direction
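The slides do not show the covariance formula, so the sketch below assumes the usual population version, $s_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})$, which matches the 1/N convention used for the variance. It is written in Python, and the variables `hours` and `score` are invented for illustration.

```python
def covariance(x, y):
    """Population covariance: average product of the deviations
    of x and y from their respective means."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

hours = [1, 2, 3, 4, 5]            # hypothetical study hours
score = [2, 4, 5, 4, 5]            # tends to rise with hours
score_reversed = [5, 4, 5, 4, 2]   # tends to fall with hours

cov_positive = covariance(hours, score)           # > 0: move in the same direction
cov_negative = covariance(hours, score_reversed)  # < 0: move in opposite directions

print(cov_positive, cov_negative)
```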
Correlation Coefficient
• Covariance is scale variant
• Correlation ranges between -1 (perfect negative correlation) and 1 (perfect positive correlation)
$$r = \frac{s_{xy}}{s_x s_y}$$
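A small Python sketch of the formula, assuming that $s_{xy}$ is the covariance and $s_x$, $s_y$ are the standard deviations, all computed with the 1/N convention. The data and the helper names are hypothetical. Rescaling one variable (hours to minutes) changes the covariance but leaves the correlation unchanged.

```python
from math import sqrt

def pop_cov(x, y):
    """Population covariance of two equally long lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

def correlation(x, y):
    """r = s_xy / (s_x * s_y)."""
    s_xy = pop_cov(x, y)
    s_x = sqrt(pop_cov(x, x))  # the covariance of x with itself is its variance
    s_y = sqrt(pop_cov(y, y))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return s_xy / (s_x * s_y)

hours = [1, 2, 3, 4, 5]            # hypothetical study hours
score = [2, 4, 5, 4, 5]            # hypothetical test scores
minutes = [h * 60 for h in hours]  # same information, different unit

r_hours = correlation(hours, score)
r_minutes = correlation(minutes, score)

print(pop_cov(hours, score), pop_cov(minutes, score))  # covariance depends on the unit
print(r_hours, r_minutes)                              # correlation does not
```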
Correlation Coefficient
[Figure: Global economic growth and CO2 emissions, 1961-2015]
Is the mean of LS greater than the median?
The data set comprises 24,780 observations. If we
had fewer observations (say 200 fewer), would the
standard deviation of LS be larger or smaller?
Questions?