Week 01
Week 1: Introduction to
Statistics
Dealing with Uncertainty
Consider:
The price of IBM stock will be higher in six months
than it is now.
What does “Statistics” mean?
1. Numerical data
“According to statistics, this year’s exports have set a record!”
“We need to collect statistics for the productivity of this business.”
Descriptive and Inferential Statistics
Collect data
e.g., Survey
Present data
e.g., Tables and graphs
Summarize data
e.g., Sample mean = $\frac{\sum X_i}{n}$
Inferential Statistics
Estimation
e.g., Estimate the population
mean weight using the sample
mean weight
Hypothesis testing
e.g., Test the claim that the
population mean weight is 120
pounds
From data to knowledge:
Begin here: identify the problem
Data
Information (from data, using descriptive statistics, probability, computers)
Knowledge (from information, using experience, theory, literature, inferential statistics, computers)
Key Definitions
Population vs. Sample
[Figure: a population of items labeled a through z, and a sample drawn from it: b, c, g, i, n, o, r, u, y]
Why “Sampling”?
Less time consuming than a census
Process of Statistical Data Analysis
Population
Draw a random sample from the population
Describe the sample with sample statistics
Make inferences about the population from the sample statistics
Data Types
Data
  Qualitative (Categorical): defined categories
    Examples: Marital Status, Political Party, Eye Color
  Quantitative (Numerical)
    Discrete: counted items
      Examples: Number of Children, Defects per hour
    Continuous: measured characteristics
      Examples: Weight, Voltage
Data Types
Cross-Section Data
Measurement Levels
Quantitative Data
  Ratio Data: differences between measurements, true zero exists
  Interval Data: differences between measurements but no true zero
Qualitative Data
  Ordinal Data: ordered categories (rankings, order, or scaling)
  Nominal Data: categories (no ordering or direction)
Measurement Levels-EXAMPLES
Nominal: sex, eye-colour
Percentages, frequency, mode (most frequent value)
Measurement Levels-EXERCISE
Occupation, City
Education
Price
Likeness
Descriptive statistics
Compute and interpret statistics describing the
location of a set of values, such as the mean
and median.
Compute and interpret statistics describing the
variability in a set of values, such as the range
and standard deviation.
Compute and interpret the measures of shape,
skewness and kurtosis.
Produce graphical displays of data.
Some Frequently Used Statistics and
Parameters
                     SAMPLE STATISTICS   POPULATION PARAMETERS
MEAN                 $\bar{x}$           $\mu$
VARIANCE             $s^2$               $\sigma^2$
STANDARD DEVIATION   $s$                 $\sigma$
PROPORTION           $\hat{p}$           $p$
Measure of Location
Descriptive statistics that locate the center
of your data are called measures of
central tendency
Sample Mean
The sample mean of a set of n
measurements (x1, x2,…xn) is equal
to the sum of the measurements
divided by n.
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$
Measure of Location
Median
Median: the “middle” value (also known as the 50th percentile)
The median of a set of n measurements ($x_1, x_2, \dots, x_n$) is the value that falls in the middle position when the measurements are ordered from the smallest to the largest.
With the measurements in ascending order:
$\tilde{x} = x_{(n+1)/2}$ if n is odd
$\tilde{x} = \frac{x_{n/2} + x_{n/2+1}}{2}$ if n is even
1 3 3 4 5 8 51: three values on each side of the middle value, so Median = 4
1 3 3 4 5 8: three values on each side of the middle pair, so Median = (3+4)/2 = 3.5
Example
A random sample of six values was taken from a population. These values were:
$x_1 = 7,\ x_2 = 1,\ x_3 = 10,\ x_4 = 8,\ x_5 = 4,\ x_6 = 12$
Example (con’t)
$\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6}{n} = \frac{7 + 1 + 10 + 8 + 4 + 12}{6} = 7$
Ordered sample: 1, 4, 7, 8, 10, 12
MEDIAN = ( 7 + 8 ) / 2 = 7.5
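The arithmetic in this example can be checked with a short Python sketch using the standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median

# The six sample values from the example.
sample = [7, 1, 10, 8, 4, 12]

sample_mean = mean(sample)      # sum of the values divided by n
sample_median = median(sample)  # n is even, so the two middle values are averaged

print(sorted(sample))   # [1, 4, 7, 8, 10, 12]
print(sample_mean)      # 7
print(sample_median)    # 7.5
```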
Example
Example (con’t)
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 43.15$
$\tilde{x} = \frac{45 + 46}{2} = 45.5$ (the median)
Example (con’t)
Why?
Because there is an outlier (extreme value), 4, in the data set, the mean is heavily influenced by this single outlier.
Solution:
Trimmed mean: drop the outlier and recalculate the mean.
$\bar{x}_{trim} = \frac{\sum_{i=1}^{n} x_i - 4}{n - 1} = 45.21$
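The data set for this example is not reproduced here, so the sketch below uses hypothetical values (an assumption, not the slide data) to show how one low outlier pulls the mean down and how dropping it changes the result.

```python
from statistics import mean

# Hypothetical data (NOT the data set from the slides): one low outlier, 4.
values = [44, 45, 46, 47, 48, 4]
outlier = 4

plain_mean = mean(values)

# Drop the outlier (it occurs once here) and average the remaining n - 1
# values; this equals (sum(values) - outlier) / (n - 1).
trimmed = [v for v in values if v != outlier]
trimmed_mean = mean(trimmed)

print(plain_mean)    # 39
print(trimmed_mean)  # 46
```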
Mode
A measure of location
The value that occurs most often
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode
There may be several modes
[Figure: two dot plots. Left, on an axis from 0 to 14: Mode = 5. Right, on an axis from 0 to 6: No Mode]
Mode (con’t)
What is the mode for the previous example
(slide 12)?
44 (occurs twice)
49 (occurs twice)
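As a rough illustration of the three cases (one mode, several modes, no mode), the sketch below uses `statistics.multimode` from the Python standard library (Python 3.8 or later). The data sets are hypothetical and chosen only to show each case.

```python
from statistics import multimode

# Hypothetical data sets, chosen only to illustrate the three cases.
one_mode = [1, 3, 5, 5, 5, 7, 9]
two_modes = [44, 44, 45, 46, 49, 49]
all_distinct = [1, 2, 3, 4]

print(multimode(one_mode))      # [5]
print(multimode(two_modes))     # [44, 49]
print(multimode(all_distinct))  # [1, 2, 3, 4]
```

When every value occurs exactly once, `multimode` returns all of the values; in the terminology used here, that situation is described as "no mode".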
Distributions
When you examine the distribution of values,
you can determine
the range of possible data values
the frequency of data values
whether the data values accumulate in the
middle of the distribution or at one end.
The relative positions of the mean, median, and mode are related to the shape of the distribution.
Measures of Central Tendency:
Shape of a Distribution
Describes how data is distributed
Symmetric or skewed
The relations below apply to many unimodal distributions (not to multimodal ones)
Left-Skewed: Mean < Median < Mode (longer tail extends to left)
Symmetric: Mode = Mean = Median
Right-Skewed: Mode < Median < Mean (longer tail extends to right)
Percentiles and Quartiles
Ordered data: 98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42
75th Percentile = 91 (third quartile)
50th Percentile = 80 (median)
25th Percentile = 59 (first quartile)
Quartiles break your data up into quarters.
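The quartile values 59, 80, and 91 are consistent with taking the medians of the lower and upper halves of the ordered data. The sketch below assumes that rule; the helper name `quartiles_by_halves` is hypothetical, and other quartile conventions (including the defaults of `statistics.quantiles`) can give different numbers for the same data.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves of
    the ordered data; the middle value is left out of both halves when n is
    odd. Requires at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return median(lower), median(ordered), median(upper)

data = [98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42]
q1, q2, q3 = quartiles_by_halves(data)
print(q1, q2, q3)  # 59.0 80.0 91.0
print(q3 - q1)     # 32.0, the interquartile range
```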
Weighted Mean
Used when values are grouped by frequency
or relative importance
Example: Sample of 26 Repair Projects

Days to Complete   Frequency
5                  4
6                  12
7                  8
8                  2

Weighted Mean Days to Complete:
$\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31$ days
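The same weighted mean can be computed in a few lines of Python. This is a minimal sketch in which the frequencies act as the weights; the dictionary name is illustrative.

```python
# Days to complete -> number of repair projects (frequency used as weight).
frequency_by_days = {5: 4, 6: 12, 7: 8, 8: 2}

total_weight = sum(frequency_by_days.values())
weighted_sum = sum(days * w for days, w in frequency_by_days.items())
weighted_mean = weighted_sum / total_weight

print(total_weight, weighted_sum)  # 26 164
print(round(weighted_mean, 2))     # 6.31
```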
Measures of Variation
Same center,
different variation
The Spread of a Distribution:
Variation
Measure: Definition
range: the difference between the maximum and minimum data values
interquartile range (IR or IQR): the difference between the 25th and 75th percentiles
variance: a measure of dispersion of the data around the mean
standard deviation: a measure of dispersion expressed in the same units of measurement as your data (the square root of the variance)
coefficient of variation: standard deviation as a percentage of the mean
Range
Simplest measure of variation
Difference between the largest and the smallest observations:
Range = $X_{Max} - X_{Min}$
Example (dot plot on an axis from 0 to 14; smallest observation 1, largest observation 14):
Range = 14 - 1 = 13
Disadvantages of the Range
Ignores the way in which data are distributed
7 8 9 10 11 12 7 8 9 10 11 12
Range = 12 - 7 = 5 Range = 12 - 7 = 5
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
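A small Python sketch makes the sensitivity to outliers concrete: the two data sets differ only in their last value, yet the range jumps from 4 to 119. The function name `data_range` is illustrative.

```python
def data_range(values):
    """Range: largest observation minus smallest observation."""
    return max(values) - min(values)

# The 24 values shared by both data sets in the example.
base = [1] * 11 + [2] * 8 + [3] * 4 + [4]

without_outlier = base + [5]
with_outlier = base + [120]

print(data_range(without_outlier))  # 4
print(data_range(with_outlier))     # 119
```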
Interquartile Range
Variance and Standard Deviation
•The variance is a measure of variation ($\sigma^2$ or $s^2$).
•The square root of the variance, or standard deviation ($\sigma$ or $s$), is a measure of variation in terms of the original linear scale (most commonly used).
$\sigma = \sqrt{\sigma^2}$ is the population standard deviation
Measures of Variability (Population)
Population Range
$X_{Max} - X_{Min}$
Population Variance
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2$
$\sigma = \sqrt{\sigma^2}$
PROOF
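The body of this slide did not survive. One standard derivation of the shortcut form of the population variance, assuming only the definition $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, is sketched below; it may differ in presentation from the original proof.

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2
          = \frac{1}{N}\sum_{i=1}^{N}\left(x_i^2 - 2\mu x_i + \mu^2\right) \\
         &= \frac{1}{N}\sum_{i=1}^{N}x_i^2
            - 2\mu\cdot\frac{1}{N}\sum_{i=1}^{N}x_i + \mu^2 \\
         &= \frac{1}{N}\sum_{i=1}^{N}x_i^2 - 2\mu^2 + \mu^2
          = \frac{\sum_{i=1}^{N}x_i^2}{N} - \mu^2
\end{aligned}
```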
Measures of Variability (Sample)
Sample Range
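$X_{Max} - X_{Min}$
Sample Variance
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$
$s = \sqrt{s^2}$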
Measures of Variability (Sample)
Obs.   $x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$   $x_i^2$
1      7       0                 0                     49
2      1       -6                36                    1
3      10      3                 9                     100
4      8       1                 1                     64
5      4       -3                9                     16
6      12      5                 25                    144
Sum    42      0                 80                    374
Sample Variance
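$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{80}{5} = 16$
$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{374 - \frac{42^2}{6}}{5} = \frac{80}{5} = 16$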
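Both forms of the sample variance can be verified numerically for the six-value sample. The sketch below computes each form by hand and compares the result with `statistics.variance`, which also divides by n - 1; variable names are illustrative.

```python
from statistics import variance

x = [7, 1, 10, 8, 4, 12]
n = len(x)
x_bar = sum(x) / n  # 7.0

# Definitional form: squared deviations from the mean, divided by n - 1.
ss_deviations = sum((xi - x_bar) ** 2 for xi in x)  # 80.0
s2_definition = ss_deviations / (n - 1)

# Computational form: sum of squares minus (sum)^2 / n, divided by n - 1.
sum_x = sum(x)                     # 42
sum_x2 = sum(xi ** 2 for xi in x)  # 374
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(s2_definition, s2_shortcut)    # 16.0 16.0
print(s2_definition == variance(x))  # True
print(s2_definition ** 0.5)          # 4.0, the sample standard deviation
```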
Sample Variance
• Calculate the sample variance by averaging
with n-1 instead of n.
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• n-1 is called the degrees of freedom associated with the variance estimate. It is the number of independent pieces of information available for computing variability.
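To make the role of n - 1 concrete, the sketch below reuses the six-value sample (7, 1, 10, 8, 4, 12) and divides the same sum of squared deviations by n and by n - 1. It also shows why only n - 1 deviations are free to vary: they always sum to zero. Variable names are illustrative.

```python
from statistics import pvariance, variance

x = [7, 1, 10, 8, 4, 12]
n = len(x)
x_bar = sum(x) / n
deviations = [xi - x_bar for xi in x]
ss = sum(d ** 2 for d in deviations)  # 80.0

divided_by_n = ss / n                # matches statistics.pvariance
divided_by_n_minus_1 = ss / (n - 1)  # matches statistics.variance

print(round(divided_by_n, 2))  # 13.33
print(divided_by_n_minus_1)    # 16.0

# Deviations from the sample mean sum to zero, so once n - 1 of them are
# known the last one is determined.
print(sum(deviations))  # 0.0
```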
Comparing Standard Deviations
Same mean, but different standard deviations (three dot plots on an axis from 11 to 21):
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57
Coefficient of Variation
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
It is used to compare two or more sets of data
measured in different units
Population: $CV = \frac{\sigma}{\mu} \cdot 100\%$
Sample: $CV = \frac{s}{\bar{x}} \cdot 100\%$
Comparing Coefficients
of Variation
Stock A:
Average price last year = $50
Standard deviation = $5
$CV_A = \frac{s}{\bar{x}} \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
Stock B:
Average price last year = $100
Standard deviation = $5
$CV_B = \frac{s}{\bar{x}} \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less variable relative to its price.
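The two coefficients of variation can be reproduced with a one-line function. This is a minimal sketch; the function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(5, 50)   # Stock A
cv_b = coefficient_of_variation(5, 100)  # Stock B

print(cv_a, cv_b)  # 10.0 5.0
```

Because the CV is unit-free, the same function can compare data sets measured in different units, as long as the mean is not zero.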
Presentation of Data
Tables
Graphs
Frequency displays and Histograms
Stem-leaf display
Stem and Leaf Diagram
Stem | Leaf
12 is shown as 1 | 2
35 is shown as 3 | 5
Example (each value is rounded to the nearest ten and the last digit is dropped):
Stem | Leaf
613 would become 6 | 1
776 would become 7 | 8
...
1224 becomes 12 | 2
Construction of a Stem-Leaf Display
List the stem values, in order, in a vertical column
Draw a vertical line to the right of the stem values
For each observation, record the leaf portion of the
observation in the row corresponding to the appropriate
stem
Reorder the leaves from the lowest to highest within
each stem row
If the number of leaves appearing in each stem is too
large, divide the stems into two groups, the first
corresponding to leaves 0 through 4, and the second
corresponding to leaves 5 through 9. (This subdivision
can be increased to five groups if necessary).
EXAMPLE: Car Battery Life
2.2 4.1 3.5 4.5 3.2 3.7 3.0 2.6
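The construction steps can be sketched in Python, applied here to the eight battery-life values listed above (the rest of the data set is not reproduced). The sketch assumes values with one decimal place, with the integer part as the stem and the tenths digit as the leaf; the function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display rows for values with one decimal place.

    Assumption: the stem is the integer part and the leaf is the tenths digit.
    """
    leaves_by_stem = defaultdict(list)
    for value in values:
        tenths = round(value * 10)       # 2.2 -> 22
        stem, leaf = divmod(tenths, 10)  # 22 -> (2, 2)
        leaves_by_stem[stem].append(leaf)

    rows = []
    # List every stem in order, then the leaves sorted within each stem row.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = sorted(leaves_by_stem.get(stem, []))
        rows.append(f"{stem} | " + " ".join(str(leaf) for leaf in leaves))
    return rows

battery_life = [2.2, 4.1, 3.5, 4.5, 3.2, 3.7, 3.0, 2.6]
for row in stem_and_leaf(battery_life):
    print(row)
# 2 | 2 6
# 3 | 0 2 5 7
# 4 | 1 5
```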
Relative Frequency Histogram of Battery Life
[Figure: histogram; the horizontal axis is divided into bins, and the height of each bar is the percent of values in the bin.]
How Many Class Intervals?
Many (Narrow class intervals)
can give a poor indication of how frequency varies across classes
Few (Wide class intervals)
may compress variation too much
[Figures: two frequency histograms of Temperature, one with many narrow classes (0 to 60 in steps of 4) and one with few wide classes]
General Guidelines
[Figure: three frequency histograms: Skewed to Left, Symmetric, Skewed to Right]
Summary
Basics of descriptive statistics
Tables and graphs
Inferential statistics
Textbook Reading
Chapter 1 (pages 1-28)
Chapter 8 (pages 229-243)