Descriptive Statistics

Central Tendency
Measures of Dispersion
Types of Variables
• Nominal Categories
  - Names or types with no inherent ordering:
    - Religious affiliation
    - Marital status
    - Race
• Ordinal Categories
  - Variables with a rank or order:
    - University rankings
    - Academic position
    - Level of education
Types of Variables (cont.)
• Interval/Ratio Categories
  - Values are ordered, with equal spacing between adjacent values:
    - Salary measured in dollars
    - Age of an individual
    - Number of children
  - Thermometer scores measure intensity of feeling on issues:
    - Views on abortion
    - Whether someone likes China
    - Feelings about US-Japan trade issues
• Any dichotomous variable
  - A variable that takes on the value 0 or 1.
  - A person's gender (M=0, F=1)
• Based on their mathematical properties, data are divided into four groups (NOIR):
  - Nominal
  - Ordinal
  - Interval
  - Ratio
• From nominal to ratio, the scales increase in:
  - accuracy
  - power of measurement
  - precision
  - range of applicable statistical techniques
• Nominal means name and count; data are
alphabetic or numerical in name only
• They are categories without order or
direction
• Their use is restricted to keeping track of
people, objects and events
• They are least powerful in measurement
with no arithmetic origin, order, direction
or distance relationship
• Hence nominal data is of restricted or
limited use
• Gender, marital status, or any alphabetic/numeric code without intrinsic order or ranking

Sl. No.   Subject       Code
1         Physics       P
2         Chemistry     C
3         Mathematics   M
4         Biology       B
• Ordinal means rank or order
• Ordinal data place events in order; they are ordered categories such as rankings or scales
• Ordinal data allow for setting up inequalities, but little more
• Differences between adjacent ranks need not be equal
• Ordinal data have no absolute value (only a relative position in the inequality)
• More precise comparisons are not possible
• Examples: ranks or grades of students; quality rating of a service or product

Sl. No.   Education       Code
1         Undergraduate   U
2         Graduate        G
3         Postgraduate    P
4         Doctorate       D

• Inequalities such as U < G < P < D do not tell us the size of the difference between any two categories, and the differences cannot be assumed equal (say, the difference between U and G need not be the same as that between G and P)
• Interval data, in addition to ranking (setting up inequalities), allow for forming differences
• Interval data have no absolute zero; a unique origin does not exist
• Interval data are more powerful than ordinal data because the intervals are equal
Examples:
• Temperature in Fahrenheit, standardised scores
• Ratio data allow for forming quotients, in addition to setting up inequalities and forming differences
• All mathematical operations (manipulations with real numbers) are possible on ratio data
• Ratio data have an absolute or true zero and represent the actual amount/value
• They are the most precise data and allow for application of all statistical techniques
Examples:
• Height, weight, age
Relation among data types
Measuring Data
• Discrete Data
  - Takes on only integer values
    - Whole numbers, no decimals
    - E.g., number of people in the room
• Continuous Data
  - Takes on any value
    - All numbers, including decimals
    - E.g., rate of population increase
• Numerical data could be either discrete or
continuous
• Continuous data can take any numerical value
(within a range); For example, weight, height,
etc.
• There can be an infinite number of possible
values in continuous data
• Discrete data can take only certain values and move between them in finite 'jumps', i.e., they 'jump' from one value to another without taking any intermediate value (for example, the number of students in a class)
• Continuous data is more precise than discrete

• Continuous data is more informative than discrete
• Continuous data can remove estimation and
rounding of measurements
• Continuous data is often more time consuming to
obtain
• Discrete data should be converted to continuous data where possible, to obtain a higher level of information and detail
Examples of conversion of discrete to
continuous data
Discrete Data
Example: Absenteeism for 50 employees

I          II          III                  IV            V
Absences   Frequency   Relative Frequency
x          f           f/N                  x - \bar{X}   f (x - \bar{X})^2
0          3           .06                  -4.86         70.8
1          2           .04                  -3.86         29.8
2          5           .10                  -2.86         40.9
3          8           .16                  -1.86         27.7
4          7           .14                  -0.86         5.2
5          2           .04                  0.14          .04
6          13          .26                  1.14          16.9
7          2           .04                  2.14          9.2
8          4           .08                  3.14          39.4
9          0           .00                  4.14          0
10         1           .02                  5.14          26.4
11         2           .04                  6.14          75.4
12         0           .00                  7.14          0
13         1           .02                  8.14          66.3
           N=50        1.00                               Σ = 408.02
Discrete Data (cont.)
• Frequency
  - Number of times we observe an event
  - In our example, the number of employees that were absent for a particular number of days.
    - Three employees were absent no days (0) throughout the year
• Relative Frequency
  - Number of times a particular event takes place in relation to the total.
    - For example, about a quarter of the people were absent exactly 6 times in the year.
    - Why do you think this was true?
Discrete Data: Chart
• Another way that we can represent this data is by a chart.
[Chart: frequency (vertical axis, 0 to 14) against number of absences (horizontal axis, 0 to 12)]
• Again we see that more people were absent 6 days than any other single value.
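The frequency and relative-frequency columns can be reproduced with a few lines of Python. This is only a sketch: the variable names are illustrative, and the counts are copied from the absenteeism table.

```python
# Absences (x) mapped to number of employees (f), copied from the absenteeism table.
absence_counts = {0: 3, 1: 2, 2: 5, 3: 8, 4: 7, 5: 2, 6: 13, 7: 2,
                  8: 4, 9: 0, 10: 1, 11: 2, 12: 0, 13: 1}

n_employees = sum(absence_counts.values())  # N, the total number of employees

# Relative frequency is f / N for each value of x.
relative_frequency = {x: f / n_employees for x, f in absence_counts.items()}

# The most frequent value corresponds to the tallest bar in the chart.
most_common_absences = max(absence_counts, key=absence_counts.get)

print(n_employees, relative_frequency[6], most_common_absences)
```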
Continuous Data
Example: Men's Heights

Height      Midpoint   Frequency (f)   Relative Frequency (f/N)   Percentile (Cumulative Frequency)
58.5-61.5   60         4               .02                        .02
61.5-64.5   63         12              .06                        .08
64.5-67.5   66         44              .22                        .30
67.5-70.5   69         64              .32                        .62
70.5-73.5   72         56              .28                        .90
73.5-76.5   75         16              .08                        .98
76.5-79.5   78         4               .02                        1.00
                       N=200           1.00
Continuous Data (cont.)
• Note: must decide how to divide the data
  - The interval you define changes how the data appears.
  - The finer the divisions, the less bias.
  - But more ranges make it difficult to display and interpret.
Chart
• Data divided into 3" intervals
• Can represent information graphically
[Chart: frequency (vertical axis, 0 to 70) against height (horizontal axis, midpoints 60 to 78)]
Continuous Data: Percentiles
• The Nth percentile
  - The point such that N percent of the population lie below it and (100 - N) percent lie above it.
• Height Example: what interval is the 50th percentile? (Use the cumulative frequency column of the men's heights table above.)
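One way to answer the percentile question mechanically is to walk down the cumulative frequencies until the requested share is reached. The sketch below assumes the grouped men's heights table; the function name `percentile_interval` is made up for illustration.

```python
from itertools import accumulate

# (interval, frequency) pairs copied from the men's heights table.
height_table = [("58.5-61.5", 4), ("61.5-64.5", 12), ("64.5-67.5", 44),
                ("67.5-70.5", 64), ("70.5-73.5", 56), ("73.5-76.5", 16),
                ("76.5-79.5", 4)]


def percentile_interval(table, pct):
    """Return the first interval whose cumulative share reaches pct percent."""
    total = sum(f for _, f in table)
    running = accumulate(f for _, f in table)
    for (label, _), cumulative in zip(table, running):
        # Compare in integers to avoid floating-point rounding at the boundary.
        if cumulative * 100 >= pct * total:
            return label
    raise ValueError("pct must be between 0 and 100")


median_interval = percentile_interval(height_table, 50)
print(median_interval)
```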
Continuous Data: Income example
• What interval is the 50th percentile?
Average Income in California Congressional Districts

Cells               Freq   Tally             Relative Frequency
$0-$20,708          1      X                 0.019
$20,708-$25,232     3      XXX               0.057
$25,232-$29,756     7      XXXXXXX           0.135
$29,756-$34,280     15     XXXXXXXXXXXXXXX   0.289
$34,280-$38,805     8      XXXXXXXX          0.154
$38,805-$43,329     4      XXXX              0.077
$43,329-$47,853     8      XXXXXXXX          0.154
$47,853-$52,378     6      XXXXXX            0.115
n = 52
Central Tendency
 In general terms, central tendency is a
statistical measure that determines a
single value that accurately describes the
center of the distribution and represents
the entire distribution of scores.
 The goal of central tendency is to identify
the single value that is the best
representative for the entire set of data.
Central Tendency (cont.)
 By identifying the "average score," central
tendency allows researchers to summarize or
condense a large set of data into a single value.
 Thus, central tendency serves as a descriptive
statistic because it allows researchers to describe
or present a set of data in a very simplified,
concise form.
 In addition, it is possible to compare two (or
more) sets of data by simply comparing the
average score (central tendency) for one set
versus the average score for another set.
The Mean, the Median,
and the Mode
 It is essential that central tendency be
determined by an objective and well-defined
procedure so that others will understand exactly
how the "average" value was obtained and can
duplicate the process.
 No single procedure always produces a good,
representative value. Therefore, researchers
have developed three commonly used techniques
for measuring central tendency: the mean, the
median, and the mode.
Measures of Central Tendency
 Mode
 The category occurring most often.
 Median
 The middle observation or 50th percentile.
 Mean
 The average of the observations
Summary Measures
• Central Tendency: Mean, Median, Mode, Midrange, Midhinge
• Quartile
• Variation: Range, Variance, Standard Deviation, Coefficient of Variation
Central Tendency
• Mean: \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
• Median
• Mode
• Midrange
• Midhinge
The Mean (Arithmetic Average)
•It is the Arithmetic Average of data values:
Sample Mean: \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}
•The Most Common Measure of Central Tendency
•Affected by Extreme Values (Outliers)
[Figure: two dot plots, one on a 0 to 10 axis with Mean = 5 and one on a 0 to 14 axis with an extreme value and Mean = 6]
The Median
•Important Measure of Central Tendency
•In an ordered array, the median is the
“middle” number.
•If n is odd, the median is the middle number.
•If n is even, the median is the average of the 2
middle numbers.
•Not Affected by Extreme Values
[Figure: two dot plots, one on a 0 to 10 axis and one on a 0 to 14 axis with an extreme value; Median = 5 in both]
The Mode
•A Measure of Central Tendency
•Value that Occurs Most Often
•Not Affected by Extreme Values
•There May Not be a Mode
•There May be Several Modes
•Used for Either Numerical or Categorical Data
[Figure: two dot plots, one on a 0 to 14 axis with Mode = 9 and one on a 0 to 6 axis with no mode]
Central Tendency: Mode
 The category that occurs most often.
• Mode of the absentee example? 6
• Mode of the height example? 69
• What is the mode of the income example? $29-$34 thousand (the $29,756-$34,280 cell)
Central Tendency: Median
 The middle observation or 50th percentile.
• Median of the height example?
• Median of the absentee example?
(Both questions use the height and absentee tables shown earlier.)
Central Tendency: Mean
• The average of the observations.
• Defined as: (the sum of observations) / (number of observations)
• This is denoted as: \bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}
Central Tendency: Mean (cont.)
• Notation:
  - The X_i are the different observations
  - Let X_i stand for each individual's height:
    - X_1 (Sam), X_2 (Oliver), X_3 (Buddy), or observation 1, 2, 3, ...
  - X stands for height in general
  - X_1 stands for the height of the first person.
• When we write \sum (the sum) of X_i, we are adding up all the observations.
• Example: Absenteeism
  - What is the average number of days any given worker is absent in the year?
  - First, add the fifty individuals' total days absent (243).
  - Then, divide the sum (243) by the total number of individuals (50).
  - We get 243/50, which equals 4.86.
• Alternatively use the frequency
  - Add each event times the number of occurrences
  - 0*3 + 1*2 + ... (see lect2.xls)
  - \bar{X} = \frac{\sum_{i} x_i f_i}{N}, where the sum runs over the distinct values x_i in column I of the absentee table and f_i is the frequency in column II
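The frequency form of the mean can be checked directly from columns I and II of the absentee table. The following is a minimal sketch; the dictionary layout and names are assumptions made for illustration.

```python
# x (absences) mapped to f (number of employees), from columns I and II.
absence_counts = {0: 3, 1: 2, 2: 5, 3: 8, 4: 7, 5: 2, 6: 13, 7: 2,
                  8: 4, 9: 0, 10: 1, 11: 2, 12: 0, 13: 1}

n = sum(absence_counts.values())                             # N
total_days = sum(x * f for x, f in absence_counts.items())   # sum of x_i * f_i
mean_absences = total_days / n                               # frequency-weighted mean

print(total_days, n, mean_absences)
```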
Relative Position of Measures
• For Symmetric Distributions
  - If your population distribution is symmetric and unimodal (i.e., with a hump in the middle), then all three measures coincide.
[Chart: frequency (vertical axis, 0 to 70) against height (horizontal axis, midpoints 60 to 78)]
• Mode = 69
• Median = 69
• Mean = 483/7 = 69
Relative Position of Measures
 For Asymmetric Distributions
 If the data are skewed, however, then the measures of
central tendency will not necessarily line up.
[Chart: frequency (vertical axis, 0 to 14) against number of absences (horizontal axis, 0 to 12)]
Mode = 6, Median = 4.5, Mean = 4.86
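The three measures for the skewed absentee data can be recomputed by expanding the frequency table into one value per employee and using Python's standard `statistics` module. This is a sketch for checking the figures, not part of the original lecture.

```python
import statistics

# x (absences) mapped to f (number of employees), from the absentee table.
absence_counts = {0: 3, 1: 2, 2: 5, 3: 8, 4: 7, 5: 2, 6: 13, 7: 2,
                  8: 4, 9: 0, 10: 1, 11: 2, 12: 0, 13: 1}

# Expand the table into one entry per employee (50 values in total).
days_absent = [x for x, f in absence_counts.items() for _ in range(f)]

mode_days = statistics.mode(days_absent)
median_days = statistics.median(days_absent)
mean_days = statistics.mean(days_absent)

# In a skewed distribution the three measures need not coincide.
print(mode_days, median_days, mean_days)
```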

The Mean
 The mean is the most commonly used
measure of central tendency.
 Computation of the mean requires scores
that are numerical values measured on an
interval or ratio scale.
 The mean is obtained by computing the
sum, or total, for the entire set of scores,
then dividing this sum by the number of
scores.
The Mean (cont.)
Conceptually, the mean can also be defined as:
1. The mean is the amount that each individual
receives when the total (ΣX) is divided equally
among all N individuals.
2. The mean is the balance point of the distribution
because the sum of the distances below the
mean is exactly equal to the sum of the
distances above the mean.
Changing the Mean
 Because the calculation of the mean involves
every score in the distribution, changing the
value of any score will change the value of the
mean.
 Modifying a distribution by discarding scores or
by adding new scores will usually change the
value of the mean.
 To determine how the mean will be affected for
any specific situation you must consider: 1) how
the number of scores is affected, and 2) how the
sum of the scores is affected.
Changing the Mean (cont.)
 If a constant value is added to every score
in a distribution, then the same constant
value is added to the mean. Also, if every
score is multiplied by a constant value,
then the mean is also multiplied by the
same constant value.
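Both rules are easy to confirm numerically. The scores below are hypothetical values chosen only for illustration.

```python
from statistics import mean

scores = [2, 4, 6, 8]                 # hypothetical scores

shifted = [x + 3 for x in scores]     # add the constant 3 to every score
scaled = [x * 2 for x in scores]      # multiply every score by the constant 2

original_mean = mean(scores)
shifted_mean = mean(shifted)          # equals original_mean + 3
scaled_mean = mean(scaled)            # equals original_mean * 2

print(original_mean, shifted_mean, scaled_mean)
```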
When the Mean Won’t Work
 Although the mean is the most commonly used
measure of central tendency, there are situations
where the mean does not provide a good,
representative value, and there are situations
where you cannot compute a mean at all.
 When a distribution contains a few extreme
scores (or is very skewed), the mean will be
pulled toward the extremes (displaced toward
the tail). In this case, the mean will not provide
a "central" value.
When the Mean Won’t Work
(cont.)
 With data from a nominal scale it is
impossible to compute a mean, and when
data are measured on an ordinal scale
(ranks), it is usually inappropriate to
compute a mean.
 Thus, the mean does not always work as a
measure of central tendency and it is
necessary to have alternative procedures
available.
The Median
 If the scores in a distribution are listed in order
from smallest to largest, the median is defined as
the midpoint of the list.
 The median divides the scores so that 50% of
the scores in the distribution have values that
are equal to or less than the median.
 Computation of the median requires scores that
can be placed in rank order (smallest to largest)
and are measured on an ordinal, interval, or ratio
scale.
The Median (cont.)
Usually, the median can be found by a
simple counting procedure:
1. With an odd number of scores, list the
values in order, and the median is the
middle score in the list.
2. With an even number of scores, list the
values in order, and the median is half-way
between the middle two scores.
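The counting procedure above can be sketched as a short function. The function name and the sample lists are illustrative assumptions.

```python
def median_by_counting(scores):
    """Median via the counting procedure: sort, then take the middle score(s)."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd number of scores: the single middle score.
        return ordered[middle]
    # Even number of scores: half-way between the two middle scores.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_example = median_by_counting([7, 1, 5])        # hypothetical scores
even_example = median_by_counting([7, 1, 5, 3])    # hypothetical scores
print(odd_example, even_example)
```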
The Median (cont.)
 If the scores are measurements of a
continuous variable, it is possible to find
the median by first placing the scores in a
frequency distribution histogram with each
score represented by a box in the graph.
 Then, draw a vertical line through the
distribution so that exactly half the boxes
are on each side of the line. The median
is defined by the location of the line.
The Median (cont.)
 One advantage of the median is that it is
relatively unaffected by extreme scores.
 Thus, the median tends to stay in the
"center" of the distribution even when
there are a few extreme scores or when
the distribution is very skewed. In these
situations, the median serves as a good
alternative to the mean.
The Mode
 The mode is defined as the most frequently
occurring category or score in the distribution.
 In a frequency distribution graph, the mode is
the category or score corresponding to the peak
or high point of the distribution.
 The mode can be determined for data measured
on any scale of measurement: nominal, ordinal,
interval, or ratio.
The Mode (cont.)
 The primary value of the mode is that it is
the only measure of central tendency that
can be used for data measured on a
nominal scale. In addition, the mode often
is used as a supplemental measure of
central tendency that is reported along
with the mean or the median.
Bimodal Distributions
 It is possible for a distribution to have more than
one mode. Such a distribution is called bimodal.
(Note that a distribution can have only one mean
and only one median.)
 In addition, the term "mode" is often used to
describe a peak in a distribution that is not really
the highest point. Thus, a distribution may have
a major mode at the highest peak and a minor
mode at a secondary peak in a different location.
Central Tendency and the
Shape of the Distribution
 Because the mean, the median, and the
mode are all measuring central tendency,
the three measures are often
systematically related to each other.
 In a symmetrical distribution, for example,
the mean and median will always be equal.
Central Tendency and the
Shape of the Distribution (cont.)
 If a symmetrical distribution has only one
mode, the mode, mean, and median will
all have the same value.
 In a skewed distribution, the mode will be
located at the peak on one side and the
mean usually will be displaced toward the
tail on the other side.
 The median is usually located between the
mean and the mode.
Reporting Central Tendency in
Research Reports
 In manuscripts and in published research
reports, the sample mean is identified with the
letter M.
 There is no standardized notation for reporting
the median or the mode.
 In research situations where several means are
obtained for different groups or for different
treatment conditions, it is common to present all
of the means in a single graph.
Reporting Central Tendency in
Research Reports (cont.)
 The different groups or treatment
conditions are listed along the horizontal
axis and the means are displayed by a bar
or a point above each of the groups.
 The height of the bar (or point) indicates
the value of the mean for each group.
Similar graphs are also used to show
several medians in one display.
Which measures for which variables?
 Nominal variables which measure(s) of
central tendency is appropriate?
 Only the mode
 Ordinal variables which measure(s) of
central tendency is appropriate?
 Mode or median
 Interval-ratio variables which
measure(s) of central tendency is
appropriate?
 All three measures
Midrange
•A Measure of Central Tendency
•Average of Smallest and Largest Observation:
Midrange = \frac{x_{largest} + x_{smallest}}{2}
•Affected by Extreme Value
[Figure: two dot plots on a 0 to 10 axis; Midrange = 5 in both]
Quartiles
• Not a Measure of Central Tendency
• Split Ordered Data into 4 Quarters: 25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
• Position of i-th Quartile: position of point Q_i = \frac{i(n+1)}{4}
Data in Ordered Array: 11 12 13 16 16 17 18 21 22
Position of Q1 = \frac{1 \cdot (9 + 1)}{4} = 2.50, so Q1 = 12.5
Midhinge
• A Measure of Central Tendency
• The Middle Point of the 1st and 3rd Quartiles: Midhinge = \frac{Q_1 + Q_3}{2}
• Not Affected by Extreme Values
Data in Ordered Array: 11 12 13 16 16 17 18 21 22
Midhinge = \frac{Q_1 + Q_3}{2} = \frac{12.5 + 19.5}{2} = 16
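The quartile position rule i(n+1)/4 and the midhinge can be sketched as follows. Linear interpolation between neighbouring values at a fractional position is an assumption here; it agrees with the slide's Q1 = 12.5 at position 2.5, but the lecture does not spell out a general rule. The helper name `quartile` is illustrative.

```python
def quartile(ordered, i):
    """Value at 1-based position i(n+1)/4, interpolating between neighbours.

    Assumes the data are already sorted and contain at least three values.
    """
    n = len(ordered)
    position = i * (n + 1) / 4       # may be fractional, e.g. 2.5
    lower = int(position)            # rank just below the position
    fraction = position - lower
    if fraction == 0 or lower >= n:
        return float(ordered[min(lower, n) - 1])
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


data = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # ordered array from the slide
q1 = quartile(data, 1)
q3 = quartile(data, 3)
midhinge = (q1 + q3) / 2
print(q1, q3, midhinge)
```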
The Range
• Measure of Variation
• Difference Between Largest &
Smallest Observations:
Range = x_{largest} - x_{smallest}
• Ignores How Data Are Distributed:
[Figure: two differently distributed data sets on a 7 to 12 axis; Range = 12 - 7 = 5 in both]
Interquartile Range
• Measure of Variation
• Also Known as Mid-spread:
Spread in the Middle 50%
• Difference Between Third & First Quartiles: Interquartile Range = Q_3 - Q_1
Data in Ordered Array: 11 12 13 16 16 17 17 18 21
Q_3 - Q_1 = 17.5 - 12.5 = 5
• Not Affected by Extreme Values
Variance
•Important Measure of Variation
•Shows Variation About the Mean:
•For the Population: \sigma^2 = \frac{\sum (X_i - \mu)^2}{N}
•For the Sample: s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}
For the Population: use N in the denominator.
For the Sample: use n - 1 in the denominator.
Comparing Standard Deviations
Data: X_i: 10 12 14 15 17 18 18 24
N = 8, Mean = 16
s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} = \sqrt{130/7} = 4.3095
\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} = \sqrt{130/8} = 4.0311
The value of the standard deviation is larger when the data are treated as a sample.
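The two standard deviations can be recomputed with Python's standard `statistics` module, where `stdev` divides by n - 1 and `pstdev` divides by N. For these eight values the sum of squared deviations from the mean of 16 is 130. This is a sketch for checking the arithmetic.

```python
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]

mean_value = statistics.mean(data)
sum_squared_dev = sum((x - mean_value) ** 2 for x in data)

sample_sd = statistics.stdev(data)        # denominator n - 1
population_sd = statistics.pstdev(data)   # denominator N

# The sample value is always the larger of the two for the same data.
print(mean_value, sum_squared_dev, sample_sd, population_sd)
```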
Comparing Standard Deviations
[Figure: three dot plots, Data A, Data B and Data C, on a common 11 to 21 axis]
• Data A: Mean = 15.5, s = 3.338
• Data B: Mean = 15.5, s = .9258
• Data C: Mean = 15.5, s = 4.57
Coefficient of Variation
• Measure of Relative Variation
• Always a %
• Shows Variation Relative to Mean
• Used to Compare 2 or More Groups
• Formula (for Sample): CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\%
Comparing Coefficient of Variation
• Stock A: Average Price last year = $50
  - Standard Deviation = $5
• Stock B: Average Price last year = $100
  - Standard Deviation = $5
Coefficient of Variation: CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\%
• Stock A: CV = 10%
• Stock B: CV = 5%
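The stock comparison is a one-line calculation. In this sketch the function name `coefficient_of_variation` is illustrative; the prices and standard deviations are those given for Stock A and Stock B.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean


cv_stock_a = coefficient_of_variation(std_dev=5, mean=50)
cv_stock_b = coefficient_of_variation(std_dev=5, mean=100)

# Same standard deviation in dollars, but Stock A varies more relative to its price.
print(cv_stock_a, cv_stock_b)
```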
Shape
• Describes How Data Are Distributed
• Measures of Shape:
  - Symmetric or skewed
[Figure: three distribution curves]
  - Left-Skewed: Mean < Median < Mode
  - Symmetric: Mean = Median = Mode
  - Right-Skewed: Mode < Median < Mean
Box-and-Whisker Plot
• Graphical Display of Data Using 5-Number Summary: X_smallest, Q1, Median, Q3, X_largest
[Figure: box-and-whisker plot on an axis from 4 to 12]
Distribution Shape & Box-and-Whisker Plots
[Figure: box-and-whisker plots marking Q1, Median and Q3 for left-skewed, symmetric and right-skewed distributions]
Measures of Deviation (cont.)
• Need measures of the deviation or distribution of the data.
[Figure: a unimodal and a bimodal distribution]
• Very different distributions, even though they have the same mean and median.
• Need a measure of dispersion or spread.
• Range
  - The difference between the largest and smallest observation.
  - Limited measure of dispersion
• Inter-Quartile Range
  - The difference between the 75th percentile and the 25th percentile in the data.
[Figure: unimodal and bimodal distributions with the 25% and 75% points marked]
  - Better because it divides the data finer.
  - But it still only uses two observations
  - Would like to incorporate all the data, if possible.
• Mean Absolute Deviation
  - The absolute distance between each data point and the mean, then take the average.
  - MAD = \frac{\sum_{i=1}^{N} |X_i - \bar{X}|}{N}
[Figure: small absolute distance from mean (unimodal) and large absolute distance from mean (bimodal)]
• Mean Squared Deviation
  - The difference between each observation and the mean, quantity squared.
  - \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}
  - This is the variance of a distribution.
• Variance of a Distribution
  - Note since we are using the square, the greatest deviations will be weighted more heavily.
  - If we use a frequency table, as in the absenteeism case, then we would write this as:
  - \sigma^2 = \frac{\sum_{i} f_i (x_i - \bar{X})^2}{N}
• Standard deviation
  - The square root of the variance:
  - \sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}
  - Gives common units to compare deviation from the mean.
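The mean absolute deviation, variance and standard deviation formulas can be compared side by side on a small data set. The scores below are hypothetical values chosen so the results come out to round numbers; all three measures use N in the denominator, as in the formulas above.

```python
import math

scores = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical data
n = len(scores)
mean_score = sum(scores) / n

# Mean absolute deviation: average absolute distance from the mean.
mad = sum(abs(x - mean_score) for x in scores) / n

# Variance: average squared distance from the mean (population form, divide by N).
variance = sum((x - mean_score) ** 2 for x in scores) / n

# Standard deviation: square root of the variance, back in the original units.
std_dev = math.sqrt(variance)

print(mean_score, mad, variance, std_dev)
```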
Absentee Example: Variances
\sigma^2 = \frac{\sum_{i} f_i (x_i - \bar{X})^2}{N}, \sigma = \sqrt{\frac{\sum_{i} f_i (x_i - \bar{X})^2}{N}}
• From above, the mean = 4.86.
• To calculate the variance, take the difference between each observation and the mean (column IV of the absentee table).
• Column V then squares that difference and multiplies it by the number of times the event occurred.
• Variance = 408.02 / 50 = 8.16
• The Standard Deviation = \sqrt{8.16} = 2.86
Descriptive Statistics for Samples
• Unbiased estimator of the variance, and the corresponding standard deviation:
s^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1} and s = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}}
Why n-1?
• We use n-1 pieces of data to calculate the variance because we already used 1 piece of information to calculate the mean.
• So to make the variance estimate unbiased, subtract one from the denominator.
Descriptive Statistics for Samples (cont.)
• Absentee Example:
  - If sampling from a population, we would calculate s^2 and s with N - 1 in the denominator:
  - Variance = 408.02 / 49 = 8.33
  - Standard Deviation = \sqrt{8.33} = 2.89
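The absentee variance can be recomputed from the frequency table with either denominator. This sketch uses illustrative names and takes the counts from columns I and II of the absentee table.

```python
import math

# x (absences) mapped to f (number of employees), from the absentee table.
absence_counts = {0: 3, 1: 2, 2: 5, 3: 8, 4: 7, 5: 2, 6: 13, 7: 2,
                  8: 4, 9: 0, 10: 1, 11: 2, 12: 0, 13: 1}

n = sum(absence_counts.values())
mean_absences = sum(x * f for x, f in absence_counts.items()) / n

# Column V summed: each squared deviation weighted by its frequency.
sum_squares = sum(f * (x - mean_absences) ** 2 for x, f in absence_counts.items())

population_variance = sum_squares / n        # divide by N
sample_variance = sum_squares / (n - 1)      # divide by N - 1

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(round(sum_squares, 2), round(population_variance, 2), round(sample_variance, 2))
```

With 50 observations the two denominators differ by only one, so the two estimates are already close.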
Descriptive Statistics for Samples (con’t)
 Sample estimators approximate population

estimates:
 Notice from the formulas, as n gets large:
 the sample size converges toward the underlying
population, and
 the difference between the estimates gets negligible.
 This is known as the law of large numbers.
 We will discuss its relevance in more detail in later
lectures.
 Mostly dealing in samples
 Use the (n-1) formula, unless told otherwise.
 Don’t worry the computer does this automatically!
Examples of data presentation: use it, don't abuse it
• Start axes at zero.
• Misleading comparisons
  - Government spending not taking into account inflation.
    - Nominal vs. Real
  - Selecting a particular base year.
    - For example, a university presenting a budget breakdown showed that since 1986 the number of staff increased only slightly.
    - But they failed to mention the huge increase during the 5 years before.
Outline
 Measures of Central Tendency
 Mean
 Median
 Mode
 Descriptive Statistics
 Range
 Standard Deviation
 Variance
COMPUTING AND
UNDERSTANDING AVERAGES
Example Data Set
The following are the number of calls run per year by the Anywhere Fire Department.
Year Number of Calls
2000 1231
2001 1342
2002 1423
2003 986
2004 1354
2005 1266
2006 1521
2007 1453
2008 1312
2009 1389
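As a quick illustration, the mean and median of the yearly call counts in the table above can be computed with Python's standard `statistics` module. The variable names are assumptions made for this sketch.

```python
import statistics

# Year mapped to number of calls, copied from the example data set.
calls_per_year = {2000: 1231, 2001: 1342, 2002: 1423, 2003: 986, 2004: 1354,
                  2005: 1266, 2006: 1521, 2007: 1453, 2008: 1312, 2009: 1389}

calls = list(calls_per_year.values())

mean_calls = statistics.mean(calls)       # sum of calls divided by 10 years
median_calls = statistics.median(calls)   # half-way between the 5th and 6th ordered values

print(mean_calls, median_calls)
```

The unusually low 2003 count pulls the mean below the median.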
The AVERAGE is a single score that represents a

set of scores
Averages are also known as “Measures of Central
Tendency”
Three different ways to describe the distribution of
a set of scores…
 Mean – typical average score
 Median – middle score
 Mode – most common score
Computing the Mean
Formula for computing the mean:
\bar{X} = \frac{\sum X}{n}
• "X bar" (\bar{X}) is the mean value of the group of scores
• "\sum" (sigma) tells you to add whatever follows it
• X is each individual score in the group
• n is the sample size
Computing the Mean
 5 students scored the following on their
quizzes: 79, 83, 65, 98, and 86
 The average (X-bar) is the sum of the
scores (ΣX) divided by the number of
students (n)
 The average quiz score for this group of

students was 82.2
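The same calculation in Python, as a minimal sketch using the five quiz scores from the slide:

```python
quiz_scores = [79, 83, 65, 98, 86]

total = sum(quiz_scores)          # the sum of the scores
n = len(quiz_scores)              # the number of students
mean_score = total / n            # X-bar

print(total, n, mean_score)
```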
Using the AVERAGE function
Select the cell for the AVERAGE function

Create a formula to average the three values
 =(A1+A2+A3)/3
OR type the AVERAGE function
 =AVERAGE(A1:A3)
More Excel
Arithmetic Mean
Sum of the deviation is equal to zero
Geometric Mean
GEOMEAN uses multiplication instead of addition
Moving Mean
More accurate…good for unique distributions
Weighted Mean
Accounts for the frequency of a score’s occurrence
Weighted Mean Example
Using Excel to Compute a Weighted Mean

Weighted Mean Example
The Computation of a Weighted Mean

Computing the Median
Median = point/score at which 50% of scores fall

above and 50% fall below
No standard formula
 Rank order scores from highest to lowest or lowest to
highest
 Find the “middle” score
BUT…
 What if there are two middle scores?
 What if the two middle scores are the same?
Using the MEDIAN function
 Select the cell and type the MEDIAN function

 =MEDIAN(A2:A7)
Computing the Mode
Mode = most frequently occurring score
No formula
 List all values in the distribution
 Tally the number of times each value occurs
 The value occurring the most is the mode
Democrats = 90
Republicans = 70
Independents = 140 – the MODE!!
 When two values occur the same number of times --

Bimodal distribution
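The tallying procedure can be sketched with a small helper that returns every category tied for the highest count, so a bimodal case yields two values. The function name `modes` and the tied example are illustrative assumptions; the party counts are those given above.

```python
def modes(tally):
    """Return all categories that share the highest count."""
    highest = max(tally.values())
    return [category for category, count in tally.items() if count == highest]


party_counts = {"Democrats": 90, "Republicans": 70, "Independents": 140}
party_mode = modes(party_counts)

# Hypothetical tie: two categories occur equally often, giving a bimodal result.
tied_counts = {"A": 12, "B": 7, "C": 12}
tied_modes = modes(tied_counts)

print(party_mode, tied_modes)
```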
Using the MODE function
=MODE(A2:A20)
Descriptive Statistics Toolpak
The Descriptive Statistics Dialog Box

Descriptive Statistics Toolpak
The New and Improved Descriptive Statistics Output

Salkind, Chapter 3
DESCRIPTIVE STATISTICS
Why Variability Is Important
• Variability is how different the scores are
from one particular score
• Spread
• Dispersion
• What is the score of interest here?
• The MEAN!!
So…variability is really a measure of how
each score in a group of scores differs
from the mean of that set of scores.
Measures of Variability
• Three types of variability examine the amount
of spread or dispersion in a group of scores
 Range
 Standard Deviation
 Variance
• Typically report the average and the
variability together to describe a distribution
Computing the Range
• Range is the most general estimate of
variability
• Two types:
 Exclusive Range
• R=h-l
 Inclusive Range
• R=h–l+1
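Both versions of the range follow directly from the highest score h and the lowest score l. The sketch below reuses the five quiz scores from the mean example; the function names are illustrative.

```python
def exclusive_range(scores):
    """R = h - l."""
    return max(scores) - min(scores)


def inclusive_range(scores):
    """R = h - l + 1."""
    return max(scores) - min(scores) + 1


quiz_scores = [79, 83, 65, 98, 86]
r_exclusive = exclusive_range(quiz_scores)
r_inclusive = inclusive_range(quiz_scores)

print(r_exclusive, r_inclusive)
```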
Computing Standard
Deviation
• Standard deviation (SD) is the most
frequently reported measure of variability
• SD = average amount of variability in a set of

scores
Using Excel’s STDEV
Function
Data for the STDEV Function

Using Excel’s STDEV
Function
Using the STDEV Function

Why n – 1?
• The standard deviation is intended to be an
estimate of the POPULATION standard
deviation
 We want it to be an unbiased estimate
 Subtracting 1 from n artificially inflates the SD,
making it larger
• In other words, we want to be conservative in
our estimate of the population
Why n – 1?
Comparing the STDEV and STDEVP Functions

Things to Remember…
• Standard deviation is computed as the average
distance from the mean
• The larger the standard deviation, the greater the
variability
• Like the mean, standard deviation is sensitive to
extreme scores
• If s = 0, then there is no variability among scores;
they must all be the same value
Computing Variance
• Variance = standard deviation squared
• So…what do these symbols represent?

Does the formula look familiar?
Using Excel’s VAR Function
Computing the Variance

Standard Deviation or Variance
• Although the formulas are quite similar,

the two are also quite different
 Standard deviation is stated in original units
 Variance is stated in units that are squared
 Which do you think is easier to interpret???
Concepts Covered
 Statistics
- Descriptive Statistics
- Histograms
- Hypothesis Testing
- Scatter Plots
- Regression Analysis
To Set Up Statistical Package
 Click File Tab, and Then Click Options.
 Click Add-ins. In View and Manage Box,
Select Analysis ToolPak.
 Click Go.
 In the Add-Ins Available Box, Select Analysis
ToolPak Check Box and Click OK. (If ToolPak
Is Not Listed, Click Browse to Locate It.)
Using Excel:
Descriptive Statistics
 Click Data/Data Analysis (Far Right) /Descriptive
Statistics & OK.
 Put Checkmarks on Summary Statistics, 95% or
99% Confidence Interval, & Labels in First Row
Boxes.
 Move Cursor to Input Range Window, Highlight
Data to Analyze including Labels, & Click OK.
 Your Data will Appear on New Worksheet.
 Widen Columns by Clicking
Home/Format/AutoFit Column Width.
Using Excel:
Constructing Histograms
 Click Data/Data Analysis/Histogram & OK.
 Put Checkmarks on Chart Output & New
Worksheet Boxes.
 Move Cursor to Input Range Window, Highlight
Data Going into Histogram.
 Move Cursor to Input Bin Range, Highlight Data
Showing Upper Value of Each Bin & Click OK.
 Histogram will be on New Worksheet. You May
Lengthen it by Clicking Blank Space in Window,
Moving Cursor to Window Bottom Line & Holding
Down Mouse Button as You Pull Down Window.
Using Excel:
Hypothesis Testing
 Go to Sheet One.
 Click Data/Data Analysis/ and the Appropriate
Statistical Test. Then Click OK.
 On New Window Check Labels Box and Put
Cursor on Variable 1 Range.
 Highlight Variable 1 Data Including Label.
 Put Cursor on Variable 2 Range & Highlight
Variable 2 Data (Including Label). Then Click
OK.
 Click Home/Format/AutoFit/Column Width
Using Excel:
Scatter Plots
 Highlight Data (Be Sure X Values are in Left
Column and Y Values are in Right Column).
 Click Insert/Scatter. Pull down menu and click
Upper Left Icon.
 Click a Datum Point on Chart with Right Mouse
Key, Add Trendline, & Click Linear.
Using Excel:
Regression Analysis
 Click Data/Data Analysis (On Far Right)
/Regression & Click OK.
 On New Window Check Labels Box and Put
Cursor on X Range.
 Highlight X Data Including Label.
 Put Cursor on Y Range & Highlight Y Data
(Including Label), Then Click OK.
 Click Home/Format/AutoFit Column Width.
