Descriptive Statistics
Measures of Dispersion
Types of Variables
Nominal Categories
Names or types with no inherent ordering:
Religious affiliation
Marital status
Race
Ordinal Categories
Variables with a rank or order
University rankings
Academic position
Level of education
Types of Variables (con’t)
Interval/Ratio Categories
Variables are ordered, with equal spacing
between adjacent values
Salary measured in dollars
Age of an individual
Number of children
Thermometer scores measure intensity on issues
Views on abortion
Whether someone likes China
Feelings about US-Japan trade issues
Any dichotomous variable
A variable that takes on the value 0 or 1.
A person’s gender (M=0, F=1)
• Based on their mathematical properties, data
are divided into four groups: NOIR
Nominal
Ordinal
Interval
Ratio
• They are listed in order of increasing
accuracy
power of measurement
precision
range of applicable statistical techniques
• Nominal means name and count; data are
alphabetic or numerical in name only
• They are categories without order or
direction
• Their use is restricted to keeping track of
people, objects and events
• They are least powerful in measurement
with no arithmetic origin, order, direction
or distance relationship
• Hence nominal data is of restricted or
limited use
• Gender, marital status or any
alphabetic/ numeric code without
intrinsic order or ranking
Sl. No. Subject Code
1 Physics P
2 Chemistry C
3 Mathematics M
4 Biology B
• Ordinal means rank or order
• Ordinal data place events in order; they are
ordered categories such as rankings or scalings
• Ordinal data allow for setting up
inequalities and not much more
• Adjacent ranks need not be equal in their
differences
• Has no absolute value (only relative
position in the inequality)
• More precise comparisons are not possible
• Ranks or grades of students; quality
rating of a service or product
Sl. No. Education Code
1 Undergraduate U
2 Graduate G
3 Postgraduate P
4 Doctorate D
• Inequalities like U < G < P < D do not tell us
the differences between any two of them; the
differences cannot be said to be the same (say, the
difference between U and G is not the same as
that between G and P)
• Interval data in addition to ranking
(setting up inequalities) further allow for
forming differences
• For interval data there is no absolute
zero; a unique origin does not exist
• Interval data are more powerful than
ordinal data due to the equality of intervals
Examples:
• Temperature in Fahrenheit,
Standardised scores
• Ratio data allow for forming quotients in
addition to setting up inequalities and forming
differences
• All mathematical operations (manipulations
with real numbers) are possible on ratio data
• Ratio data have an absolute (true) zero and
represent the actual amount/value
• They are the most precise data and allow for
the application of all statistical techniques
Examples:
• Height, weight, age
Relation among data types
Measuring Data
Discrete Data
Takes on only integer values
whole numbers no decimals
E.g., number of people in the room
Continuous Data
Takes on any value
All numbers including decimals
E.g., Rate of population increase
• Numerical data could be either discrete or
continuous
• Continuous data can take any numerical value
(within a range); For example, weight, height,
etc.
• There can be an infinite number of possible
values in continuous data
• Discrete data can take only certain values,
moving in finite ‘jumps’ from one value to the
next without taking any intermediate value
between them (for example, the number of
students in the class)
Discrete Data (con’t)
Absences   Frequency (employees)   Relative Frequency
x          f                       f/N                  x - X̄     f(x - X̄)^2
0          3                       .06                  -4.86     70.8
1          2                       .04                  -3.86     29.8
2          5                       .10                  -2.86     40.9
3          8                       .16                  -1.86     27.7
4          7                       .14                  -0.86     5.2
5          2                       .04                  0.14      .04
6          13                      .26                  1.14      16.9
7          2                       .04                  2.14      9.2
8          4                       .08                  3.14      39.4
9          0                       .00                  4.14      0
10         1                       .02                  5.14      26.4
11         2                       .04                  6.14      75.4
12         0                       .00                  7.14      0
13         1                       .02                  8.14      66.3
           N=50                    1.00                           Sum = 408.02
Chart
Can represent information graphically
Data divided into 3” intervals
[Figure: histogram of the height data; x-axis: Height (60 to 78), y-axis: Frequency]
Continuous Data: Percentiles
The nth percentile
The point such that n percent of the
population lie below it and (100 - n) percent
lie above it.
Height Example
Height       Midpoint   Frequency (f)   Relative Frequency (f/N)   Percentile (Cumulative Frequency)
58.5-61.5    60         4               .02                        .02
61.5-64.5    63         12              .06                        .08
What interval is the
n = 52
Central Tendency
In general terms, central tendency is a
statistical measure that determines a
single value that accurately describes the
center of the distribution and represents
the entire distribution of scores.
The goal of central tendency is to identify
the single value that is the best
representative for the entire set of data.
Central Tendency (cont.)
By identifying the "average score," central
tendency allows researchers to summarize or
condense a large set of data into a single value.
Thus, central tendency serves as a descriptive
statistic because it allows researchers to describe
or present a set of data in a very simplified,
concise form.
In addition, it is possible to compare two (or
more) sets of data by simply comparing the
average score (central tendency) for one set
versus the average score for another set.
The Mean, the Median,
and the Mode
It is essential that central tendency be
determined by an objective and well-defined
procedure so that others will understand exactly
how the "average" value was obtained and can
duplicate the process.
No single procedure always produces a good,
representative value. Therefore, researchers
have developed three commonly used techniques
for measuring central tendency: the mean, the
median, and the mode.
Measures of Central Tendency
Mode
The category occurring most often.
Median
The middle observation or 50th percentile.
Mean
The average of the observations
Summary Measures
Central Tendency: Mean, Median, Mode, Midrange, Midhinge
Variation: Range, Variance, Standard Deviation, Coefficient of Variation
The Mean (Arithmetic Average)
•It is the Arithmetic Average of data values:
Sample Mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
•The Most Common Measure of Central Tendency
•Affected by Extreme Values (Outliers)
[Figure: two number-line plots. Axis 0 to 10: Mean = 5. Axis 0 to 14, with an extreme value: Mean = 6.]
The Median
•Important Measure of Central Tendency
•In an ordered array, the median is the
“middle” number.
•If n is odd, the median is the middle number.
•If n is even, the median is the average of the 2
middle numbers.
•Not Affected by Extreme Values
[Figure: two number-line plots. Axis 0 to 10: Median = 5. Axis 0 to 14, with an extreme value: Median = 5.]
The Mode
•A Measure of Central Tendency
•Value that Occurs Most Often
•Not Affected by Extreme Values
•There May Not be a Mode
•There May be Several Modes
•Used for Either Numerical or Categorical Data
[Figure: two number-line plots. Axis 0 to 14: Mode = 9. Axis 0 to 6: No Mode.]
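The three cases on this slide (one mode, no mode, several modes) can be sketched in Python. This is only an illustration: the data sets are invented, and treating "every value occurs once" as "no mode" is an assumed convention, not something the slide defines.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest count.

    Returns an empty list when the input is empty or when no value
    repeats (assumed here to mean "no mode").
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical data, invented for illustration only.
one_mode = [0, 1, 3, 5, 9, 9, 9, 12, 14]
no_mode = [0, 1, 2, 3, 4, 5, 6]
two_modes = [1, 2, 2, 3, 4, 4, 5]
colors = ["red", "blue", "blue", "green"]  # categorical data works too

print(modes(one_mode), modes(no_mode), modes(two_modes), modes(colors))
```

Because the function only counts occurrences, it needs no arithmetic on the values, which is why the mode is usable for categorical data.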
Central Tendency: Mode
The category that occurs most often.
Mode of the height example? 69
Mode of the absences example?
[Figures: frequency histograms of the height example and of the absences example (x-axis: Absences, 0 to 12; y-axis: Frequency)]
n = 52
Central Tendency: Median
The middle observation or 50th percentile.
N
Central Tendency: Mean (con’t)
Notation:
X_i’s are different observations
Example: Absenteeism
What is the average number of days
any given worker is absent in the
year?
First, add up the total number of days
absent (243).
Then, divide the sum (243) by the total
number of individuals (50).
We get 243/50, which equals 4.86.
Central Tendency: Mean (con’t)
Alternatively, use the frequencies
Add each value times its number of
occurrences
0*3 + 1*2 + ... (lect2.xls), which is
$$\bar{X} = \frac{\sum_{i} x_i f_i}{N}$$
where the sum runs over the distinct values x_i, f_i is the frequency of x_i, and N is the total number of observations.
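The frequency-weighted calculation can be checked with a short Python sketch. The x and f values are copied from the absentee frequency table; the variable names are made up for this example.

```python
# Absences (x) and number of employees (f) from the absentee frequency table.
absences = list(range(14))
frequency = [3, 2, 5, 8, 7, 2, 13, 2, 4, 0, 1, 2, 0, 1]

n = sum(frequency)                                            # 50 employees
total_days = sum(x * f for x, f in zip(absences, frequency))  # 0*3 + 1*2 + ... = 243
mean_absences = total_days / n                                # 243 / 50 = 4.86

print(n, total_days, mean_absences)
```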
Relative Position of Measures
For Symmetric Distributions
If your population distribution is symmetric
and unimodal (i.e., with a hump in the
middle), then all three measures coincide.
Mode = 69
Median = 69
Mean = 483/7 = 69
[Figure: symmetric histogram of the height data; x-axis: Height (60 to 78), y-axis: Frequency]
Relative Position of Measures
For Asymmetric Distributions
If the data are skewed, however, then the measures of
central tendency will not necessarily line up.
[Figure: skewed histogram of the absences data; x-axis: Absences (0 to 12), y-axis: Frequency]
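As a rough check on the absences data, the sketch below expands the absentee frequency table into one observation per employee and computes all three measures with Python's standard `statistics` module. The variable names are illustrative.

```python
import statistics

# Absences (x) and number of employees (f) from the absentee frequency table.
absences = list(range(14))
frequency = [3, 2, 5, 8, 7, 2, 13, 2, 4, 0, 1, 2, 0, 1]

# One entry per employee, e.g. three 0s, two 1s, five 2s, ...
observations = [x for x, f in zip(absences, frequency) for _ in range(f)]

mean = statistics.mean(observations)      # 4.86
median = statistics.median(observations)  # 4.5 (average of the 25th and 26th values)
mode = statistics.mode(observations)      # 6 (13 employees)

print(mean, median, mode)
```

The three values differ, which is what the slide means by the measures not lining up for skewed data.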
Changing the Mean
Because the calculation of the mean involves
every score in the distribution, changing the
value of any score will change the value of the
mean.
Modifying a distribution by discarding scores or
by adding new scores will usually change the
value of the mean.
To determine how the mean will be affected for
any specific situation you must consider: 1) how
the number of scores is affected, and 2) how the
sum of the scores is affected.
Changing the Mean (cont.)
If a constant value is added to every score
in a distribution, then the same constant
value is added to the mean. Also, if every
score is multiplied by a constant value,
then the mean is also multiplied by the
same constant value.
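A quick numerical check of both rules, using a small set of hypothetical scores (the data are invented for illustration):

```python
import statistics

scores = [2, 4, 6, 8, 10]  # hypothetical scores

base = statistics.mean(scores)                      # 6
shifted = statistics.mean([s + 3 for s in scores])  # add 3 to every score
scaled = statistics.mean([s * 2 for s in scores])   # multiply every score by 2

print(base, shifted, scaled)
```

Here `shifted` equals `base + 3` and `scaled` equals `base * 2`, matching the two statements above.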
When the Mean Won’t Work
Although the mean is the most commonly used
measure of central tendency, there are situations
where the mean does not provide a good,
representative value, and there are situations
where you cannot compute a mean at all.
When a distribution contains a few extreme
scores (or is very skewed), the mean will be
pulled toward the extremes (displaced toward
the tail). In this case, the mean will not provide
a "central" value.
When the Mean Won’t Work
(cont.)
With data from a nominal scale it is
impossible to compute a mean, and when
data are measured on an ordinal scale
(ranks), it is usually inappropriate to
compute a mean.
Thus, the mean does not always work as a
measure of central tendency and it is
necessary to have alternative procedures
available.
The Median
If the scores in a distribution are listed in order
from smallest to largest, the median is defined as
the midpoint of the list.
The median divides the scores so that 50% of
the scores in the distribution have values that
are equal to or less than the median.
Computation of the median requires scores that
can be placed in rank order (smallest to largest)
and are measured on an ordinal, interval, or ratio
scale.
The Median (cont.)
Usually, the median can be found by a
simple counting procedure:
1. With an odd number of scores, list the
values in order, and the median is the
middle score in the list.
2. With an even number of scores, list the
values in order, and the median is half-way
between the middle two scores.
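The counting procedure can be written out directly. This is a minimal sketch; the function name and the sample lists are made up for illustration.

```python
def median(scores):
    """Median by the counting procedure: sort, then take the middle."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd number of scores: the middle score in the list.
        return ordered[mid]
    # Even number of scores: half-way between the middle two scores.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 2]))     # middle of 1, 2, 3
print(median([4, 1, 3, 2]))  # half-way between 2 and 3
```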
The Median (cont.)
If the scores are measurements of a
continuous variable, it is possible to find
the median by first placing the scores in a
frequency distribution histogram with each
score represented by a box in the graph.
Then, draw a vertical line through the
distribution so that exactly half the boxes
are on each side of the line. The median
is defined by the location of the line.
The Median (cont.)
One advantage of the median is that it is
relatively unaffected by extreme scores.
Thus, the median tends to stay in the
"center" of the distribution even when
there are a few extreme scores or when
the distribution is very skewed. In these
situations, the median serves as a good
alternative to the mean.
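To see the difference in sensitivity, the sketch below replaces the largest score in a small hypothetical data set with a more extreme one and recomputes both measures. The numbers are invented for illustration.

```python
import statistics

scores = [1, 3, 5, 7, 9]         # hypothetical scores
with_extreme = [1, 3, 5, 7, 14]  # same scores, largest value made more extreme

mean_before = statistics.mean(scores)          # 5
mean_after = statistics.mean(with_extreme)     # 6, pulled toward the extreme score
median_before = statistics.median(scores)      # 5
median_after = statistics.median(with_extreme) # 5, unchanged

print(mean_before, mean_after, median_before, median_after)
```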
The Mode
The mode is defined as the most frequently
occurring category or score in the distribution.
In a frequency distribution graph, the mode is
the category or score corresponding to the peak
or high point of the distribution.
The mode can be determined for data measured
on any scale of measurement: nominal, ordinal,
interval, or ratio.
The Mode (cont.)
The primary value of the mode is that it is
the only measure of central tendency that
can be used for data measured on a
nominal scale. In addition, the mode often
is used as a supplemental measure of
central tendency that is reported along
with the mean or the median.
Bimodal Distributions
It is possible for a distribution to have more than
one mode. A distribution with two modes is called
bimodal; one with more than two is called multimodal.
(Note that a distribution can have only one mean
and only one median.)
In addition, the term "mode" is often used to
describe a peak in a distribution that is not really
the highest point. Thus, a distribution may have
a major mode at the highest peak and a minor
mode at a secondary peak in a different location.
Central Tendency and the
Shape of the Distribution
Because the mean, the median, and the
mode are all measuring central tendency,
the three measures are often
systematically related to each other.
In a symmetrical distribution, for example,
the mean and median will always be equal.
Central Tendency and the
Shape of the Distribution (cont.)
If a symmetrical distribution has only one
mode, the mode, mean, and median will
all have the same value.
In a skewed distribution, the mode will be
located at the peak on one side and the
mean usually will be displaced toward the
tail on the other side.
The median is usually located between the
mean and the mode.
Reporting Central Tendency in
Research Reports
In manuscripts and in published research
reports, the sample mean is identified with the
letter M.
There is no standardized notation for reporting
the median or the mode.
In research situations where several means are
obtained for different groups or for different
treatment conditions, it is common to present all
of the means in a single graph.
Reporting Central Tendency in
Research Reports (cont.)
The different groups or treatment
conditions are listed along the horizontal
axis and the means are displayed by a bar
or a point above each of the groups.
The height of the bar (or point) indicates
the value of the mean for each group.
Similar graphs are also used to show
several medians in one display.
Which measures for which variables?
Nominal variables: which measure(s) of
central tendency is appropriate?
Only the mode
Ordinal variables: which measure(s) of
central tendency is appropriate?
Mode or median
Interval-ratio variables: which
measure(s) of central tendency is
appropriate?
All three measures
Midrange
•A Measure of Central Tendency
•Average of Smallest and Largest
Observation:
$$\text{Midrange} = \frac{x_{\text{largest}} + x_{\text{smallest}}}{2}$$
•Affected by Extreme Values
[Figure: two number-line plots, each on an axis from 0 to 10: Midrange = 5 in both.]
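A minimal sketch of the midrange, with invented data, showing how a single extreme value moves it:

```python
def midrange(values):
    """Average of the smallest and largest observation."""
    return (max(values) + min(values)) / 2

data = [0, 2, 4, 5, 6, 8, 10]          # hypothetical observations
with_extreme = [0, 2, 4, 5, 6, 8, 40]  # largest value replaced by an extreme one

print(midrange(data))          # (10 + 0) / 2 = 5.0
print(midrange(with_extreme))  # (40 + 0) / 2 = 20.0
```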
Quartiles
Not a Measure of Central Tendency
Split Ordered Data into 4 Quarters
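[Figure: ordered data split into four segments of 25% each by Q1, Q2, and Q3]
Position of i-th Quartile: position of point $Q_i = \frac{i(n+1)}{4}$
Data in Ordered Array: 11 12 13 16 16 17 18 21 22
Position of $Q_1 = \frac{1 \cdot (9 + 1)}{4} = 2.50$, so $Q_1 = 12.5$ (halfway between the 2nd and 3rd values, 12 and 13)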
Midhinge
A Measure of Central Tendency
The Middle point of 1st and 3rd Quarters
$$\text{Midhinge} = \frac{Q_1 + Q_3}{2}$$
Not Affected by Extreme Values
$$\text{Midhinge} = \frac{Q_1 + Q_3}{2} = \frac{12.5 + 19.5}{2} = 16$$
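The quartile and midhinge values for this ordered array can be reproduced in Python. The sketch uses `statistics.quantiles` with `method="exclusive"`, which places the i-th quartile at position i(n+1)/4 and interpolates linearly between neighbouring values. That agrees with the slide when the position ends in .5; for other fractional positions the slide's own rounding rule is not stated, so treat the interpolation as an assumption.

```python
import statistics

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]  # ordered array from the Quartiles slide

# Three cut points: Q1, Q2 (the median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

midhinge = (q1 + q3) / 2

print(q1, q2, q3, midhinge)  # 12.5 16.0 19.5 16.0
```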
The Range
• Measure of Variation
• Difference Between Largest &
Smallest Observations:
$$\text{Range} = x_{\text{largest}} - x_{\text{smallest}}$$
• Ignores How Data Are Distributed:
[Figure: two dot plots on an axis from 7 to 12 with differently distributed points; Range = 12 - 7 = 5 in both]
Interquartile Range
• Measure of Variation
• Also Known as Mid-spread:
Spread in the Middle 50%
• Difference Between Third & First
Quartiles: Interquartile Range = $Q_3 - Q_1$
Data in Ordered Array: 11 12 13 16 16 17 17 18 21
$Q_3 - Q_1 = 17.5 - 12.5 = 5$
• Not Affected by Extreme Values
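A short sketch of the interquartile range for the ordered array on this slide, again using `statistics.quantiles` with the (n+1)/4 position rule and linear interpolation (an assumption about how fractional positions are handled). The second data set, with the largest value made extreme, is invented to illustrate the last bullet.

```python
import statistics

def interquartile_range(values):
    """Q3 - Q1 using the i(n+1)/4 position rule with interpolation."""
    q1, _, q3 = statistics.quantiles(sorted(values), n=4, method="exclusive")
    return q3 - q1

data = [11, 12, 13, 16, 16, 17, 17, 18, 21]           # from the slide
with_extreme = [11, 12, 13, 16, 16, 17, 17, 18, 210]  # hypothetical extreme value

print(interquartile_range(data))          # 17.5 - 12.5 = 5.0
print(interquartile_range(with_extreme))  # still 5.0
```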
Variance
•Important Measure of Variation
•Shows Variation About the Mean:
•For the Population: $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$
•For the Sample: $s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$
For the Population: use N in the denominator.
For the Sample: use n - 1 in the denominator.
Comparing Standard Deviations
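Data: X_i: 10 12 14 15 17 18 18 24
N = 8, Mean = 16
Treated as a sample: $s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{130}{7}} = 4.3095$
Treated as a population: $\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} = \sqrt{\frac{130}{8}} = 4.0311$
The value of the standard deviation is larger when the data are treated as a
sample.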
Comparing Standard Deviations
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = .9258
Data C: Mean = 15.5, s = 4.57
[Figure: three dot plots on a common axis from 11 to 21, one per data set]
Coefficient of Variation
$$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$$
Comparing Coefficient of Variation
Coefficient of Variation:
Stock A: CV = 10%
Stock B: CV = 5%
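The formula can be sketched as a one-line function. The slide gives only the resulting CVs, so the means and standard deviations below are hypothetical values chosen to reproduce 10% and 5%; they are not taken from the source.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

# Hypothetical inputs (assumed for illustration, not from the slide):
stock_a = coefficient_of_variation(std_dev=5, mean=50)   # 10.0
stock_b = coefficient_of_variation(std_dev=5, mean=100)  # 5.0

print(stock_a, stock_b)
```

With these assumed inputs both stocks have the same standard deviation, yet Stock A varies twice as much relative to its mean, which is the kind of comparison the CV is meant for.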
Shape
Describes How Data Are Distributed
Measures of Shape:
Symmetric or skewed
Distribution Shape &
Box-and-Whisker Plots
Unimodal Bimodal
Measures of Deviation (con’t)
Inter-Quartile Range
The difference between the 75th percentile
and the 25th percentile in the data.
[Figure: a unimodal and a bimodal distribution, each with the 25th and 75th percentiles marked]
Measures of Deviation (con’t)
Mean Squared Deviation
The average of the squared differences between
each observation and the mean:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$
This is the variance of a distribution.
Measures of Deviation (con’t)
Variance of a Distribution
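Note that since we are using the square, the
greatest deviations will be weighted more
heavily.
If we use a frequency table, as in the
absenteeism case, then we would write
this as:
$$\sigma^2 = \frac{\sum_{i} f_i (x_i - \bar{X})^2}{N}$$
where the sum runs over the distinct values x_i and f_i is the frequency of x_i.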
Measures of Deviation (con’t)
Standard deviation
The square root of the variance:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$$
Absentee example: the variance is 408.02/50 = 8.16, so
The Standard Deviation = $\sqrt{8.16}$ = 2.86
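The population variance and standard deviation for the absentee example can be recomputed from the frequency table. This is a sketch with illustrative variable names; the x and f values are those in the table.

```python
import math

# Absences (x) and number of employees (f) from the absentee frequency table.
absences = list(range(14))
frequency = [3, 2, 5, 8, 7, 2, 13, 2, 4, 0, 1, 2, 0, 1]

n = sum(frequency)                                                # 50
mean = sum(x * f for x, f in zip(absences, frequency)) / n        # 4.86

# Sum of f * (x - mean)^2, the last column of the table.
sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(absences, frequency))  # about 408.02

variance = sum_sq / n          # about 8.16 (population: divide by N)
std_dev = math.sqrt(variance)  # about 2.86

print(round(sum_sq, 2), round(variance, 2), round(std_dev, 2))
```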
Descriptive Statistics for Samples
Unbiased estimator of the variance, and the corresponding
sample standard deviation:
$$s^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1} \quad \text{and} \quad s = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}}$$
Why N - 1?
We divide by N - 1 when calculating the sample variance
because we already used 1 piece of information to
calculate the mean.
So to make the variance estimate unbiased, subtract one
from the denominator.
Descriptive Statistics for Samples (con’t)
Absentee Example:
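If sampling from a population, we would
calculate:
$$s^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1} \quad \text{and} \quad s = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}}$$
Variance = 408.02/49 = 8.33
Standard Deviation = $\sqrt{8.33}$ = 2.89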
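For comparison, Python's `statistics.variance` and `statistics.stdev` use the N - 1 denominator, so they can serve as a check on the sample calculation. The sketch expands the absentee frequency table into individual observations; the variable names are illustrative.

```python
import statistics

# Absences (x) and number of employees (f) from the absentee frequency table.
absences = list(range(14))
frequency = [3, 2, 5, 8, 7, 2, 13, 2, 4, 0, 1, 2, 0, 1]
observations = [x for x, f in zip(absences, frequency) for _ in range(f)]

sample_variance = statistics.variance(observations)  # 408.02 / 49, about 8.33
sample_sd = statistics.stdev(observations)           # square root of the above

# Population versions (divide by N) for comparison.
population_variance = statistics.pvariance(observations)  # 408.02 / 50, about 8.16

print(round(sample_variance, 2), round(sample_sd, 3), round(population_variance, 2))
```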
Descriptive Statistics for Samples (con’t)
Arithmetic Mean
Sum of the deviations from the mean is equal to zero
Geometric Mean
GEOMEAN uses multiplication instead of addition
Moving Mean
More accurate…good for unique distributions
Weighted Mean
Accounts for the frequency of a score’s occurrence
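Two of these means can be sketched in Python. The data are hypothetical, and `statistics.geometric_mean` is used here as a stand-in for the spreadsheet GEOMEAN function mentioned above (it requires Python 3.8 or later).

```python
import statistics

# Geometric mean: multiply the n values and take the n-th root.
growth_factors = [1.10, 1.20, 0.90]  # hypothetical yearly growth factors
geometric = statistics.geometric_mean(growth_factors)

# Weighted mean: each score counts as many times as it occurs.
scores = [70, 80, 90]   # hypothetical scores
counts = [1, 2, 1]      # how often each score occurs
weighted = sum(s * c for s, c in zip(scores, counts)) / sum(counts)  # 320 / 4 = 80.0

# Arithmetic mean: deviations from it sum to zero.
values = [2, 4, 6, 8, 10]
arithmetic = statistics.mean(values)
deviation_sum = sum(v - arithmetic for v in values)  # 0

print(geometric, weighted, deviation_sum)
```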
Weighted Mean Example
=MODE(A2:A20)
Descriptive Statistics Toolpak
DESCRIPTIVE STATISTICS
Why Variability Is Important
• Variability is how different the scores are
from one particular score
• Spread
• Dispersion
• What is the score of interest here?
• The MEAN!!
So…variability is really a measure of how
each score in a group of scores differs
from the mean of that set of scores.
Measures of Variability
• Three types of variability examine the amount
of spread or dispersion in a group of scores
Range
Standard Deviation
Variance
• Typically report the average and the
variability together to describe a distribution
Computing the Range
• Range is the most general estimate of
variability
• Two types:
Exclusive Range
• R = h - l (h = highest score, l = lowest score)
Inclusive Range
• R = h - l + 1
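Both versions of the range can be written as small functions. The scores below are hypothetical, and h and l are taken to be the highest and lowest scores.

```python
def exclusive_range(scores):
    """R = h - l, with h the highest and l the lowest score."""
    return max(scores) - min(scores)

def inclusive_range(scores):
    """R = h - l + 1, counting both end points."""
    return max(scores) - min(scores) + 1

scores = [7, 8, 9, 10, 11, 12]  # hypothetical scores

print(exclusive_range(scores))  # 12 - 7 = 5
print(inclusive_range(scores))  # 12 - 7 + 1 = 6
```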
Computing Standard
Deviation
• Standard deviation (SD) is the most
frequently reported measure of variability
Statistics
- Descriptive Statistics
- Histograms
- Hypothesis Testing
- Scatter Plots
- Regression Analysis
To Set Up Statistical Package
Click File Tab, and Then Click Options.
Click Add-ins. In View and Manage Box,
Select Analysis ToolPak.
Click Go.
In the Add-Ins Available Box, Select Analysis
ToolPak Check Box and Click OK. (If ToolPak
Is Not Listed, Click Browse to Locate It.)
Using Excel:
Descriptive Statistics
Click Data/Data Analysis (Far Right) /Descriptive
Statistics & OK.
Put Checkmarks on Summary Statistics, 95% or
99% Confidence Interval, & Labels in First Row
Boxes.
Move Cursor to Input Range Window, Highlight
Data to Analyze including Labels, & Click OK.
Your Data will Appear on New Worksheet.
Widen Columns by Clicking
Home/Format/AutoFit Column Width.
Using Excel:
Constructing Histograms
Click Data/Data Analysis/Histogram & OK.
Put Checkmarks on Chart Output & New
Worksheet Boxes.
Move Cursor to Input Range Window, Highlight
Data Going into Histogram.
Move Cursor to Input Bin Range, Highlight Data
Showing Upper Value of Each Bin & Click OK.
Histogram will be on New Worksheet. You May
Lengthen it by Clicking Blank Space in Window,
Moving Cursor to Window Bottom Line & Holding
Down Mouse Button as You Pull Down Window.
Using Excel:
Hypothesis Testing
Go to Sheet One.
Click Data/Data Analysis/ and the Appropriate
Statistical Test. Then Click OK.
On New Window Check Labels Box and Put
Cursor on Variable 1 Range.
Highlight Variable 1 Data Including Label.
Put Cursor on Variable 2 Range & Highlight
Variable 2 Data (Including Label). Then Click
OK.
Click Home/Format/AutoFit Column Width.
Using Excel:
Scatter Plots
Go to Sheet One.
Highlight Data (Be Sure X Values are in Left
Column and Y Values are in Right Column).
Click Insert/Scatter. Pull down menu and click
Upper Left Icon.
Click a Datum Point on Chart with Right Mouse
Key, Add Trendline, & Click Linear.
Using Excel:
Regression Analysis
Go to Sheet One.
Click Data/Data Analysis (On Far Right)
/Regression & Click OK.
On New Window Check Labels Box and Put
Cursor on X Range.
Highlight X Data Including Label.
Put Cursor on Y Range & Highlight Y Data
(Including Label), Then Click OK.
Click Home/Format/AutoFit Column Width.