Classification of Data: Objectives: Understand How Data Are Classified. Recognize The Different Types of Data
Objectives:
1. Understand how data are classified.
2. Recognize the different types of data.
VARIABLES AND DATA
What is a variable?
A variable is any one of the measures which
comprise the database in a study.
or:
A variable is a single parameter in a research
database.
THEREFORE: RESEARCH DATABASES ARE
COMPOSED OF DATA SETS OF
VARIABLES
PROPERTIES OF VARIABLES
RELATIONAL DESCRIPTION
How variables relate to each other
FUNCTIONAL DESCRIPTION
RELATIONAL DESCRIPTION OF
VARIABLES
A. INDEPENDENT. Those that you randomize, control, or manipulate
Multiple categories
Mild, moderate, severe
Phenotypic data
Food preferences
Some Examples:
ORDINAL DATA
STUDY
SAMPLE DATA
(study groups)
[Figure: two histogram panels; left y-axis: Number (0 to 7), right y-axis: 0.000 to 0.075; x-axis: Bin Center, 0 to 20]
PROPERTIES OF A DISTRIBUTION I
Differential vs. cumulative. In the differential form, each bin contains
the value for that bin only; in the cumulative form, each bin is the sum
of that bin and all previous bins.
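The relationship between the two forms can be sketched in a few lines of Python. The bin values below are hypothetical, chosen only so that they sum to 1; they are not the data plotted on the slide.

```python
from itertools import accumulate

# Hypothetical relative frequencies for five bins (illustrative values only).
differential = [0.10, 0.20, 0.40, 0.20, 0.10]

# Cumulative form: each bin is the running total up to and including that bin.
cumulative = list(accumulate(differential))

for i, (d, c) in enumerate(zip(differential, cumulative)):
    print(f"bin {i}: differential={d:.2f}  cumulative={c:.2f}")
```

The last cumulative bin always equals the total of the differential bins, which is why a cumulative relative-frequency plot ends at 1.00.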
[Figure: Histogram of Data 1, frequency distribution, in two panels; left y-axis: 0.000 to 0.100, right y-axis: 0.00 to 1.00; x-axis: Bin Center, 0 to 20]
PROPERTIES OF A DISTRIBUTION II
SHAPE
Tails: values far from central tendencies
Skewness: asymmetry caused by “uneven” distribution
Kurtosis: “flatness” or “peakedness” of curve
MEASURES OF CENTRAL TENDENCY
Mode: value representing greatest number of counts
Median: 50% of values above; 50% below
Mean: mathematical average, i.e., $\sum x_n / n$
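As a rough illustration of the three measures of central tendency, the sketch below uses Python's standard `statistics` module on a small, made-up, right-skewed sample. The values are assumptions for illustration only.

```python
import statistics

# Hypothetical right-skewed sample (illustrative values only).
values = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

mode = statistics.mode(values)      # most frequent value
median = statistics.median(values)  # half the values lie below, half above
mean = statistics.mean(values)      # sum of the values divided by n

print("mode:", mode, "median:", median, "mean:", mean)
```

In this sample the long right tail pulls the mean above the median, and the median above the mode, which is the usual pattern for a right-skewed distribution.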
THE NORMAL DISTRIBUTION
1. Two-tailed, bell-shaped, mathematically described
as “Gaussian”
2. Median, mode and mean are identical
SOME EXAMPLES
Plotting Distributions; Line Graph
[Figure: line graph of a distribution; y-axis: Percentage, 0.000 to 0.275; x-axis: Bin Center, 0.0 to 27.5]
Plotting Distributions; Bar Graph
[Figure: bar graph of a distribution; y-axis: Percentage, 0 to 18; x-axis: Bin Center, 0 to 26]
“Thinking on your feet”
If you do a logarithmic transformation of a typically skewed
distribution of a biological parameter from a population,
should it be base 10 log or natural log?
Would it make a difference, and why (or why not)?
It doesn’t matter which one. The two logarithms differ only by a
constant factor (ln x is about 2.303 times log10 x), so the transformed
distribution has the same shape either way and will still end up
approximately normal; only the numerical values of the transformed
parameter differ with the base you use.
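A quick numerical check of why the base does not matter, using made-up measurements (the values are assumptions, not data from the lecture):

```python
import math

# Hypothetical skewed measurements (illustrative values only).
values = [1.2, 3.5, 8.0, 20.0, 150.0]

log10_values = [math.log10(v) for v in values]
ln_values = [math.log(v) for v in values]

# Every natural-log value is the base-10 value times the same constant, ln(10).
ratios = [ln_v / log10_v for ln_v, log10_v in zip(ln_values, log10_values)]
print(ratios)
print(math.log(10))
```

Because every point is rescaled by the same factor, the shape of the transformed distribution is unchanged; only the axis units differ.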
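*Can you calculate percentiles without graphing your
data? How?
Yes. Sort the values and use a ratio (proportion) to find the rank
that corresponds to the percentile you want; for example, the median
is the value halfway through the sorted list.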
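One way to do this without a graph is the nearest-rank method, sketched below. The function name `percentile_nearest_rank` and the scores are hypothetical, and other conventions (for example, interpolating between neighboring values) give slightly different answers.

```python
import math

def percentile_nearest_rank(data, p):
    """Return the p-th percentile (0 < p <= 100) by the nearest-rank method."""
    ordered = sorted(data)
    # Proportion p/100 of the n values, rounded up to a 1-based rank.
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical scores (illustrative values only).
scores = [15, 20, 35, 40, 50]
print(percentile_nearest_rank(scores, 50))
```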
QUANTIFYING DISPERSION
Objectives:
1. Recognize why dispersion is described in quantitative terms.
2. Learn the various ways in which dispersion can be quantified.
3. Understand the purpose of each of these terms.
The RANGE
Simplest description of variability
Gives no indication of the manner in which
a variable is distributed
Can be misleading, e.g., the statement
“values of x ranged from 0.1 to 10.1”
describes both of these very different data sets:
0.1, 0.1, 0.2, 0.2, 0.4, 0.5, 10.1
0.1, 9.8, 8.9, 10.0, 10.0, 9.5, 10.1
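The point can be checked directly with the two data sets above. This is a minimal sketch; the names `set_a` and `set_b` are labels introduced here for illustration.

```python
import statistics

set_a = [0.1, 0.1, 0.2, 0.2, 0.4, 0.5, 10.1]
set_b = [0.1, 9.8, 8.9, 10.0, 10.0, 9.5, 10.1]

for name, data in (("A", set_a), ("B", set_b)):
    data_range = max(data) - min(data)
    print(name, "range:", round(data_range, 1), "median:", statistics.median(data))
```

Both sets have the same range, yet their medians are far apart, so the range alone says nothing about where most of the values lie.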
The VARIANCE
Actually gives an indication of the distribution
because it depends upon how far the point
estimates are from the mean.
The variance depends on:
the difference between each point estimate and the mean
(how far from the mean is each point?), and
the number of points (intuitively, more points should
result in a better estimate of the degree of dispersion).
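The VARIANCE
The difference between each point and the
mean, summed:
$(x_1-\bar{x}) + (x_2-\bar{x}) + (x_3-\bar{x}) + \dots + (x_n-\bar{x})$
or: $\sum (x-\bar{x})$
These differences always sum to zero, so each is squared before summing: $\sum (x-\bar{x})^2$
divided by $n$ or $n-1$:
$S = \dfrac{\sum (x-\bar{x})^2}{n-1}$
The standard deviation is the square root of the variance:
$s = S^{1/2}$
or
$s = \left(\dfrac{\sum (x-\bar{x})^2}{n-1}\right)^{1/2}$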
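The calculation can be followed step by step in Python. This is a sketch using the n-1 form of the formula and the first data set from the range example; the variable names are chosen here for illustration.

```python
import math
import statistics

data = [0.1, 0.1, 0.2, 0.2, 0.4, 0.5, 10.1]

n = len(data)
mean = sum(data) / n

# Differences between each point and the mean; these sum to (nearly) zero.
deviations = [x - mean for x in data]

# Square each difference, sum, and divide by n - 1 to get the variance.
variance = sum(d ** 2 for d in deviations) / (n - 1)

# The standard deviation is the square root of the variance.
sd = math.sqrt(variance)

print("sum of deviations:", round(sum(deviations), 10))
print("variance:", variance, "SD:", sd)
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n-1 divisor, so they can be used to check the hand calculation.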
POPULATION vs. SAMPLE PARAMETERS
$\mu$ vs. $\bar{x}$ (mean)
$\sigma$ vs. $s$ or SD (standard deviation)
WHAT’S A STANDARD DEVIATION?
It’s the distance on either side of the mean within which 68.27%
of all point estimates of a variable from a
normally distributed sample will be found.
±2 SD will include 95.45% of all points
±3 SD will include 99.73% of all points
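These percentages follow from the normal curve itself and can be reproduced with the error function in Python's `math` module. The helper name `fraction_within` is introduced here for illustration.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within k SD of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * fraction_within(k):.2f}%")
```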
THE CENTRAL LIMIT THEOREM
States that:
the means of a large number of samples gathered from a given
population will be approximately normally distributed,
the mean of the means of all these samples will
approximate the population mean, and
the standard deviation of the means of these samples will
approximate the population standard deviation divided by the
square root of the sample size.
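A small simulation can make the theorem concrete. The sketch below assumes a deliberately non-normal population (exponential, with mean 1 and standard deviation 1); the sample size, number of samples, and seed are arbitrary choices for illustration. In the standard statement of the theorem, the spread of the sample means is the population standard deviation divided by the square root of the sample size.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30    # points per sample (assumed)
NUM_SAMPLES = 5000  # number of samples drawn (assumed)

# Assumed population: exponential with mean 1 and SD 1 (clearly not normal).
sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [random.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.mean(sample))

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print("mean of means:", mean_of_means)            # close to the population mean, 1
print("SD of means:", sd_of_means)                # close to 1 / sqrt(30)
print("sigma / sqrt(n):", 1 / math.sqrt(SAMPLE_SIZE))
```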
THE “STANDARD ERROR”
One implication of the CL theorem is that we
can learn about dispersion in the population
from which our sample derives if we can
estimate the standard deviation of the sample means.
We call this estimate the standard error of the
mean, abbreviated SEM*
SEM is calculated (very simply) by:
$\mathrm{SEM} = \mathrm{SD}/\sqrt{n}$
* more strictly: $s_{\bar{x}}$
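A minimal sketch of the SEM calculation, using a made-up sample (the values and the function name `sem` are assumptions for illustration):

```python
import math
import statistics

def sem(data):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return statistics.stdev(data) / math.sqrt(len(data))

# Hypothetical sample (illustrative values only).
sample = [4.1, 5.0, 5.6, 4.8, 5.3, 4.7, 5.2, 4.9]

print("SD: ", statistics.stdev(sample))
print("SEM:", sem(sample))
```

Because of the division by the square root of n, the SEM is always smaller than the SD for any sample with more than one point, and it shrinks as the sample grows.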
CONFIDENCE INTERVALS (CI)
Thus the SEM tells us something about our
population.
We can use this information to calculate
confidence intervals, which give the range of
values likely to contain the mean of the
population of interest.
CONFIDENCE INTERVALS (CI)
Answers the question: Within what range of values can I be X%
confident that the population mean lies? (Where X is the
confidence level of the CI.)
Can be calculated for different levels of likelihood, e.g., 90%,
95%, 99% CI (usually 95%).
So a 95% CI of 5 to 15 about a mean value of 10 means that
I’m 95% confident that the population mean falls in the
range 5 to 15.
CIs can be estimated from the simple equation:
lower end of 95% CI $= \bar{x} - (1.96 \times \mathrm{SEM})$
upper end of 95% CI $= \bar{x} + (1.96 \times \mathrm{SEM})$
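The two equations can be applied as follows. This is a sketch with a made-up sample and a hypothetical function name, `ci95`. The multiplier 1.96 comes from the normal distribution; for small samples a larger multiplier from the t distribution is commonly used instead.

```python
import math
import statistics

def ci95(data):
    """Approximate 95% CI for the mean: mean -/+ 1.96 * SEM."""
    mean = statistics.mean(data)
    sem = statistics.stdev(data) / math.sqrt(len(data))
    half_width = 1.96 * sem
    return mean - half_width, mean + half_width

# Hypothetical sample (illustrative values only).
sample = [4.1, 5.0, 5.6, 4.8, 5.3, 4.7, 5.2, 4.9]

lower, upper = ci95(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% CI = {lower:.2f} to {upper:.2f}")
```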