SPH 2 Lecture - 1 Introduction and Data
1
Introduction
• What is statistics?
• Statistics: A field of study concerned with:
– collection, organization, analysis, summarization and
interpretation of numerical data, &
– the drawing of inferences about a body of data when only a
small part of the data is observed.
2
Biostatistics: The application of statistical methods to
the fields of biological and medical sciences.
3
• The numbers must be presented in such a way that
valid interpretations are possible
4
Importance of Biostatistics
• Resource allocation (3M+IT)
• Magnitude of association
5
Importance of Biostatistics
• Assessing risk factors for the occurrence of disease
– Cause & effect relationship
6
What does biostatistics cover?
Research Planning
Presentation
Interpretation
Publication
7
Research Design
– Sampling technique
– Inclusion/exclusion criteria
– Study design
– Etc
8
Analysis
9
Interpretation
10
Types of Statistics
Descriptive statistics:
11
Types of Statistics……
Inferential statistics:
• Methods used for drawing conclusions about a population
based on the information obtained from a sample of
observations drawn from that population
» Principles of probability
» Estimation
» confidence interval
» comparison of two or more means or proportions
» hypothesis testing, etc.
12
DATA
13
Objectives
At the end of this session, the students will be able to:
14
• Data are numbers which can be measurements
or can be obtained by counting
• The raw material for statistics
• Can be obtained from:
– Routinely kept records, literature
– Surveys
– Counting
– Experiments
– Reports
– Observation
– Etc.
15
Types of Data
16
Variable
Variable: A characteristic which takes different
values in different persons, places, or things.
• Any aspect of an individual or object that is
measured (e.g., BP) or recorded (e.g., age, sex)
and takes any value.
• There may be one variable in a study or many.
• E.g., A study of treatment outcome of TB
17
• Variables can be broadly classified into:
– Categorical (or qualitative), or
– Quantitative (or numerical)
18
• Categorical variable: A variable or characteristic
which cannot be measured in quantitative form
but can only be sorted by name or category
19
• Quantitative variable: A variable that can be
measured (or counted) and expressed
numerically.
20
Quantitative variable is divided into two:
1. Discrete: It can only have a limited number of discrete
values (usually whole numbers).
– E.g., the number of episodes of diarrhoea a child has had in a
year. A child can’t have 12.5 episodes of diarrhoea.
• Characterized by gaps or interruptions in the values
(integers).
• Both the order and magnitude of the values matter.
• The values aren’t just labels, but are actual measurable
quantities.
21
2. Continuous variable: It can have an infinite
number of possible values in any given interval.
• Both the magnitude and the order of the values
matter
• Does not possess the gaps or interruptions
• Weight is continuous since it can take on any
value within a range (e.g., 34.575 kg).
22
SUMMARY
Variable
• Types of variables: qualitative (or categorical) and quantitative (measurement)
• Measurement scales
23
Scales of measurement
24
1. Nominal scale:
• The simplest type of data, in which the values fall
into unordered categories or classes
• Consists of “naming” observations or classifying
them into various mutually exclusive and
collectively exhaustive categories
• Uses names, labels, or symbols to assign each
measurement.
– Examples: Blood type, sex, race, marital status, etc.
25
Example of nominal scale:
Race/Ethnicity:
1. Black
2. White
3. Latino
4. Other
• The numbers have NO meaning
• They are labels only
26
• If nominal data can take on only two possible
values, they are called dichotomous or binary.
• So sex is not just nominal, it is dichotomous
(male or female).
• Yes/no questions
– E.g., cured from TB at 6 months of Rx
27
2. Ordinal scale:
• Assigns each measurement to one of a limited
number of categories that are ranked in terms of
order.
• Although non-numerical, can be considered to have
a natural ordering
– Examples: Patient status, cancer stages,
social class, etc.
28
Example of ordinal scale:
29
3. Interval scale:
30
4. Ratio scale:
[Figure: degree of precision in measuring across the measurement scales, up to Ratio]
Data collection
33
Quantitative data collection methods
34
• Quantitative research is concerned with testing
hypotheses derived from theory and/or being able
to estimate the size of a phenomenon of interest.
35
Typical quantitative data gathering strategies include:
1. Experiments/clinical trials.
36
1. Interviews
38
C. Questionnaires
39
D. Telephone interviews: These are less time-consuming
and less expensive, and the researcher has ready
access to anyone who has a telephone.
40
E. Web-based questionnaires:
A new and growing methodology is the use of
Internet-based research.
Typically, the respondent receives an e-mail with a link
to a secure website where the questionnaire is filled in.
This type of research is often quicker and less detailed.
A disadvantage of this method is the exclusion
of people who do not have a computer or are unable to
access one.
41
• Questionnaires often make use of checklists and rating scales.
These devices help simplify and quantify people's behaviors
and attitudes.
• A checklist is a list of behaviors, characteristics, or other
entities that the researcher is looking for. Either the researcher
or the survey participant simply checks whether or not each item
on the list is observed, present or true.
• A rating scale is more useful when a behavior needs to be
evaluated on a continuum.
• Rating scales are also known as Likert scales.
42
Qualitative Data Collection Method:
Observations
In-depth interviews
Focus group discussions
44
1. Observation
Observation is a technique which involves
systematically selecting, watching, and
recording behaviors and characteristics of
living things, objects, or phenomena.
Observational techniques are methods by
which an individual or individuals gather
first-hand data on the behaviors being studied.
45
Observational cont…
46
Observational cont…
49
Advantages:
• Provide direct information about behaviors of
individuals or groups
51
2. In-depth interviews
53
Cont…
54
Advantages:
• Permit face-to-face contact with respondents
58
Cont…
As a rule, the focus group session should
not last longer than 1 1/2 to 2 hours.
The participants are usually a relatively
homogeneous group of people, so respondents’
social class, level of expertise, age, cultural
background, and sex should always be considered.
Although focus groups and in-depth
interviews share many characteristics, they
should not be used interchangeably.
59
Methods of Data Organization and
Presentation
Tables
Graphs
Numerical summaries
60
Frequency Distributions (Tables)
1. Ordered array: A simple arrangement of individual observations in
order of magnitude.
• Very difficult with a large sample size
12 19 27 36 42 59
15 22 31 39 43 61
17 23 31 41 44 65
18 26 34 41 54 67
61
2. Frequency distribution: A table which has a
list of each of the possible values that the data
can assume along with the number of times
each value occurs.
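As a rough illustration, both the ordered array and the frequency distribution can be produced with Python's standard library. The observations are the 24 values from the ordered-array example, deliberately shuffled here; the variable names are illustrative only.

```python
from collections import Counter

# The 24 observations from the ordered-array example, in arbitrary order.
observations = [42, 12, 31, 65, 19, 41, 27, 15, 36, 54, 22, 61,
                17, 39, 23, 43, 31, 59, 18, 44, 26, 41, 34, 67]

# 1. Ordered array: the observations arranged in order of magnitude.
ordered_array = sorted(observations)

# 2. Frequency distribution: each distinct value with the number of
#    times it occurs.
frequency = Counter(ordered_array)

for value in sorted(frequency):
    print(value, frequency[value])
```

With ungrouped values like these, most frequencies are 1, which is why larger data sets are usually grouped into class intervals first.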
62
Simple Frequency Distribution
• Primary and secondary cases of syphilis morbidity
by age, 1989
Age group (years)   Number of cases   Percent
0-14                      230            0.5
15-19                    4378            9.9
20-24                   10405           23.6
25-29                    9610           21.8
30-34                    8648           19.6
35-44                    6901           15.7
45-54                    2631            6.0
≥55                      1278            2.9
Total                   44081          100
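The percent column of such a table is each count divided by the total, times 100. A minimal sketch using the counts from the syphilis table follows; the dictionary keys and variable names are illustrative, and the last key simply stands for the oldest age group in the table.

```python
# Counts from the syphilis table (primary and secondary cases by age, 1989).
cases = {
    "0-14": 230,
    "15-19": 4378,
    "20-24": 10405,
    "25-29": 9610,
    "30-34": 8648,
    "35-44": 6901,
    "45-54": 2631,
    "55+": 1278,  # last row of the table (the group above 45-54)
}

total = sum(cases.values())

# Relative frequency of each age group, as a percentage to one decimal.
percent = {age: round(100 * n / total, 1) for age, n in cases.items()}

for age, n in cases.items():
    print(f"{age:>6} {n:>6} {percent[age]:>5}")
print(f"{'Total':>6} {total:>6}")
```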
63
Tables can also be used to present three or
more variables.
Variable Frequency (n) Percent
Sex
Male
Female
Age (yrs)
15-19
20-24
25-29
Religion
Christian
Muslim
Occupation
Student
Farmer
Merchant
64
Guidelines for constructing tables
• Keep them simple,
• Limit the number of variables to three or fewer,
• All tables should be self-explanatory,
• Include a clear title telling what, when and where,
• Clearly label the rows and columns,
• State clearly the unit of measurement used,
• Explain codes and abbreviations in a footnote,
• Show totals,
• If the data are not original, indicate the source in a footnote.
65
Diagrammatic Representation
66
Importance of diagrammatic representation:
68
Specific types of graphs include:
For nominal and ordinal data:
• Bar graph
• Pie chart
For quantitative data:
• Histogram
• Stem-and-leaf plot
• Box plot
• Scatter plot
• Line graph
• Others
69
1. Bar charts (or graphs)
• Categories are listed on the horizontal axis (X-axis)
• Frequencies or relative frequencies are
represented on the Y-axis (ordinate)
• The height of each bar is proportional to the
frequency or relative frequency of observations
in that category
70
Bar chart for the type of ICU for 25 patients
71
Method of constructing bar chart
• All the bars must have equal width
• The bars are not joined together (leave space
between bars)
• The different bars should be separated by
equal distances
• All the bars should rest on the same line
called the base
• Label both axes clearly
72
Example: Construct a bar chart for the following data.
73
74
2. Sub-divided bar chart
• If there are different quantities forming the
sub-divisions of the totals, simple bars may
be sub-divided in the ratio of the various
sub-divisions to exhibit the relationship of
the parts to the whole.
• The order in which the components are
shown in a “bar” is followed in all bars used
in the diagram.
– Example: Stacked and 100% Component bar
charts
75
Example: Plasmodium species distribution for
confirmed malaria cases, Zeway, 2003
76
3. Multiple bar graph
• Bar charts can be used to represent the
relationships among more than two variables.
• The following figure shows the relationship
between children’s reports of breathlessness
and cigarette smoking by themselves and
their parents.
77
We can see from the graph quickly that the prevalence of the symptoms
increases both with the child’s smoking and with that of their parents.
78
4. Pie chart
• Shows the relative frequency for each category by
dividing a circle into sectors, the angles of which
are proportional to the relative frequency.
• Used for a single categorical variable
• Use percentage distributions
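The sector angle for each category is its relative frequency multiplied by 360 degrees. The sketch below shows that calculation; the blood-type counts are hypothetical and the function name `pie_angles` is made up for this example.

```python
def pie_angles(counts):
    """Return the sector angle (in degrees) for each category.

    Each angle is proportional to the category's relative frequency:
    angle = 360 * count / total.
    """
    total = sum(counts.values())
    return {category: 360 * n / total for category, n in counts.items()}


# Hypothetical frequency table for a single categorical variable.
blood_type_counts = {"O": 20, "A": 10, "B": 6, "AB": 4}

angles = pie_angles(blood_type_counts)
for category, angle in angles.items():
    print(f"{category}: {angle:.0f} degrees")
```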
79
Steps to construct a pie-chart
• Construct a frequency table
80
Example: Distribution of deaths for females, in England
and Wales, 1989.
81
82
5. Histogram
• Histograms are frequency distributions with
continuous class intervals that have been turned into
graphs.
• To construct a histogram, we draw the interval
boundaries on a horizontal line and the frequencies on
a vertical line.
• Non-overlapping intervals that cover all of the data
values must be used.
83
• Bars are drawn over the intervals in such a way
that the area of each bar is proportional to
the frequency of its interval.
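When class intervals have unequal widths, keeping area proportional to frequency means the bar height must be the frequency divided by the interval width. The sketch below applies this to the closed age classes of the syphilis table. Treating age as continuous, so that the class 15-19 runs from 15 up to 20, is an assumption made here, and the open-ended oldest group is left out because it has no upper boundary.

```python
# (lower boundary, upper boundary, frequency) for each closed age class.
# Boundaries assume age is continuous: "15-19" covers 15 up to 20, etc.
classes = [
    (0, 15, 230),
    (15, 20, 4378),
    (20, 25, 10405),
    (25, 30, 9610),
    (30, 35, 8648),
    (35, 45, 6901),
    (45, 55, 2631),
]

# Bar height = frequency per year of age, so that
# area = height * width equals the class frequency.
heights = [count / (upper - lower) for lower, upper, count in classes]

for (lower, upper, count), height in zip(classes, heights):
    print(f"{lower:>2}-{upper:<2} width={upper - lower:>2} "
          f"frequency={count:>5} height={height:.1f}")
```

Plotting the raw frequencies instead would make the wide 35-44 class look more important than it is relative to the 5-year classes.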
84
Example: Distribution of the age of women at the time of marriage
85
Histogram for the ages of 2087 mothers with <5 children,
Adami Tulu, 2003
86
Two problems with histograms
87
6. Stem-and-Leaf Plot
• A quick way to organize data to give visual impression
similar to a histogram while retaining much more detail
on the data.
• Similar to histogram and serves the same purpose and
reveals the presence or absence of symmetry
• Are most effective with relatively small data sets
• Are not suitable for reports and other communications,
but help researchers to understand the nature of their data
88
Example
• 43, 28, 34, 61, 77, 82, 22, 47, 49, 51, 29, 36,
66, 72, 41
2 | 2 8 9
3 | 4 6
4 | 1 3 7 9
5 | 1
6 | 1 6
7 | 2 7
8 | 2
89
Steps to construct Stem-and-Leaf Plots
90
Steps to construct Stem-and-Leaf Plots
3. Write the second stem (first stem +1) below the first
stem
4. Continue with the remaining stems until you reach the
largest stem in the data set
5. Draw a vertical bar to the right of the column of stems
6. For each number in the data set, find the appropriate
stem and write the leaf to the right of the vertical bar
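The construction steps can be sketched in a few lines of Python. This is a minimal version that assumes non-negative whole numbers, with the tens as the stem and the units digit as the leaf; the function name `stem_and_leaf` is made up for this example. It is applied to the 15 values from the earlier example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot as strings.

    Assumes non-negative integers; stem = value // 10, leaf = value % 10.
    """
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)

    rows = []
    # List every stem from the smallest to the largest, even empty ones.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows


data = [43, 28, 34, 61, 77, 82, 22, 47, 49, 51, 29, 36, 66, 72, 41]
plot = stem_and_leaf(data)
for row in plot:
    print(row)
```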
91
Example: 3031, 3101, 3265, 3260, 3245, 3200, 3248,
3323, 3314, 3484, 3541, 3649 (BWT in g)
93
Frequency polygon for the ages of 2087 mothers with <5 children,
Adami Tulu, 2003
94
It can also be drawn without erecting rectangles, by joining the midpoints
of the tops of the intervals, which represent the frequencies of the
classes, as follows:
95
8. Ogive Curve (The Cumulative Frequency
Polygon)
• Sometimes it may be necessary to know the number of
items whose values are more or less than a certain
amount.
• We may, for example, be interested to know the number of
patients whose weight is <50 kg or >60 kg.
• To get this information it is necessary to change the form
of the frequency distribution from a ‘simple’ to a
‘cumulative’ distribution.
• An ogive curve turns a cumulative frequency distribution
into a graph.
• Ogives are much more common than frequency polygons
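Changing a simple frequency distribution into a cumulative one is a running total. As a sketch, the counts from the syphilis table earlier in these slides are accumulated below; the variable names are illustrative.

```python
from itertools import accumulate

# Simple frequencies by age group, youngest to oldest (syphilis table).
counts = [230, 4378, 10405, 9610, 8648, 6901, 2631, 1278]

# Cumulative frequency: number of cases in this group or any younger one.
cumulative = list(accumulate(counts))

# Cumulative relative frequency: the same totals as a share of all cases.
total = cumulative[-1]
cumulative_relative = [c / total for c in cumulative]

for c, r in zip(cumulative, cumulative_relative):
    print(f"{c:>6} {100 * r:6.1f}%")
```

An ogive plots each cumulative value against the upper boundary of its class and joins the points with straight lines.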
96
Cumulative Frequency and Cum. Rel. Freq. of Age
of 25 ICU Patients
98
99
9. Box and Whisker Plot
• It is another way to display information when
the objective is to illustrate certain locations
and the skewness of the distribution.
• Can be used to display a set of discrete or
continuous observations using a single vertical
axis – only certain summaries of the data are
shown
• First the percentiles (or quartiles) of the data
set must be defined
100
• A box is drawn with the top of the box at the
third quartile (75%) and the bottom at the
first quartile (25%).
• The location of the mid-point (50%) of the
distribution is indicated with a horizontal line
in the box.
• Finally, straight lines, or whiskers, are drawn
from the centre of the top of the box to the
largest observation and from the centre of the
bottom of the box to the smallest observation.
101
• Position of a percentile = p(n+1)th ordered observation,
where p = the required percentile as a proportion
• Arrange the numbers in ascending order
A. 1st quartile = 0.25 (n+1)th
B. 2nd quartile = 0.5 (n+1)th
C. 3rd quartile = 0.75 (n+1)th
D. 20th percentile = 0.2 (n+1)th
E. 15th percentile = 0.15 (n+1)th
102
• The pth percentile is a value that is greater than or equal to p% of
the observations and less than or equal to the remaining (100 − p)%.
• The pth percentile is:
103
• Given a sample of size n = 60, find the 10th
percentile of the data set.
p(n+1) = 0.10(60+1) = 6.1
so the 10th percentile = average of the 6th and 7th ordered observations
– 10% of the observations are less than or equal to this
value and 90% of them are greater than or equal to it
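The p(n+1) rule can be sketched as below. The function names are made up for this example. When the position is not a whole number, the sketch follows the convention used in the worked example and averages the two neighbouring ordered observations; linear interpolation between them is another common convention and gives slightly different values.

```python
import math


def percentile_position(p, n):
    """Position of the percentile in the ordered data: p(n+1)."""
    return p * (n + 1)


def percentile(data, p):
    """Percentile by the p(n+1) rule, with p given as a proportion.

    A whole-number position returns that ordered observation; otherwise
    the two neighbouring observations are averaged.
    """
    if not data:
        raise ValueError("data must not be empty")
    ordered = sorted(data)
    n = len(ordered)
    # Rounding guards against floating-point noise such as 6.1000000000000005.
    position = round(percentile_position(p, n), 9)
    if position <= 1:
        return ordered[0]
    if position >= n:
        return ordered[-1]
    lower = math.floor(position)
    if position == lower:
        return ordered[lower - 1]
    return (ordered[lower - 1] + ordered[lower]) / 2


# Hypothetical sample of size n = 60: the values 1, 2, ..., 60.
sample = list(range(1, 61))
position_10th = percentile_position(0.10, len(sample))
tenth_percentile = percentile(sample, 0.10)
median = percentile(sample, 0.50)
print(position_10th, tenth_percentile, median)
```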
104
How can the lower quartile, median and upper quartile be used
to judge the symmetry of a distribution?
105
106
Box plots are useful for comparing two or
more groups of observations
107
Outlying values
• The lines coming out of the box are called the
“whiskers”.
• The ends of the “whiskers” are called “adjacent
values” (the largest and smallest non-outlying
values).
108
• The box plot is then completed:
– Draw a vertical bar from the upper quartile to the
largest non-outlying value in the sample
– Draw a vertical bar from the lower quartile to the
smallest non-outlying value in the sample
– Outliers are displayed as dots (or small circles) and
are defined by:
Values greater than 75th percentile + 1.5*IQR
Values smaller than 25th percentile − 1.5*IQR
– Any values that lie outside the box but are not
outliers are covered by the whiskers on the plot.
– IQR = P75 – P25
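The outlier rule and the whisker ends can be worked out as in the sketch below. The cigarettes-per-day values are hypothetical and the function name is made up. The quartiles come from `statistics.quantiles` with `method="exclusive"`, which places them at position p(n+1) but interpolates linearly when that position is not a whole number. With the 11 values used here the positions are whole numbers, so no interpolation occurs.

```python
import statistics


def box_plot_summary(data):
    """Quartiles, IQR fences, whisker ends and outliers for a box plot."""
    ordered = sorted(data)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Non-outlying values lie on or between the two fences.
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],    # smallest non-outlying value
        "whisker_high": inside[-1],  # largest non-outlying value
        "outliers": outliers,
    }


# Hypothetical number of cigarettes smoked per day by 11 subjects.
cigarettes = [10, 2, 60, 15, 8, 20, 12, 5, 22, 18, 10]
summary = box_plot_summary(cigarettes)
print(summary)
```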
109
• Number of cigarettes smoked per day was
measured just before each subject attempted to
quit smoking
110
10. Scatter plot
• Most studies in medicine involve measuring more
than one characteristic, and graphs displaying the
relationship between two characteristics are
common in literature.
• When both the variables are qualitative then we
can use a multiple bar graph.
• When one of the characteristics is qualitative and
the other is quantitative, the data can be displayed
in box and whisker plots.
111
• For two quantitative variables we use bivariate
plots (also called scatter plots or scatter
diagrams).
114
No. of microscopically confirmed malaria cases by species and
month at Zeway malaria control unit, 2003
115
A line graph can also be used to depict the relationship between two
continuous variables, as a scatter diagram does.
116
117
Thank you
118