Chapter 1 (2)
Introduction
Definition of terms
Data: figures or facts from which conclusions can be drawn. Data are the numerical results of
any scientific measurement. Any value that is expressed in numbers is called data.
Population: the totality of all elements under study.
Sample: a portion or part of the population taken so that some generalization about the
population can be made. It is a subset of the population that is assumed to be
representative of the population.
Definition of Statistics
Statistics can be defined in two senses: plural (as Statistical Data) and singular (as Statistical
Methods).
Plural sense: Statistics are a collection of facts (figures). This meaning of the word is widely used
when reference is made to facts and figures on sales, employment or unemployment, accidents,
weather, deaths, education, etc. Eg: Sales Statistics, Labor Statistics, Employment Statistics,
etc. In this sense the word Statistics serves simply as data. But not all numerical data are
statistics.
Singular sense: Statistics is the science that deals with the methods of data collection,
organization, presentation, analysis and interpretation of data. It refers to the subject area that is
concerned with extracting relevant information from available data with the aim to make sound
decisions. According to this meaning, statistics is concerned with the development and
application of methods and techniques for collecting, organizing, presenting, analyzing and
interpreting statistical data.
1. Collection of Data: This is the first stage in any statistical investigation and involves the
process of obtaining (gathering) a set of related measurements or counts to meet
predetermined objectives. The data collected may be primary data (data collected directly by
the investigator) or secondary data (data obtained from intermediate sources such
as newspapers, journals, official records, etc.).
2. Organization of Data: It is usually not possible to derive any conclusion about the main
features of the data from direct inspection of the observations. The second purpose of
statistics is describing the properties of the data in a summary form. This stage of statistical
investigation helps to have a clear understanding of the information gathered and includes
editing (correcting), classifying and tabulating the collected data in a systematic manner.
Thus the first step in the organization of data is editing. It means correcting (adjusting)
omissions, inconsistencies, irrelevant answers and wrong computations in the collected data.
The second step of the organization of data is classification, that is, arranging the collected
data according to some common characteristics. The last step of the organization of data is
presenting the classified data in tabular form, using rows and columns (tabulation).
3. Presentation of Data: The purpose of data presentation is to have an overview of what the data
actually look like, and to facilitate statistical analysis. Data presentation can be done using
graphs and diagrams, which are easy to remember and facilitate comparison.
4. Analysis of Data: The analysis of data is the extraction of summarized and comprehensive
numerical description in order to reach conclusions or provide answers to a problem. The
problem may require simple or sophisticated mathematical expressions.
5. Interpretation of Data: This is the last stage of statistical investigation. Interpretation
involves drawing conclusions from the data collected and analyzed in order to make decisions.
Classification of Statistics
Based on the scope of the decision, statistics can be classified into two types: Descriptive and
Inferential Statistics.
Descriptive Statistics refers to the procedures used to organize and summarize masses of data.
It is concerned with describing or summarizing the most important features of the data. It deals
only with the characteristics of the collected data, without attempting to infer (conclude)
anything that goes beyond the data themselves. The methodology of descriptive
statistics includes the methods of organizing (classification, tabulation, frequency distributions)
and presenting (graphical and diagrammatic presentation) data, and the calculation of certain
indicators such as Measures of Central Tendency and Measures of Dispersion (Variation),
which summarize some important features of the data.
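As a small illustration of the descriptive measures just mentioned, the following sketch computes a few measures of central tendency and dispersion with Python's standard `statistics` module. The data set (`heights`) and the helper name `describe` are hypothetical and are used only for illustration.

```python
import statistics

# Hypothetical data: heights (in cm) of ten students. Not taken from the notes.
heights = [150, 152, 155, 155, 158, 160, 162, 165, 168, 175]

def describe(values):
    """Return a few descriptive measures for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),      # measure of central tendency
        "median": statistics.median(values),  # measure of central tendency
        "range": max(values) - min(values),   # measure of dispersion
        "stdev": statistics.stdev(values),    # sample standard deviation
    }

summary = describe(heights)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Nothing here goes beyond the ten observed values: the summary describes this data set only, which is exactly the limit of descriptive statistics.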
Inferential (Inductive) Statistics includes the methods used to find out something about a
population, based on a sample. It is concerned with drawing statistically valid conclusions
about the characteristics of the population based on information obtained from a sample. In this
form of statistical analysis, descriptive statistics is linked with probability theory in order to
generalize the results of the sample to the population. Hypothesis testing, determining
relationships between variables and making predictions are also part of inferential statistics.
Applications of Statistics
In this modern time, statistical information plays a very important role in a wide range of fields.
Today, statistics is applied in almost all fields of human endeavor.
In Public Health and Medicine: statistical methods are used for computation and
interpretation of birth and death rates.
In Economics: for modeling functional relationships between or among variables.
In Education and Agricultural Extension: to study the effects of certain trainings.
In Natural and Social Sciences, Business, Planning, Behavioral Sciences, etc.
Uses of Statistics
Condenses and summarizes masses of data and presents facts in numerical and definite
form.
Facilitates comparison: statistical devices such as averages, percentages, ratios, etc. are
used for this purpose.
Formulating and testing hypotheses: For instance, hypotheses such as whether a new medicine
is effective in curing a disease, or whether there is an association between variables, can be
tested using statistical tools.
Forecasting: Statistical methods help in studying past data and predicting future trends.
Limitations of Statistics
Variable
It is a characteristic or an attribute that can assume different values.
Eg: Height, Family size, Gender
Based on the values that variables assume, variables can be classified as
1. Qualitative variables: do not assume numeric values.
Eg: Gender
2. Quantitative variables: assume numeric values. These variables are numeric in
nature.
Eg: Height, Family size
Discrete variable: takes whole number values and consists of distinct
recognizable individual elements that can be counted. It is a variable that
assumes a finite or countable number of possible values. These values are
obtained by counting (0, 1, 2, ...).
Eg: Family size, Number of children in a family, number of cars at the
traffic light
Continuous variable: takes any value including decimals. Such a variable
can theoretically assume an infinite number of possible values. These values
are obtained by measuring.
Eg: Height, Weight, Time, and Temperature
Generally the values of a variable can be obtained either by counting for discrete variables, by
measuring for continuous variables or by making categories for qualitative variables.
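The classification above can be sketched as a small helper. This is only a heuristic under a stated assumption: it guesses the type from how the values happen to be recorded (text, whole numbers, or decimals), which is not always how the variable behaves in reality. The function name `classify_variable` and the sample values are hypothetical.

```python
def classify_variable(values):
    """Guess the type of a variable from a list of its observed values.

    Assumption for illustration: text values are qualitative, whole-number
    values are discrete, and other numeric values are continuous.
    """
    def is_number(v):
        # bool is excluded because Python treats True/False as integers.
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if all(isinstance(v, str) for v in values):
        return "qualitative"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "quantitative, discrete"
    if all(is_number(v) for v in values):
        return "quantitative, continuous"
    raise ValueError("mixed or unsupported value types")

print(classify_variable(["Female", "Male", "Female"]))  # Gender
print(classify_variable([3, 5, 4, 6]))                  # Family size
print(classify_variable([1.62, 1.75, 1.80]))            # Height in metres
```

A height recorded to the nearest centimetre would be stored as whole numbers and misclassified as discrete, so the result should be checked against how the values were obtained (counting, measuring, or categorizing).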
Ex: Classify each of the following as Qualitative or Quantitative, and if it is quantitative, classify
it as Discrete or Continuous.
Mr A scored 5 in Stat quiz.
Mr B scored 6 in Stat quiz.
Who did better?
What is the average score?
Based on the numbers on the shirts, it is not possible to judge whether Mr B plays better. But by
using the test scores, it is possible to judge that Mr B did better in the exam. Also, it is not possible
to find a meaningful average of the shirt numbers, because the numbers
on the shirts are simply codes, but it is possible to obtain the average test score.
Ordinal Scales of variables are also those qualitative variables whose values can be ordered
and ranked. Ranking and counting are the only mathematical operations that can be done on the
values of the variables. But there is no precise difference between the values (categories) of
the variable.
Eg: Academic qualifications (B.Sc., M.Sc., Ph.D.), Grade Scores (A, B, C, D, F), Strength
(very weak, weak, strong, very strong), Health status (very sick, sick, cured)
Interval Scales of variables are those quantitative variables for which a value of zero does not
indicate absence of the characteristic, i.e. there is no true zero. Zero indicates a low value
rather than nothing. There is a precise difference between the units of measurement (levels).
Eg: Temperature: 0 °C does not mean there is no temperature; it means it is very cold.
A student who scored “0” does not necessarily have no knowledge. A person having an IQ of
0 does not have no intelligence.
Ratio Scales of variables are those quantitative variables for which a value of zero indicates
absence of the characteristic, i.e. there is a true zero.
Eg: Height, Weight, Income, Amount of yield, Expenditure, Consumption.
All mathematical operations may be performed on the values of the variables.
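The difference between interval and ratio scales can be illustrated with a short sketch. It uses the kelvin temperature scale, which is not mentioned in these notes but does have a true zero, to show that differences are meaningful on an interval scale while ratios are not. The temperature and weight values are hypothetical.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to kelvin (a scale with a true zero)."""
    return celsius + 273.15

cool_c, warm_c = 10.0, 20.0  # hypothetical temperatures in degrees Celsius

# Differences are meaningful on an interval scale and agree across units.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios are not: the apparent "twice as hot" in Celsius is an artifact of
# where zero was placed, and it disappears when a true zero is used.
ratio_c = warm_c / cool_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Weight is on a ratio scale: the ratio is the same whatever the unit.
ratio_kg = 40.0 / 20.0
ratio_g = 40_000.0 / 20_000.0

print(f"difference: {diff_c:.2f} (Celsius), {diff_k:.2f} (kelvin)")
print(f"ratio: {ratio_c:.2f} (Celsius), {ratio_k:.4f} (kelvin)")
print(f"weight ratio: {ratio_kg:.1f} (kg), {ratio_g:.1f} (g)")
```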
Chapter Two
Data Collection and Presentation
Classification of Data
Based on the source, data can be classified into two: Primary Data and Secondary Data.
Primary data are data collected for the first time either through direct observation or by
enquiring individuals. It refers to the data collected either by or under the direct
supervision and instruction of the researcher.
Secondary data are data obtained from published or unpublished sources like
newspapers, journals, official records, etc.
Based on the role of time, data can be classified as Cross-sectional and Time series.
The first and foremost task in statistical investigation is data collection. Before data collection,
four important points should be considered. These are the purpose of data collection (why we
need to collect data), the data to be collected (what kind of data to be collected), the source of
data (where we can get the data) and the methods of data collection (how can we collect this
data). These steps are called the why, what, where and how of the data collection.
Primary data are collected from primary sources and secondary data from secondary sources.
Primary data can be collected through experimental methods in the laboratory in natural sciences
and through survey methods in social sciences.
The survey methods of data collection are personal interview, telephone interview, mailed
questionnaire and personal observation.
Observational Method: This method involves monitoring of an ongoing activity and
direct recording of data. It avoids incompleteness of data. However, it is rarely used as it
is not possible to plan when the events will happen.
Personal Interview: a trained interviewer asks a series of questions and records
responses on a specially designed form called a questionnaire. In this approach the
enumerator is with the respondent, so s/he can explain any points that are not clear to the
respondent. The quality of the data is affected by both the design of the
questionnaire and the quality of the interviewer.
Advantage:
It allows in-depth information to be obtained from the person being interviewed, since the
interviewer can clarify the questions, and it avoids incomplete and disordered
responses.
Disadvantage:
It is costlier than other methods, since it requires training of interviewers
and transportation costs.
The respondent may not give true information on sensitive questions,
since the interaction is face to face. Eg: when asked about salary, a respondent whose
salary is very small might report a wrong figure out of
embarrassment.
Telephone Interview: This method involves contacting the respondent by telephone and
collecting information. It is a faster way to collect information. The absence of telephone lines
makes this approach less usable. It also cannot be used for rural surveys where there are no lines.
Advantage: It is less costly, since it requires fewer interviewers and the cost of
calling is less than the cost of transportation. The respondent may give his/her opinion
candidly since there is no face to face interaction. Because of this, the data obtained
through this method may be more realistic than data from personal interviews.
Disadvantage: This method is not applicable where access to telephones is lacking, as in some
developing countries. The respondent might not be at home or may not respond
to the call, and in the meantime the interviewer might get bored. There is a high chance
of getting incomplete responses, since the connection can be interrupted.
Mailed Questionnaire: the researcher sends the questionnaire to the respondents; the
respondents complete the form and send it back to the researcher.
Advantage:
Costs are low. The responses are free from the biases of the interviewer, and respondents
have more time to give well thought out answers.
Disadvantage: It is applicable only for educated persons. Non-response, partial response and
low return rates are common. The respondents might give inappropriate answers: since
no one is there with them, they may misunderstand a question and answer it
incorrectly.
Types of Surveys
In general there are two methods of data collection: Census Survey and Sample Survey
Method.
Census Survey (complete enumeration): a study that covers all the elements in the population
under consideration. In this method we resort to a 100% inspection of the population, and each and
every unit of the population is enumerated. It enables us to obtain information about each and every
element in the population.
Sample Survey: a survey in which some elements that are representative of the population
(a sample) are taken to infer about the whole population. It is a statistical process in which we
select and examine a sample instead of considering the whole population.
The sampling method has many advantages over the census method.
1. Sampling reduces cost of data collection.
2. Greater speed i.e. it enables us to obtain results on time.
3. Greater accuracy. It helps us to get data of good quality: as the number of
enumerators decreases, we can train and supervise them well in the process of data
collection.
4. Greater scope (under circumstances where human and material resources are
limited).
5. Census may be destructive. Samples reduce the damage caused by some tests in
quality control. For example, when cooking, mothers check whether the food has
enough salt, spices, butter and so on by taking a small amount and testing
it. What would happen if the test used up everything in the dish?
6. Complete enumeration may be impossible or impractical (when the population is
infinite), thus sampling is the only way.
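The idea of examining a sample instead of the whole population can be sketched as follows. This is a minimal illustration that assumes simple random sampling without replacement, which the notes do not prescribe. The population of household incomes is generated artificially, and the names `population` and `simple_random_sample` are hypothetical.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: monthly incomes of 1,000 households.
population = [rng.randint(500, 5000) for _ in range(1000)]

def simple_random_sample(units, n, rng):
    """Draw n units without replacement, each unit equally likely."""
    return rng.sample(units, n)

sample = simple_random_sample(population, 50, rng)

# A census inspects all 1,000 units; the sample survey inspects only 50.
print("population mean (census):", statistics.mean(population))
print("sample mean (sample survey):", statistics.mean(sample))
```

The sample mean will generally differ somewhat from the population mean. How large that difference is likely to be is a question for inferential statistics.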
Questionnaire
It is a form containing the cover letter that explains about the person conducting the survey and
the objectives of the survey, and a set of related questions which will be answered by the
respondents.
Great care is required in preparing a questionnaire for data collection. One of the most important
points in preparing it is that all its questions must be relevant to the objectives of the
survey.
A highly structured questionnaire is one in which the questions to be asked and the responses
permitted are completely predetermined.
Lower price
Good quality
Better picture
Longer guarantee
This is accomplished by employing fixed alternative questions in which the respondents are
limited to the stated alternative.
A highly unstructured questionnaire is one in which the questions to be asked are only
loosely determined, and the respondent is free to respond in his/her own words and in any way s/he
sees fit. Unstructured questionnaires have the following problems:
1. They are slow and, hence, costly to administer in the field and to tabulate.
2. The data collection process and the interpretation of the results are both subjective
and, hence, open to bias.
Structured techniques overcome these problems, but they are difficult to use in situations where
respondents may hesitate to report their attitudes.
The disguised questionnaire attempts to hide the purpose of the study, whereas the undisguised
questionnaire is one in which the purpose of the research is obvious from the questions posed.
Structured undisguised questionnaires are the most commonly used type in practice. They are
simple to administer and easy to tabulate and analyze. Fixed alternative questions are more
productive when the possible replies are well known, limited in number and clear cut.
The unstructured undisguised questionnaire is one in which the purpose of the study is not
concealed but the response to the question is open ended. Consider the question “How do you
express the need for democracy in developing countries?” Such a question gives the respondent
complete freedom. However, the responses are difficult to tabulate and analyze.
In the unstructured disguised questionnaire, the respondents are not told the purpose of the
study, and the questions are framed so that the respondent has complete freedom
to answer. The basic philosophy underlying such a questionnaire is that the more unstructured
and ambiguous a stimulus is, the more a subject can and will project his/her emotions,
needs, motivations, attitudes and values. The practical difficulties of editing, coding and
tabulating replies impose serious limitations on the use of this method. It is more often
used for investigative research. The unstructured undisguised questionnaires are also not
popular in practice.
Having decided which type of questionnaire to use, the following points
should be kept in mind while designing a questionnaire.
The person conducting the survey should introduce himself/herself, state the objective of the
survey, promise anonymity and include such instructions as are necessary for giving
correct responses (in the cover letter).
The number of questions should be as few as possible.
Once the objectives of the survey are clearly defined, only questions pertinent to the
objectives should be asked. The time of the respondent should not be wasted by asking irrelevant
questions. In general, 5 to 25 questions may be regarded as a fair number. If a lengthy questionnaire
is unavoidable, it should preferably be divided into two or more parts.
Questions should be logically arranged. Put the questions in the appropriate sequence of
topics. Topics should not be mixed up.
The questions should be in a logical order so that a natural and spontaneous reply is
induced. They should not skip back and forth.
It is undesirable to ask a person how many children s/he has before asking whether s/he is
married or not.
Questions related to identification and description of the respondent should come first,
followed by major information questions. If opinions are requested, such questions
should usually be placed at the end of the list.
Questions should be simple, short and easy to understand, and they should convey one
and only one idea. Technical terms should be avoided.
Sensitive questions (questions of a personal or financial nature) should be avoided. If they are
needed, the information should be obtained indirectly, by offering a set of ranges, and such
questions should be put in the last part of the questionnaire. Eg: Age (0-25, 26-50, 51-75, >75)
Salary (Below 200, 200-500, 500-1000, >1000)
Leading questions should be completely avoided. If you ask a person “Don't you
smoke?” the person will automatically say ‘Yes, I do not’.
Answers to the questions should not require any calculation.
There should be instructions on how to fill in the form.
Questions should be capable of objective answers.
Types of questions
Different types of questions that may form a questionnaire can be grouped into three
categories.
1. Dichotomous questions
2. Multiple-choice questions, and
3. Open-ended questions
Dichotomous questions are questions that have two alternative responses. Such
questions can be answered ‘Yes’ or ‘No’.
Yes No
Using such questions in a questionnaire is an excellent technique, but it applies only
to situations where a clear alternative exists.
However, if a question does not have clear choices like ‘Yes’ or ‘No’, it cannot be
used as a dichotomous question, or additional answers should be added as follows.
Yes No
If Yes, how often
Always Occasionally
Seldom
Multiple-choice questions: in such types of questions the respondent is asked to select one out of
a number of alternative responses. This process not only facilitates tabulation of data but also
takes very little time of the respondent to fill the questionnaire.
The problem with multiple-choice questions is that the respondent may like to tick more than one
alternative. To avoid this problem, we either have to ask the respondent to choose the
most important one or to rank his/her choices.
The use of multiple-choice questions is indicated only when the investigator is confident of the
existence of a limited group of important alternatives.
Open-ended or free answer questions: In such types of questions, the respondent will have the
chance to answer the questions in his/her own words.
Eg: -What is your opinion on the teaching policy?
The difficulty with these types of questions is in classifying the responses during tabulation and
analysis.
Pre-test: test the questionnaire on a small number of respondents for correction before
actual data collection. The pre-test helps the researcher to improve the language and structure of
the questionnaire. It can also help to estimate the average time taken to complete the
questionnaire and thereby to estimate the cost required for the survey.
Secondary data
Secondary data should be used with utmost care. So before using this data, the following three
points should be considered.
1. Whether the data are suitable for the purpose of investigation. This can be judged in the
light of the nature and scope of investigation.
2. If the data obtained are suitable for our purpose, it should be checked whether the data are
adequate for the purpose of investigation. This can be judged in the light of the time and
geographical area covered by the available data.
3. Whether the data are reliable. The data obtained should be checked for its accuracy.
To describe situations, draw conclusions, or make inferences about events, the researcher must
organize the data in some meaningful way. The most convenient method of organizing data is to
construct a frequency distribution.
Two types of frequency distributions that are most often used are the categorical frequency
distribution and the grouped frequency distribution. The procedures for constructing these
distributions are shown now.
Categorical Frequency Distributions
The categorical frequency distribution is used for data that can be placed in specific categories,
such as nominal- or ordinal-level data. For example, data such as political affiliation, religious
affiliation, or major field of study would use categorical frequency distributions.
Nominal Scales of variables are those qualitative variables which show category of
individuals. They reflect classification into categories (name of groups) where there is no
particular order or qualitative difference to the labels. Numbers may be assigned to the
variables simply for coding purposes. It is not possible to compare individuals based on the
numbers assigned to them. The only mathematical operation permissible on these variables is
counting.
These variables have mutually exclusive (non-overlapping) and exhaustive categories.
There is no ranking or order between (among) the values of the variable.
Eg: Gender, Religion, ID No, Ethnicity, Color
Ordinal Scales of variables are also those qualitative variables whose values can be ordered
and ranked. Ranking and counting are the only mathematical operations to be done on the
values of the variables. But there is no precise difference between the values (categories) of
the variable.
Eg: Academic qualifications (B.Sc., M.Sc., Ph.D.), Grade Scores (A, B, C, D, F), Strength
(very weak, weak, strong, very strong), Health status (very sick, sick, cured).
Example: Distribution of Blood Types
Twenty-five army inductees were given a blood test to determine their blood type. The data
set is
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
Construct a frequency distribution for the data.
Solution
Since the data are categorical, discrete classes can be used. There are four blood types: A, B,
O, and AB. These types will be used as the classes for the distribution.
The procedure for constructing a frequency distribution for categorical data is given next.
Step 1: Make a table as shown.
A B C D
Class Tally Frequency Percent
A
B
O
AB
Step 2: Tally the data and place the results in column B.
Step 3: Count the tallies and place the results in column C.
Step 4: Find the percentage of values in each class by using the formula
\[ \% = \frac{f}{n} \times 100\% \]
where f = frequency of the class and n = total number of values. For example, in the class of
type A blood, the percentage is:
\[ \% = \frac{5}{25} \times 100\% = 20\% \]
Percentages are not normally part of a frequency distribution, but they can be added since
they are used in certain types of graphs such as pie graphs.
Also, the decimal equivalent of a percent is called a relative frequency.
Step 5: Find the totals for columns C (frequency) and D (percent). The completed table is
shown.
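Steps 2 to 5 can be sketched in a few lines of Python. This is a minimal illustration, not part of the original procedure: it uses `collections.Counter` for the tallying step, and the function name `categorical_frequency` is hypothetical. The data are the 25 blood types listed in the example.

```python
from collections import Counter

# The 25 blood types from the example, read row by row.
blood_types = """
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
""".split()

def categorical_frequency(data, classes):
    """Return (class, frequency, percent) rows for categorical data."""
    counts = Counter(data)  # Steps 2 and 3: tally and count
    n = len(data)
    # Step 4: percent = f / n * 100, written as 100 * f / n to stay exact here.
    return [(c, counts[c], 100 * counts[c] / n) for c in classes]

table = categorical_frequency(blood_types, ["A", "B", "O", "AB"])

print(f"{'Class':<6}{'Frequency':>10}{'Percent':>9}")
for cls, f, pct in table:
    print(f"{cls:<6}{f:>10}{pct:>8.0f}%")
# Step 5: totals for the frequency and percent columns.
total_f = sum(f for _, f, _ in table)
total_pct = sum(pct for _, _, pct in table)
print(f"{'Total':<6}{total_f:>10}{total_pct:>8.0f}%")
```

For these data the frequencies are 5 (A), 7 (B), 9 (O) and 4 (AB), which correspond to 20%, 28%, 36% and 16% and sum to 25 and 100%.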
Grouped Frequency Distributions
When the range of the data is large, the data must be grouped into classes that are more than
one unit in width, in what is called a grouped frequency distribution.
Step 1: Determine the classes.
Select the number of classes desired.
Find the width by dividing the range by the number of classes and rounding up.
Select a starting point (usually the lowest value or any convenient number less than the
lowest value); add the width to get the lower limits.
Find the upper class limits.
Find the boundaries.
Step 2: Tally the data.
Step 3: Find the numerical frequencies from the tallies, and find the cumulative frequencies.
For example, a distribution of the number of hours that boat batteries lasted is the following.
Class limits    Class boundaries    Tally         Frequency
24–30           23.5–30.5           ///           3
31–37           30.5–37.5           /             1
38–44           37.5–44.5           /////         5
45–51           44.5–51.5           ///// ////    9
52–58           51.5–58.5           ///// /       6
59–65           58.5–65.5           /             1
                                    Total         25
The procedure for constructing the preceding frequency distribution is given in Example 2–2;
however, several things should be noted. In this distribution, the values 24 and 30 of the first
class are called class limits. The lower class limit is 24; it represents the smallest data value
that can be included in the class. The upper class limit is 30; it represents the largest data
value that can be included in the class. The numbers in the second column are called class
boundaries. These numbers are used to separate the classes so that there are no gaps in the
frequency distribution. The gaps are due to the limits; for example, there is a gap between 30
and 31. Students sometimes have difficulty finding class boundaries when given the class
limits. The basic rule of thumb is that the class limits should have the same decimal place
value as the data, but the class boundaries should have one additional place value and end in
a 5. For example, if the values in the data set are whole numbers, such as 24, 32, and 18, the
limits for a class might be 31–37, and the boundaries are 30.5–37.5. Find the boundaries by
subtracting 0.5 from 31 (the lower class limit) and adding 0.5 to 37 (the upper class limit).
Lower limit - 0.5 = 31 - 0.5 = 30.5 = Lower boundary
Upper limit + 0.5 = 37 + 0.5 = 37.5 = Upper boundary
If the data are in tenths, such as 6.2, 7.8, and 12.6, the limits for a class hypothetically might
be 7.8–8.8, and the boundaries for that class would be 7.75–8.85. Find these values by
subtracting 0.05 from 7.8 and adding 0.05 to 8.8.
Finally, the class width for a class in a frequency distribution is found by subtracting the
lower (or upper) class limit of one class from the lower (or upper) class limit of the next
class. For example, the class width in the preceding distribution on the duration of boat
batteries is 7, found from 31 - 24 = 7.
The class width can also be found by subtracting the lower boundary from the upper
boundary for any given class. In this case, 30.5 - 23.5 = 7
Note: Do not subtract the limits of a single class. It will result in an incorrect answer.
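The boundary and width rules above can be sketched as two small helpers. The function names and the `unit` parameter (the smallest step in the data, such as 1 for whole numbers or 0.1 for tenths) are assumptions made for this illustration. `decimal.Decimal` is used so that values such as 7.75 come out exactly.

```python
from decimal import Decimal

def class_boundaries(lower_limit, upper_limit, unit):
    """Boundaries lie half a unit below the lower limit and above the upper limit.

    Arguments are passed as strings so Decimal keeps them exact.
    """
    half = Decimal(unit) / 2
    return Decimal(lower_limit) - half, Decimal(upper_limit) + half

def class_width(lower_limit, next_lower_limit):
    """Width = lower limit of the next class minus lower limit of this class."""
    return Decimal(next_lower_limit) - Decimal(lower_limit)

# Whole-number data: limits 31-37 give boundaries 30.5-37.5.
print(class_boundaries("31", "37", "1"))
# Data in tenths: limits 7.8-8.8 give boundaries 7.75-8.85.
print(class_boundaries("7.8", "8.8", "0.1"))

# Boat battery example: the width is 31 - 24 = 7.
width = class_width("24", "31")
low_b, up_b = class_boundaries("24", "30", "1")
print(width, up_b - low_b)  # both routes give 7

# Subtracting the limits of a single class gives 6, not the width.
print(Decimal("30") - Decimal("24"))
```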
The researcher must decide how many classes to use and the width of each class. To
construct a frequency distribution, follow these rules:
1. There should be between 5 and 20 classes. Although there is no hard-and-fast rule for the
number of classes contained in a frequency distribution, it is of the utmost importance to
have enough classes to present a clear description of the collected data.
2. It is preferable but not absolutely necessary that the class width be an odd number.
This ensures that the midpoint of each class has the same place value as the data.
The class midpoint Xm is obtained by adding the lower and upper boundaries and
dividing by 2, or adding the lower and upper limits and dividing by 2:
\[ X_m = \frac{\text{lower boundary} + \text{upper boundary}}{2} \]
or
\[ X_m = \frac{\text{lower limit} + \text{upper limit}}{2} \]
For example, the midpoint of the first class in the example with boat batteries is:
\[ \frac{24 + 30}{2} = 27 \quad \text{or} \quad \frac{23.5 + 30.5}{2} = 27 \]
The midpoint is the numeric location of the center of the class. Midpoints are necessary for
graphing (see Section 2–2). If the class width is an even number, the midpoint is in tenths.
For example, if the class width is 6 and the boundaries are 5.5 and 11.5, the midpoint is:
\[ \frac{5.5 + 11.5}{2} = 8.5 \]
Rule 2 is only a suggestion, and it is not rigorously followed, especially when a computer is
used to group data.
3. The classes must be mutually exclusive. Mutually exclusive classes have non-overlapping
class limits so that data cannot be placed into two classes. Many times, frequency
distributions such as:
Age
10–20
20–30
30–40
40–50
are found in the literature or in surveys. If a person is 40 years old, into which class should
she or he be placed? A better way to construct a frequency distribution is to use classes such
as:
Age
10–20
21–31
32–42
43–53
4. The classes must be continuous. Even if there are no values in a class, the class must be
included in the frequency distribution. There should be no gaps in a frequency distribution.
The only exception occurs when the class with a zero frequency is the first or last class. A
class with a zero frequency at either end can be omitted without affecting the distribution.
5. The classes must be exhaustive. There should be enough classes to accommodate all the
data.
6. The classes must be equal in width. This avoids a distorted view of the data.
One exception occurs when a distribution has a class that is open-ended. That is, the class has
no specific beginning value or no specific ending value. A frequency distribution with an
open-ended class is called an open-ended distribution. Here are two examples of
distributions with open-ended classes.
Age             Frequency        Minutes      Frequency
10–20           3                below 110    16
21–31           6                110–114      24
32–42           4                115–119      38
43–53           10               120–124      14
54 and above    8                125–129      5
The frequency distribution for age is open-ended for the last class, which means that anybody
who is 54 years or older will be tallied in the last class. The distribution for minutes is open-
ended for the first class, meaning that any minute values below 110 will be tallied in that
class.
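Several of the rules above (midpoints, mutually exclusive classes, no gaps, equal widths) can be checked mechanically. The following sketch assumes whole-number data and classes given as (lower limit, upper limit) pairs in ascending order. It does not handle open-ended classes, and the names `midpoint` and `check_classes` are hypothetical.

```python
def midpoint(lower, upper):
    """Class midpoint: (lower + upper) / 2, using limits or boundaries."""
    return (lower + upper) / 2

def check_classes(classes, unit=1):
    """Return a list of rule violations for (lower, upper) class limits."""
    problems = []
    pairs = list(zip(classes, classes[1:]))
    for (lo1, up1), (lo2, up2) in pairs:
        if lo2 <= up1:
            # Rule 3: a value could fall into two classes.
            problems.append(f"classes {lo1}-{up1} and {lo2}-{up2} overlap")
        elif lo2 - up1 > unit:
            # Rule 4: some values would fall into no class.
            problems.append(f"gap between {lo1}-{up1} and {lo2}-{up2}")
    # Rule 6: the step between consecutive limits must be constant.
    widths = {lo2 - lo1 for (lo1, _), (lo2, _) in pairs}
    widths |= {up2 - up1 for (_, up1), (_, up2) in pairs}
    if len(widths) > 1:
        problems.append("classes are not equal in width")
    return problems

overlapping_ages = [(10, 20), (20, 30), (30, 40), (40, 50)]
better_ages = [(10, 20), (21, 31), (32, 42), (43, 53)]

print(check_classes(overlapping_ages))  # three overlaps, e.g. 40 fits two classes
print(check_classes(better_ages))       # no problems reported
print(midpoint(24, 30), midpoint(23.5, 30.5), midpoint(5.5, 11.5))
```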
Example 2–2 shows the procedure for constructing a grouped frequency distribution, i.e.,
when the classes contain more than one data value.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
Page: 43 (Bluman)
Required:
i. Determine the classes.
ii. Tally the data.
iii. Find the numerical frequencies from the tallies, and find the cumulative frequencies.
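The three required steps can be sketched for the 50 values above. The choice of 7 classes is an assumption made for this illustration (any number between 5 and 20 is allowed by the rules), the starting point is taken as the lowest value, and the function name `grouped_frequency` is hypothetical. The sketch assumes whole-number data.

```python
import math

data = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
    110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
    116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 114,
]

def grouped_frequency(values, num_classes):
    """Build (lower, upper, frequency, cumulative frequency) rows."""
    low, high = min(values), max(values)
    # i. Determine the classes: width = range / number of classes, rounded up.
    width = math.ceil((high - low) / num_classes)
    rows, cumulative, lower = [], 0, low
    while lower <= high:  # keep adding classes until every value is covered
        upper = lower + width - 1
        # ii. and iii. Tally the values in the class and count them.
        frequency = sum(lower <= x <= upper for x in values)
        cumulative += frequency
        rows.append((lower, upper, frequency, cumulative))
        lower += width
    return rows

rows = grouped_frequency(data, num_classes=7)  # 7 classes is an assumed choice
print("Limits     Boundaries     Frequency  Cumulative")
for lower, upper, f, cf in rows:
    print(f"{lower}-{upper}   {lower - 0.5}-{upper + 0.5}   {f:>9}  {cf:>10}")
```

With these choices the range is 134 - 100 = 34 and the width is 34/7 rounded up to 5, so the classes run from 100–104 to 130–134, with frequencies 2, 8, 18, 13, 7, 1 and 1 (total 50).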