RM Study Material - Unit 2

The document discusses different types of data and methods for collecting data. It describes four scales of measurement for data: nominal, ordinal, interval, and ratio. It also distinguishes between primary and secondary sources of data, with primary data collected directly by researchers and secondary data collected previously by others. The document outlines quantitative and qualitative methods for collecting primary data, including observation, interviews, questionnaires, and schedules. It also discusses secondary data collection methods and provides examples of time-series and cross-sectional data.

Uploaded by

Nidhip Shah
Copyright © All Rights Reserved

M. COM. SEM. – 2
Research Methodology
Unit – 2
Data Collection and Description
Compiled by – Dr. Ankit Bhojak

Data types:

Data can be classified into four scales: nominal, ordinal, interval or ratio.

Each level of measurement has some important properties that are useful to know.

1. Nominal Scale: Nominal variables (also called categorical variables) can be placed into categories. They don’t have a numeric value and so cannot be added, subtracted, divided or multiplied. They also have no order. Gender, religion and birthplace are examples of the nominal scale.
2. Ordinal Scale: The ordinal scale contains things that you can place in order, for example hottest to coldest, lightest to heaviest, or richest to poorest. Basically, if you can rank data by 1st, 2nd, 3rd place (and so on), then you have data that’s on an ordinal scale.
3. Interval Scale: An interval scale has ordered numbers with meaningful divisions. Temperature is on the interval scale: a difference of 10 degrees between 90 and 100 means the same as 10 degrees between 150 and 160. Compare that to high school ranking (which is ordinal), where the difference between 1st and 2nd might be 0.01 and between 10th and 11th 0.5. If you have meaningful divisions, you have something on the interval scale.
4. Ratio Scale: The ratio scale is exactly the same as the interval scale with one major difference: zero is meaningful. For example, a height of zero is meaningful (it means you don’t exist). Compare that to a temperature of zero on the Celsius or Fahrenheit scale, which exists but does not mean "no temperature" (in the Celsius scale it is simply the freezing point of water). Because zero is a true zero, ratios such as "twice as tall" are meaningful on a ratio scale.
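
The four scales differ mainly in which summaries make sense for them. The following is a minimal sketch in Python, using made-up values (the lists and variable names are assumptions for illustration only, not data from this unit):

```python
from statistics import mean, median, mode

# Hypothetical example values, one list per scale of measurement.
birthplace = ["Ahmedabad", "Surat", "Ahmedabad"]   # nominal
class_rank = [1, 2, 3, 4, 5]                       # ordinal
temperature_c = [20.0, 25.0, 30.0]                 # interval
height_cm = [150.0, 160.0, 180.0]                  # ratio

# Nominal: categories can only be counted, so the mode is the usual summary.
most_common_birthplace = mode(birthplace)

# Ordinal: order is meaningful, so the median can be used.
median_rank = median(class_rank)

# Interval: differences are meaningful, so the mean and gaps can be used,
# but ratios are not (30 degrees Celsius is not "1.5 times as hot" as 20).
mean_temperature = mean(temperature_c)
temperature_gap = temperature_c[2] - temperature_c[0]

# Ratio: zero is a true zero, so ratios are meaningful as well.
height_ratio = height_cm[2] / height_cm[0]

print(most_common_birthplace, median_rank, mean_temperature,
      temperature_gap, height_ratio)
```

Each scale also allows the summaries of the scales before it: a ratio variable such as height can still be ranked, averaged and compared by differences.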

Sources of Data:

There are two sources of data: primary sources and secondary sources.

Primary data is data that is collected by a researcher from first-hand sources, using methods such as experiments performed by the researcher, letters, surveys and censuses, and interviews.

Primary data is collected directly from the original source. It is not clouded by someone else’s views or judgments.

The term is used in contrast with secondary data. Secondary data is data gathered from studies, surveys, or experiments that have been run by other people or for other research. Examples of secondary sources include encyclopedias, essays, newspaper opinion pieces, reviews and textbooks.

Difference between Primary Data and Secondary Data.

Basis | Primary Data | Secondary Data
Definition | Primary data are those which are collected for the first time. | Secondary data refers to those data which have already been collected by some other person.
Originality | Primary data is original because these are collected by the Investigator for the first time. | Secondary data are not original because someone else has collected these for his own purpose.
Nature of data | Primary data are in the form of raw materials. | Secondary data are in the finished form.
Reliability and Suitability | Primary data are more reliable and suitable for the enquiry because it is collected for a particular purpose. | It is less reliable and less suitable as someone else has collected the data which may not perfectly match our purpose.
Time and Money | Collecting primary data is quite expensive both in time and money terms. | Secondary data requires less time and money so it is economical.
Precaution and Editing | No particular precaution or editing is required while using primary data as these have been collected with a definite purpose. | Both precaution and editing are essential as secondary data were collected by someone else for his own purpose.

Methods for collecting primary data:

Primary data or raw data is a type of information that is obtained directly from the first-hand source through experiments, surveys or observations. The primary data collection method is further classified into two types:

(1) Quantitative Data Collection Methods
(2) Qualitative Data Collection Methods

Quantitative Data Collection Methods:

It is based on mathematical calculations using various formats like close-ended questions, correlation and regression methods, mean, median or mode measures. This method is cheaper than qualitative data collection methods and it can be applied in a short duration of time.

Qualitative Data Collection Methods:

It does not involve any mathematical calculations. This method is closely associated with elements that are not quantifiable. This qualitative data collection method includes interviews, questionnaires, observations, case studies, etc. There are several methods to collect this type of data. They are:

(1) Observation Method
(2) Interview Method
(3) Questionnaire Method
(4) Schedules

Observation Method

The observation method is used when the study relates to behavioural science. This method is planned systematically. It is subject to many controls and checks. The different types of observation are structured and unstructured observation, controlled and uncontrolled observation, and participant, non-participant and disguised observation.

Interview Method

In this method, data is collected in the form of oral or verbal responses. It is achieved in two ways:

• Personal Interview – In this method, a person known as an interviewer asks questions face to face to the other person. The personal interview can be structured or unstructured, and may take the form of a direct investigation, a focused conversation, etc.
• Telephonic Interview – In this method, an interviewer obtains information by contacting people on the telephone and asking the questions or views orally.

Questionnaire Method

In this method, a set of questions is mailed to the respondent, who is expected to read, reply and subsequently return the questionnaire. The questions are printed in a definite order on the form. A good questionnaire should have the following features:

• Short and simple
• Should follow a logical sequence
• Provide adequate space for answers
• Avoid technical terms
• Should have good physical appearance, such as colour and quality of the paper, to attract the attention of the respondent

Schedules

This method is similar to the questionnaire method with a slight difference: enumerators are specially appointed for the purpose of filling in the schedules. The enumerator explains the aims and objects of the investigation and may remove misunderstandings, if any have come up. Enumerators should be trained to perform their job with hard work and patience.

Secondary Data Collection Methods

Secondary data is data collected by someone other than the actual user. It means that the
information is already available, and someone analyses it. The secondary data includes
magazines, newspapers, books, journals, etc. It may be either published data or unpublished data.

Published data are available in various resources including:

• Government publications
• Public records
• Historical and statistical documents
• Business documents
• Technical and trade journals

Unpublished data includes:

• Diaries
• Letters
• Unpublished biographies, etc.

Time-Series Data:

Time-series data is a set of observations collected at usually discrete and equally spaced time intervals. The daily closing price of a certain stock recorded over the last six weeks is an example of time-series data. Note that a too-long or too-short time period may lead to time-period bias.

Other examples of time-series data would be staff numbers at a particular institution taken on a
monthly basis in order to assess staff turnover rates and the number of students registered for a
particular course on a yearly basis. All of the above would be used to forecast likely data
patterns in the future.

Cross-Sectional Data:

Cross-sectional data are observations that come from different individuals or groups at a single
point in time. If one considered the closing prices of a group of 20 different tech stocks on
December 15, 1986, this would be an example of cross-sectional data. Note that the underlying
population should consist of members with similar characteristics. For example, suppose you are
interested in how much companies spend on research and development expenses. Firms in some
industries, such as retail, spend little on research and development (R&D), while firms in
industries such as technology spend heavily on R&D. Therefore, it's inappropriate to summarize
R&D data across all companies. Rather, analysts should summarize R&D data by industry and
then analyze the data in each industry group.

Other examples of cross-sectional data would be an inventory of all ice creams in stock at a
particular store and a list of grades obtained by a class of students on a specific test.

Panel Data:

Panel data, sometimes referred to as longitudinal data, is data that contains observations about
different cross sections across time. Examples of groups that may make up panel data series
include countries, firms, individuals, or demographic groups.

Like time series data, panel data contains observations collected at a regular frequency,
chronologically. Like cross-sectional data, panel data contains observations across a collection of
individuals.
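
The relationship between the three kinds of data can be sketched in Python. The course names and registration figures below are hypothetical, chosen only to show that a panel contains both a time series (one unit followed over time) and a cross-section (all units at one point in time):

```python
# Hypothetical panel: yearly student registrations for three courses.
panel = [
    {"course": "Accounting", "year": 2021, "students": 120},
    {"course": "Accounting", "year": 2022, "students": 135},
    {"course": "Accounting", "year": 2023, "students": 150},
    {"course": "Economics",  "year": 2021, "students": 80},
    {"course": "Economics",  "year": 2022, "students": 90},
    {"course": "Economics",  "year": 2023, "students": 85},
    {"course": "Statistics", "year": 2021, "students": 60},
    {"course": "Statistics", "year": 2022, "students": 70},
    {"course": "Statistics", "year": 2023, "students": 75},
]

# Time series: one unit (Accounting) observed at equally spaced times.
time_series = [(row["year"], row["students"])
               for row in panel if row["course"] == "Accounting"]

# Cross-section: all units observed at a single point in time (2022).
cross_section = {row["course"]: row["students"]
                 for row in panel if row["year"] == 2022}

print(time_series)
print(cross_section)
```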

Missing Data:
Missing data, or missing values, occur when no data value is stored for the variable in
an observation. Missing data are a common occurrence and can have a significant effect on the
conclusions that can be drawn from the data.

Missing data can occur because no information is provided for one or more items or for a whole unit. For example, respondents often skip items about private subjects such as income. Attrition is a type of missingness that can occur in longitudinal studies, for instance a study of development where a measurement is repeated after a certain period of time. Missingness occurs when participants drop out before the study ends and one or more measurements are missing. Understanding the reasons why data are missing is important for handling the remaining data correctly. If values are missing completely at random, the data sample is likely still representative of the population. But if the values are missing systematically, analysis may be biased. There are some methods available to handle missing data:

1. Listwise Deletion: Delete all data from any participant with missing values. If your sample is large enough, then you likely can drop data without substantial loss of statistical power. Be sure that the values are missing at random and that you are not inadvertently removing a class of participants.

2. Recover the Values: You can sometimes contact the participants and ask them to fill out the missing values. For in-person studies, we’ve found having an additional check for missing values before the participant leaves helps.

3. Replace Missing Values: Missing values can be replaced with estimated values, and a variety of techniques can be applied for this. These methods usually assume that the data are missing at random and hence can be estimated.
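
Methods 1 and 3 can be illustrated with a short Python sketch. The records are hypothetical, and mean imputation is used here only as one simple example of replacing a value with an estimate; the notes above do not prescribe a particular technique.

```python
from statistics import mean

# Hypothetical survey records; None marks a missing value.
records = [
    {"id": 1, "age": 25, "income": 30000},
    {"id": 2, "age": 31, "income": None},
    {"id": 3, "age": None, "income": 45000},
    {"id": 4, "age": 40, "income": 60000},
]

def listwise_delete(rows):
    """Keep only the rows that have no missing values at all."""
    return [row for row in rows
            if all(value is not None for value in row.values())]

def mean_impute(rows, field):
    """Replace missing values of `field` with the mean of the observed values."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill = mean(observed)
    return [dict(row, **{field: fill if row[field] is None else row[field]})
            for row in rows]

complete_cases = listwise_delete(records)      # participants 1 and 4 remain
imputed_income = mean_impute(records, "income")

print([row["id"] for row in complete_cases])
print([row["income"] for row in imputed_income])
```

Listwise deletion halves this small sample, which shows why it is only comfortable when the sample is large. Mean imputation keeps every participant but makes the imputed variable look less spread out than it really is.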

Outliers:

An outlier is an observation that lies an abnormal distance from other values in a random
sample from a population. In a sense, this definition leaves it up to the analyst to decide what
will be considered abnormal. Before abnormal observations can be singled out, it is necessary
to characterize normal observations.

Two activities are essential for characterizing a set of data:

1. Examination of the overall shape of the graphed data for important features, including
symmetry and departures from assumptions.

2. Examination of the data for unusual observations that are far removed from the mass of
data. These points are often referred to as outliers.

Two graphical techniques for identifying outliers are scatter plots and box plots. When the distribution is normal, an analytic procedure (Grubbs' Test) can also be used to detect outliers.

Box plot construction: The box plot is a useful graphical display for describing the behavior
of the data in the middle as well as at the ends of the distributions. The box plot uses the
median and the lower and upper quartiles (defined as the 25th and 75th percentiles). If the
lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the
interquartile range or IQ. A box plot is constructed by drawing a box between the upper and
lower quartiles with a solid line drawn across the box to locate the median. The following
quantities (called fences) are needed for identifying extreme values in the tails of the
distribution:

lower inner fence: Q1 - 1.5*IQ

upper inner fence: Q3 + 1.5*IQ

lower outer fence: Q1 - 3*IQ

upper outer fence: Q3 + 3*IQ

A point beyond an inner fence on either side is considered a mild outlier. A point beyond an
outer fence is considered an extreme outlier.
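
The fence rules above can be sketched in Python as follows. The data set is hypothetical, and the quartiles are computed with the "inclusive" method of the standard library; this is an assumption, since several quartile conventions exist and they can give slightly different fences.

```python
from statistics import quantiles

# Hypothetical data set, used only for illustration.
data = [2, 10, 12, 13, 14, 15, 16, 23, 40]

# Lower quartile, median and upper quartile (here Q1 = 12 and Q3 = 16).
q1, median, q3 = quantiles(data, n=4, method="inclusive")
iq = q3 - q1  # interquartile range

lower_inner, upper_inner = q1 - 1.5 * iq, q3 + 1.5 * iq
lower_outer, upper_outer = q1 - 3 * iq, q3 + 3 * iq

def classify(x):
    """Label a point as 'extreme', 'mild' or 'normal' using the fences."""
    if x < lower_outer or x > upper_outer:
        return "extreme"
    if x < lower_inner or x > upper_inner:
        return "mild"
    return "normal"

mild = [x for x in data if classify(x) == "mild"]
extreme = [x for x in data if classify(x) == "extreme"]

print("inner fences:", lower_inner, upper_inner)
print("outer fences:", lower_outer, upper_outer)
print("mild outliers:", mild, "extreme outliers:", extreme)
```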

Outliers should be investigated carefully. Often they contain valuable information about the
process under investigation or the data gathering and recording process. Before considering
the possible elimination of these points from the data, one should try to understand why they
appeared and whether it is likely similar values will continue to appear. Of course, outliers are
often bad data points.

Sampling Methods

Population and Sample: A population is the entire group that you want to draw conclusions
about. A sample is the specific group that you will collect data from. The size of the sample is
always less than the total size of the population.
In research, a population doesn’t always refer to people. It can mean a group containing elements
of anything you want to study, such as objects, events, organizations, countries, species,
organisms, etc.
Population and Sample Examples:

Population | Sample
Advertisements for IT jobs in India | The top 50 search results for advertisements for IT jobs in India
Undergraduate students in India | Undergraduate students from three top universities in India

Populations are used when your research question requires, or when you have access to, data
from every member of the population. Usually, it is only straightforward to collect data from
a whole population when it is small, accessible and cooperative.

Example: A high school administrator wants to analyze the final exam scores of all
graduating seniors to see if there is a trend. Since they are only interested in applying their
findings to the graduating seniors in this high school, they use the whole population dataset.

For larger and more dispersed populations, it is often difficult or impossible to collect data from
every individual. For example, every 10 years, the federal US government aims to count every
person living in the country using the US Census. This data is used to distribute funding across
the nation.

However, historically, marginalized and low-income groups have been difficult to contact, locate
and encourage participation from. Because of non-responses, the population count is incomplete
and biased towards some groups, which results in disproportionate funding across the country.

In cases like this, sampling can be used to make more precise inferences about the population.

Collecting data from a sample

When your population is large in size, geographically dispersed, or difficult to contact, it’s
necessary to use a sample. With statistical analysis, you can use sample data to make estimates
or test hypotheses about population data.

Example: You want to study political attitudes in young people. Your population is the 300,000
undergraduate students in the Netherlands. Because it’s not practical to collect data from all of
them, you use a sample of 300 undergraduate volunteers from three Dutch universities – this is
the group who will complete your online survey.

Ideally, a sample should be randomly selected and representative of the population.


Using probability sampling methods reduces the risk of sampling bias and enhances
both internal and external validity.

For practical reasons, researchers often use non-probability sampling methods. Non-probability
samples are chosen for specific criteria; they may be more convenient or cheaper to access.
Because of non-random selection methods, any statistical inferences about the broader
population will be weaker than with a probability sample.

Reasons for sampling

• Necessity: Sometimes it’s simply not possible to study the whole population due to its size or inaccessibility.
• Practicality: It’s easier and more efficient to collect data from a sample.
• Cost-effectiveness: There are fewer participant, laboratory, equipment, and researcher costs involved.
• Manageability: Storing and running statistical analyses on smaller datasets is easier and more reliable.

Characteristics of a Good Sample

1. Goal-oriented: A sample design should be goal oriented. It should relate to the research objectives and be fitted to the survey conditions.
2. Accurate representative of the universe: A sample should be an accurate representative of the universe from which it is taken. There are different methods for selecting a sample. It will be truly representative only when it represents all types of units or groups in the total population in fair proportions. In brief, the sample should be selected carefully, as improper sampling is a source of error in the survey.
3. Proportional: A sample should be proportional. It should be large enough to represent the universe properly. The sample size should be sufficiently large to provide statistical stability or reliability, and should give the accuracy required for the purpose of the particular study.
4. Random selection: A sample should be selected at random. This means that any item in
the group has a full and equal chance of being selected and included in the sample. This
makes the selected sample truly representative in character.
5. Economical: A sample should be economical. The objectives of the survey should be
achieved with minimum cost and effort.
6. Practical: A sample design should be practical. The sample design should be simple
i.e. it should be capable of being understood and followed in the fieldwork.
7. Actual information provider: A sample should be designed so as to provide
actual information required for the study and also provide an adequate basis for the
measurement of its own reliability.
Sample Size:
In statistics, a sample is a part of the total population. You can use the data from a sample to make inferences about a population as a whole. For example, the standard deviation of a sample can be used to approximate the standard deviation of a population. Finding a sample size can be one of the most challenging tasks in statistics and depends upon many factors, including the size of your original population.
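
As a rough illustration of using a sample to approximate a population value, the sketch below simulates a hypothetical population of test scores (the mean of 60, standard deviation of 10, population size and sample size are all assumptions) and compares the two standard deviations:

```python
import random
from statistics import pstdev, stdev

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 test scores, mean 60, standard deviation 10.
population = [rng.gauss(60, 10) for _ in range(10_000)]

# Draw a simple random sample of 500 scores.
sample = rng.sample(population, 500)

population_sd = pstdev(population)  # standard deviation of the whole population
sample_sd = stdev(sample)           # sample estimate (divides by n - 1)

print(f"population SD = {population_sd:.2f}, sample SD = {sample_sd:.2f}")
```

With a random sample of this size the two figures should be close, although they will not normally be identical.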
General Tips

Conduct a census if you have a small population. A “small” population will depend on your
budget and time constraints. For example, it may take a day to take a census of a student body
at a small private university of 1,000 students but you may not have the time to survey 10,000
students at a large state university.

Use a sample size from a similar study. Chances are, your type of study has already been
undertaken by someone else. You’ll need access to academic databases to search for a study
(usually your school or college will have access). A pitfall: you’ll be relying on someone else
correctly calculating the sample size. Any errors they have made in their calculations will
transfer over to your study.

Use a table to find your sample size. If you have a fairly generic study, then there is probably
a table for it.

Use a sample size calculator. Various calculators are available online, some simple, some more complex and specialized. For example, some calculators are designed specifically for group- or cluster-randomized trials (GRTs).

Use a formula. There are many different formulas you can use, depending on what you know (or
don’t know) about your population.
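
The notes do not name a specific formula. As one widely used example, Cochran's formula for estimating a proportion, with a finite population correction, can be sketched in Python. The function name and the default values (95% confidence, z = 1.96; assumed proportion p = 0.5; margin of error e = 0.05) are illustrative assumptions.

```python
import math

def cochran_sample_size(z=1.96, p=0.5, e=0.05, population=None):
    """Sample size for estimating a proportion (Cochran's formula).

    z: z-score for the chosen confidence level (1.96 for about 95%)
    p: assumed proportion (0.5 gives the largest, most cautious size)
    e: acceptable margin of error
    population: population size N, if known, for the finite correction
    """
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    if population is not None:
        # Finite population correction: n = n0 / (1 + (n0 - 1) / N)
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

print(cochran_sample_size())                    # very large population
print(cochran_sample_size(population=1_000))    # small university
print(cochran_sample_size(population=10_000))   # large university
```

Under these assumptions the required sample is 385 for a very large population, 278 for a population of 1,000 and 370 for a population of 10,000, so the required size grows much more slowly than the population itself.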

Sampling definition: Sampling is a technique of selecting individual members or a subset of the population to make statistical inferences from them and estimate characteristics of the whole population. Different sampling methods are widely used by researchers so that they do not need to research the entire population to collect actionable insights. It is also a time-saving and cost-effective method and hence forms the basis of any research design.
For example, if a drug manufacturer would like to research the adverse side effects of a drug on the country’s population, it is almost impossible to conduct a research study that involves everyone. In this case, the researcher selects a sample of people from each demographic and then researches them, which gives indicative feedback on the drug’s behavior.

Types of sampling: sampling methods

Sampling is of two types – probability sampling and non-probability sampling. Let’s take a
closer look at these two methods of sampling.
1. Probability sampling: Probability sampling is a sampling technique where a researcher sets a few selection criteria and chooses members of a population randomly. With this selection parameter, all the members have an equal opportunity to be a part of the sample.
2. Non-probability sampling: In non-probability sampling, the researcher chooses members for research based on judgment or convenience rather than at random. This sampling method is not a fixed or predefined selection process, which makes it difficult for all elements of a population to have equal opportunities to be included in a sample.

Probability sampling

Probability sampling is a sampling technique in which researchers choose samples from a larger population using a method based on the theory of probability. This sampling method considers every member of the population and forms samples based on a fixed process.
For example, in a population of 1000 members, every member will have a 1/1000 chance of being picked on any single random draw. Probability sampling reduces selection bias and gives all members a fair chance to be included in the sample.
There are four types of probability sampling techniques:
• Simple random sampling: One of the best probability sampling techniques that helps in saving time and resources is the simple random sampling method. It is a reliable method of obtaining information where every single member of the sample is chosen randomly, merely by chance. Each individual has the same probability of being chosen to be a part of a sample.
For example, in an organization of 500 employees, if the HR team decides on conducting team building activities, it is highly likely that they would prefer picking chits out of a bowl. In this case, each of the 500 employees has an equal opportunity of being selected.
• Cluster sampling: Cluster sampling is a method where the researchers divide the entire population into sections or clusters that represent a population. Clusters are identified based on demographic parameters like age, sex, location, etc., and a selection of whole clusters is then included in the sample. This makes it very simple for a survey creator to derive effective inference from the feedback.
For example, if the United States government wishes to evaluate the number of immigrants living in the country, they can divide it into clusters based on states such as California, Texas, Florida, Massachusetts, Colorado, Hawaii, etc. This way of conducting a survey will be more effective as the results will be organized into states and provide insightful immigration data.
• Systematic sampling: Researchers use the systematic sampling method to choose the sample members of a population at regular intervals. It requires the selection of a starting point and a fixed interval, determined by the sample size, at which members are picked. Because the interval is predefined, this sampling technique is the least time-consuming.
For example, a researcher intends to collect a systematic sample of 500 people in a population of 5000. He/she numbers each element of the population from 1 to 5000 and chooses every 10th individual to be a part of the sample (Total population / Sample size = 5000/500 = 10).
• Stratified random sampling: Stratified random sampling is a method in which the researcher divides the population into smaller groups that don’t overlap but represent the entire population. The researcher then draws a sample from each group separately.
For example, a researcher looking to analyze the characteristics of people belonging to different annual income divisions will create strata (groups) according to the annual family income, e.g. less than $20,000, $21,000 to $30,000, $31,000 to $40,000, $41,000 to $50,000, etc. By doing this, the researcher can draw conclusions about the characteristics of people belonging to different income groups. Marketers can analyze which income groups to target and which ones to eliminate to create a roadmap that would bear fruitful results.
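
The four techniques can be compared side by side in a short Python sketch. The population, the state and income-band labels and the sample sizes are hypothetical, and the cluster example keeps every member of the randomly chosen clusters, which is one common form of cluster sampling.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 5000 numbered people, each tagged with a
# state (used as a cluster) and an income band (used as a stratum).
states = ["California", "Texas", "Florida", "Colorado"]
bands = ["low", "middle", "high"]
population = [
    {"id": i, "state": states[i % 4], "band": bands[i % 3]}
    for i in range(1, 5001)
]

# Simple random sampling: every member has the same chance of selection.
simple = rng.sample(population, 500)

# Systematic sampling: interval k = population size / sample size,
# starting from a randomly chosen position within the first interval.
k = len(population) // 500           # 5000 / 500 = 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified random sampling: a separate random sample from each stratum.
stratified = []
for band in bands:
    stratum = [person for person in population if person["band"] == band]
    stratified.extend(rng.sample(stratum, 50))

# Cluster sampling: choose whole clusters at random, keep all their members.
chosen_states = rng.sample(states, 2)
cluster = [person for person in population if person["state"] in chosen_states]

print(len(simple), len(systematic), len(stratified), len(cluster))
```

Only the systematic and cluster samples need a single random choice (the starting point, or the clusters), which is why they are quick to draw in the field.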
Uses of probability sampling

There are multiple uses of probability sampling:

• Reduce Sample Bias: Using the probability sampling method, the bias in the sample derived from a population is kept to a minimum. The sample is selected by a random process rather than by the researcher’s personal judgment. Probability sampling leads to higher quality data collection as the sample appropriately represents the population.
• Diverse Population: When the population is vast and diverse, it is essential to have adequate representation so that the data is not skewed towards one demographic. For example, if Square would like to understand the people that could use their point-of-sale devices, a survey conducted on a sample of people across the US from different industries and socio-economic backgrounds helps.
• Create an Accurate Sample: Probability sampling helps the researchers plan and create an accurate sample. This helps to obtain well-defined data.
Types of non-probability sampling with examples
The non-probability method is a sampling method that involves a collection of feedback based on a researcher or statistician’s sample selection capabilities and not on a fixed selection process. In most situations, the output of a survey conducted with a non-probability sample leads to skewed results, which may not represent the desired target population. But there are situations, such as the preliminary stages of research or cost constraints for conducting research, where non-probability sampling will be much more useful than the other type.
Four types of non-probability sampling explain the purpose of this sampling method in a better manner:
• Convenience sampling: This method is dependent on the ease of access to subjects, such as surveying customers at a mall or passers-by on a busy street. It is termed convenience sampling because of the researcher’s ease of carrying it out and getting in touch with the subjects. Researchers have nearly no authority to select the sample elements, and it’s purely done based on proximity and not representativeness. This non-probability sampling method is used when there are time and cost limitations in collecting feedback, for example in the initial stages of research.
For example, startups and NGOs usually conduct convenience sampling at a mall to distribute leaflets of upcoming events or promotion of a cause. They do that by standing at the mall entrance and giving out pamphlets to whoever passes by.
• Judgmental or purposive sampling: Judgmental or purposive samples are formed at the discretion of the researcher. Researchers purely consider the purpose of the study, along with the understanding of the target audience. For instance, when researchers want to understand the thought process of people interested in studying for their master’s degree, the selection criterion will be: “Are you interested in doing your masters in …?” and those who respond with a “No” are excluded from the sample.
• Snowball sampling: Snowball sampling is a sampling method that researchers apply when the subjects are difficult to trace. For example, it will be extremely challenging to survey homeless people or illegal immigrants. In such cases, using the snowball theory, researchers can track down a few subjects to interview and ask them to refer others. Researchers also implement this sampling method in situations where the topic is highly sensitive and not openly discussed, for example surveys to gather information about HIV/AIDS. Not many victims will readily respond to the questions. Still, researchers can contact people they might know or volunteers associated with the cause to get in touch with the victims and collect information.
• Quota sampling: In quota sampling, the selection of members happens based on a pre-set standard. In this case, as a sample is formed based on specific attributes, the created sample will have the same qualities found in the total population. It is a rapid method of collecting samples.
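
Quota sampling is the easiest of these to express as a procedure. In the Python sketch below, the function name, the passers-by and the quotas are all hypothetical; respondents are taken in the order they are met (not at random) until each pre-set quota is full.

```python
def quota_sample(stream, quotas, key):
    """Fill each quota with the first respondents encountered who match it."""
    remaining = dict(quotas)
    selected = []
    for person in stream:
        group = person[key]
        if remaining.get(group, 0) > 0:
            selected.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # every quota is full, so stop collecting
    return selected

# Hypothetical passers-by, in the order the researcher meets them.
passers_by = [
    {"name": "A", "gender": "female"},
    {"name": "B", "gender": "male"},
    {"name": "C", "gender": "male"},
    {"name": "D", "gender": "male"},
    {"name": "E", "gender": "female"},
    {"name": "F", "gender": "female"},
]

sample = quota_sample(passers_by, {"female": 2, "male": 2}, key="gender")
print([person["name"] for person in sample])
```

Person D is skipped because the male quota is already full, and person F is never approached. The sample matches the population on the quota attribute, but who ends up in it still depends on who happened to pass by first.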
Uses of non-probability sampling

Non-probability sampling is used for the following:

• Create a hypothesis: Researchers use the non-probability sampling method to create an assumption when limited to no prior information is available. This method helps with the immediate return of data and builds a base for further research.
• Exploratory research: Researchers use this sampling technique widely when conducting qualitative research, pilot studies, or exploratory research.
• Budget and time constraints: The non-probability method is used when there are budget and time constraints and some preliminary data must be collected. Since the survey design is not rigid, it is easier to pick readily available respondents and have them take the survey or questionnaire.

How do you decide on the type of sampling to use?

For any research, it is essential to choose a sampling method accurately to meet the goals of your study. The effectiveness of your sampling relies on various factors. Here are some steps expert researchers follow to decide the best sampling method.
• Jot down the research goals. Generally, they will be a combination of cost, precision, or accuracy.
• Identify the effective sampling techniques that might potentially achieve the research goals.
• Test each of these methods and examine whether they help in achieving your goal.
• Select the method that works best for the research.

Difference between probability sampling and non-probability sampling methods

We have looked at the different types of sampling methods above and their subtypes. To
encapsulate the whole discussion, though, the significant differences between probability
sampling methods and non-probability sampling methods are as below:

Basis: Definition
Probability sampling methods: Probability sampling is a sampling technique in which samples from a
larger population are chosen using a method based on the theory of probability.
Non-probability sampling methods: Non-probability sampling is a sampling technique in which the
researcher selects samples based on the researcher’s subjective judgment rather than random
selection.

Basis: Alternatively known as
Probability sampling methods: Random sampling method.
Non-probability sampling methods: Non-random sampling method.

Basis: Population selection
Probability sampling methods: The population is selected randomly.
Non-probability sampling methods: The population is selected arbitrarily.

Basis: Nature
Probability sampling methods: The research is conclusive.
Non-probability sampling methods: The research is exploratory.

Basis: Sample
Probability sampling methods: Since there is a method for deciding the sample, the population
demographics are conclusively represented.
Non-probability sampling methods: Since the sampling method is arbitrary, the representation of
population demographics is almost always skewed.

Basis: Time taken
Probability sampling methods: Takes longer to conduct since the research design defines the
selection parameters before the market research study begins.
Non-probability sampling methods: This type of sampling method is quick since neither the sample
nor the selection criteria of the sample are defined in advance.

Basis: Results
Probability sampling methods: This type of sampling is entirely unbiased and hence the results are
unbiased too and conclusive.
Non-probability sampling methods: This type of sampling is entirely biased and hence the results
are biased too, rendering the research speculative.

Basis: Hypothesis
Probability sampling methods: In probability sampling, there is an underlying hypothesis before the
study begins and the objective of this method is to prove the hypothesis.
Non-probability sampling methods: In non-probability sampling, the hypothesis is derived after
conducting the research study.

CLASSIFICATION OF DATA

Introduction:

The data that are unorganized or have not been arranged in any way are called raw data. The
ungrouped data are often voluminous, complex to handle and hardly useful to draw any vital
decisions. Hence, it is essential to rearrange the elements of the raw data set in a specific pattern.
Further, it is important that such data must be presented in a condensed form and must be
classified according to homogeneity for the purpose of analysis and interpretation. An
arrangement of raw data in an order of magnitude or in a sequence is called an array. Specifically,
an arrangement of observations in an ascending or a descending order of magnitude is said to be
an ordered array.

Classification is the process of arranging the primary data in a definite pattern and presenting them in a
systematic form. Horace Secrist defined classification as the process of arranging the data into
sequences and groups according to their common characteristics or separating them into different
but related parts. It is treated as the process of classifying the elements of observations or things
into different groups or classes or sequences according to the resemblances and similarities of
their character. It is also defined as the process of dividing the data into different groups or
classes which are as homogeneous as possible within the groups or classes, but heterogeneous
between themselves.
Objectives of Classification

Classification of data has manifold objectives. The salient features among them are the following:

1. It explains the features of the data.


2. It facilitates comparison with similar data.
3. It strikes a note of homogeneity in the heterogeneous elements of the collected
information.
4. It explains the similarities which may exist in the diversity of data points.
5. It is required to condense the mass data in such a manner that the similarities and
dissimilarities are understood.
6. It reduces the complexity of the data and makes the data easier to comprehend.
7. It enables proper utilization of data for further statistical treatment.
Types of Classification:

The raw data can be classified in various ways depending on the nature of data. The general
types of classification are:
(i) Classification by Time or Chronological Classification
(ii) Classification by Space or Spatial Classification
(iii) Classification by Attribute or Qualitative Classification
(iv) Classification by Size or Quantitative Classification.

Each of these types is now described.

(i) Classification by Time or Chronological Classification


The method of classifying data according to time component is known as classification by time
or chronological classification. In this type of classification, the groups or classes are arranged
either in the ascending order or in the descending order with reference to time such as years,
quarters, months, weeks, days, etc. Illustrations for statistical data to be classified under this type
are listed below:
 Number of new schools established in Gujarat during 2005 – 2020
 Pass percentage of students in SSLC Board Examinations over a period of past 5 years
 Index of market prices in stock exchanges arranged day-wise
 Month-wise salary particulars of employees in an industry
 Particulars of outpatients in a Primary Health Centre presented day-wise.
(ii) Classification by Space (Spatial) or Geographical Classification
The method of classifying data with reference to geographical location such as countries, states,
cities, districts, etc., is called classification by space or spatial classification. It is also termed as
geographical classification. The following are some examples:
 Number of school students in rural and urban areas in a State
 Region-wise literacy rate in a state
 State-wise crop production in India
 Country-wise growth rate in South East Asia

(iii) Classification by Attributes or Qualitative classification

The method of classifying statistical data on the basis of attribute is said to be classification by
attributes or qualitative classification. Examples of attributes include nationality, religion,
gender, marital status, literacy and so on.

Classification according to attributes is of two kinds: simple classification and manifold
classification.
In simple classification the raw data are classified by a single attribute. All those units in which a
particular characteristic is present are placed in one group and others are placed in another group.
The classification of individuals according to literacy, gender, economic status would come
under simple classification.
In manifold classification, two or more attributes are considered simultaneously. When more
attributes are involved, the data would be classified into several classes and subclasses depending
on the number of attributes. For example, population in a country can be classified in terms of
gender as male and female. These two sub-classes may be further classified in terms of literacy
as literate and illiterate.
While classifying the data according to attributes, it is essential to ensure that the attributes
involved have to be defined without ambiguity. For example, while classifying income groups,
the investigator has to define carefully the different non-overlapping income groups.

(iv) Classification by Size or Quantitative Classification

When the characteristics are measured on numerical scale, they may be classified on the basis of
their magnitude. Such a classification is known as classification by size or quantitative
classification. For example, data relating to characteristics such as height, weight, age,
income, marks of students, production and consumption, etc., which are quantitative in nature,
come under this category.
Rules for Classification of Data
There are certain rules to be followed for classifying the data which are given below.
 The classes must be exhaustive, i.e., it should be possible to include each of the data
points in one or the other group or class.
 The classes must be mutually exclusive, i.e., there should not be any overlapping.
 It must be ensured that the number of classes is neither too large nor too small.
Generally, the number of classes may be fixed between 4 and 15.
 The magnitude or width of all the classes should be equal in the entire classification.
 The system of open end classes may be avoided.
Tabulation

A logical step after classifying the statistical data is to present them in the form of tables. A table
is a systematic organization of statistical data in rows and columns. The main objective of
tabulation is to answer various queries concerning the investigation. Tables are very helpful
while carrying out the analysis of collected data and subsequently for drawing inferences from
them. It is considered as the final stage in the compilation of data and forms the basis for its
further statistical treatment.
Advantages of Tabulation
 It is a logical step of presenting statistical data after classification.
 It enables the reader to understand the required information with ease as the
information is contained in rows and columns with figures.
 It enables the investigator to present the data in a brief or condensed and compact form.
 Comparison is made simple by displaying data to be compared in a single table.
 It is easy to remember the data points or items if they are properly placed in the form
of table, as it provides a kind of visual aid.
 It facilitates easy computation and helps easy detection of errors and omissions.
 It enables the reader to refer to data presented in a manner that is suitable for
further statistical treatment and for drawing valid conclusions.
Types of Tables

Statistical tables can be classified under two general categories, namely, general tables and
summary tables.

General tables contain a collection of detailed information including all that is relevant to the
subject or theme. The main purpose of such tables is to present all the information available on a
certain problem at one place for easy reference and they are usually placed in the appendices of
reports.

Summary tables are designed to serve some specific purposes. They are smaller in size than
general tables, emphasize some aspect of the data and are generally incorporated within the
text. The summary tables are also called derivative tables because they are derived from the
general tables. The information contained in the summary table aims at analysis and inference.
Hence, they are also known as interpretative tables.

The statistical tables may further be classified into two broad classes namely simple tables and
complex tables. A simple table summarizes information on a single characteristic and is also
called a univariate table.

The marks secured by a batch of students in a class test are displayed in the following table.

This table is based on a single characteristic namely marks and from this table one may observe
the number of students in each class of marks. The questions such as the number of students
scored in the range 50 – 60, the maximum number of students in a specific range of marks and so
on can be determined from this table.
A complex table summarizes the complicated information and presents them into two or more
interrelated categories. For example, if there are two coordinate factors, the table is called a two-
way table or bi-variate table; if the number of coordinate groups is three, it is a case of three-way
tabulation, and if it is based on more than three coordinate groups, the table is known as higher
order tabulation or a manifold tabulation.
Table shown below is an illustration for a two-way table, in which there are two characteristics,
namely, marks secured by the students in the test and the gender of the students. The table
provides information relating to two interrelated characteristics, such as marks and gender of
students. It is observed from the table that 26 students have scored marks in the range 40 – 50
and among these students, 16 are males and 10 are females.

The table below is an example of a three-way table with three factors, namely, marks, gender
and location.

From this table, one may get information relating to the distribution of students according to marks,
gender and the geographical location from where they hail.

Components of a Table
Generally, a table should comprise the following components:
1. Table number and title
2. Stub (the headings of rows)
3. Caption (the headings of columns)
4. Body of the table
5. Foot notes

6. Sources of data.
1. Table Number and Title: Each table should be identified by a number given at the top. It
should also have an appropriate short and self-explanatory title indicating what exactly the table
presents.
2. Stub: Stubs stand for brief and self-explanatory headings of rows.
3. Caption: Caption stands for brief and self-explanatory headings of columns. It may involve
headings and sub-headings as well.
4. Body of the Table: The body of the table should provide the numerical information in
different cells.
5. Foot Note: The explanatory notes should be given as foot notes and must be complete in order
to understand them at a later stage.
6. Source of Data: It is customary to provide the source of data to enable the user to refer to the
original data. The source of data may be provided in a foot note at the bottom of the table.
A typical format of a table is given below:

General Precautions for Tabulation

The following points may be considered while constructing statistical tables:


 A table must be as precise as possible and easy to understand.
 It must be free from ambiguity so that main characteristics from the data can be easily
brought out.
 Presenting a mass of data in a single table should be avoided. Displaying all the data in a single
table would increase the chances of mistakes and would make the table
unwieldy. Such data may be presented in more than one table such that each table is
complete and serves its purpose.
 Figures presented in columns for comparison must be placed as near to each other as
possible. Percentages, totals and averages must be kept close to each other. Totals to be
compared may be given in bold type wherever necessary.
 Each table should have an appropriate short and self-explanatory title indicating what
exactly the table presents.
 The main headings and subheadings must be properly placed.
 The source of the data must be indicated in the footnote.
 The explanatory notes should always be given as footnotes and must be complete in order to
understand them at a later stage.
 The column or row heads should indicate the units of measurements such as monetary units
like Rupees, and other units such as meters, etc. wherever necessary.
 Column heading may be numbered for comparison purposes. Items may be arranged either in
the order of their magnitude or in alphabetical, geographical, and chronological or in any
other suitable arrangement for meaningful presentation.
 Figures as accurate as possible are to be entered in a table. If the figures are approximate, the
same may be properly indicated.
Frequency Distribution
A tabular arrangement of raw data by a certain number of classes and the number of items
(called frequency) belonging to each class is termed as a frequency distribution. The frequency
distributions are of two types, namely, discrete frequency distribution and continuous frequency
distribution.
Discrete Frequency Distribution

Raw data sometimes may contain a limited number of distinct values, each of which appears many
times. Such data may be organized in a tabular form termed a simple frequency
distribution. Thus the tabular arrangement of the data values along with their frequencies is a
simple frequency distribution. A simple frequency distribution is formed using a tool called a ‘tally
chart’. A tally chart is constructed using the following method:
 Examine each data value.

 Record the occurrence of the value with the slash symbol (/), called tally bar or tally mark.
 When four tally marks have already been recorded for a value, record the fifth as a crossbar
drawn across the four, making a block of 5 tally bars.
 Find the frequency of the data value as the total number of tally bars i.e., tally marks
corresponding to that value.
Example

The marks obtained by 25 students in a test are given as follows: 10, 20, 20, 30, 40, 25, 25, 30, 40,
20, 25, 25, 50, 15, 25, 30, 40, 50, 40, 50, 30, 25, 25, 15 and 40. The following discrete frequency
distribution represents the given data:
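The counting step can be sketched in Python as follows. This is a minimal illustration, not the original table: the function names are hypothetical, and the string `||||/` is only a plain-text stand-in for a crossed block of five tally bars.

```python
from collections import Counter

# Marks of the 25 students, as listed in the example above.
marks = [10, 20, 20, 30, 40, 25, 25, 30, 40, 20, 25, 25, 50,
         15, 25, 30, 40, 50, 40, 50, 30, 25, 25, 15, 40]


def discrete_frequency_distribution(values):
    """Return {value: frequency}, ordered by value."""
    return dict(sorted(Counter(values).items()))


def tally(frequency):
    """Render a frequency as tally marks grouped in blocks of five."""
    blocks, rest = divmod(frequency, 5)
    parts = ["||||/"] * blocks
    if rest:
        parts.append("|" * rest)
    return " ".join(parts)


distribution = discrete_frequency_distribution(marks)
for value, frequency in distribution.items():
    print(f"{value:>5}  {tally(frequency):<12} {frequency:>3}")
print(f"Total  {sum(distribution.values())}")
```

The frequencies add up to 25, the number of students, which is a quick check that no observation was missed or counted twice.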

Continuous Frequency Distribution:

It is necessary to summarize and present large masses of data so that important facts from the
data could be extracted for effective decisions. A large mass of data that is summarized in such a
way that the data values are distributed into groups, or classes, or categories along with the
frequencies is known as a continuous or grouped frequency distribution.
Example:

Table displays the number of orders for supply of machineries received by an industrial plant
each week over a period of one year.

This table is a grouped frequency distribution in which the number of orders are given as classes
and number of weeks as frequencies.
Some terminologies related to a frequency distribution are given below.
Class: If the observations of a data set are divided into groups and the groups are bounded by
limits, then each group is called a class.
Class limits: The end values of a class are called class limits. The smaller value of the class
limits is called lower limit (L) and the larger value is called the upper limit(U).
Class interval: The difference between the upper limit and the lower limit is called class interval
(I). That is, I = U – L.
Class boundaries: Class boundaries are the midpoints between the upper limit of a class and the
lower limit of its succeeding class in the sequence. Therefore, each class has an upper and a lower
boundary.
Width: The width of a particular class is the difference between its upper class boundary and lower
class boundary.
Mid-point: The mid-point of a class is half of the sum of its upper class boundary and lower class
boundary (equivalently, the average of its class limits).
In the above example, the interval 0 - 4 is a class with 0 as the lower limit and 4 as the
upper limit. The upper boundary of this class is obtained as the midpoint of the upper limit of this
class and the lower limit of its succeeding class. Thus the upper boundary of the class 0 - 4 is
(4 + 5)/2 = 4.5. The lower boundary of this class is 0 - 0.5 = -0.5. The lower boundary of the
class 5 - 9 is clearly 4.5. Similarly, the other boundaries of different classes can be found. The
width of each class is 5, and the mid-point of the class 0 - 4 is (-0.5 + 4.5)/2 = 2.
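The terms defined above can be checked with a short Python sketch. The classes 0 - 4 and 5 - 9 come from the example; the third class 10 - 14 and the function names are assumptions added only for illustration, and the sketch assumes the same gap between every pair of successive classes.

```python
def class_boundaries(classes):
    """Convert inclusive class limits into class boundaries.

    Assumes the gap between the upper limit of a class and the lower
    limit of the next class is the same throughout.
    """
    gap = classes[1][0] - classes[0][1]   # e.g. 5 - 4 = 1
    half_gap = gap / 2
    return [(lower - half_gap, upper + half_gap) for lower, upper in classes]


def width(boundary):
    """Width = upper class boundary - lower class boundary."""
    lower, upper = boundary
    return upper - lower


def mid_point(boundary):
    """Mid-point = half of the sum of the two class boundaries."""
    lower, upper = boundary
    return (lower + upper) / 2


# 0 - 4 and 5 - 9 are from the example; 10 - 14 is an assumed continuation.
limits = [(0, 4), (5, 9), (10, 14)]
boundaries = class_boundaries(limits)

for (lo, hi), b in zip(limits, boundaries):
    print(f"{lo} - {hi}: boundaries {b}, width {width(b)}, mid-point {mid_point(b)}")
```

For the class 0 - 4 this gives boundaries -0.5 and 4.5, width 5 and mid-point 2, in agreement with the values worked out in the text.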
Formation of frequency distribution is usually done by two different methods, namely inclusive
method and exclusive method.
Inclusive method

In this method, both the lower and upper class limits are included in the classes. Inclusive type of
classification may be used for a grouped frequency distribution of a discrete variable such as the
number of members in a family, number of workers, etc. It cannot be used in the case of a continuous
variable such as height, weight, etc., where integral as well as fractional values are permissible.
Since both the upper limit and the lower limit of classes are included for frequency calculation, this
method is called the inclusive method.
Exclusive method

In this method, the values which are equal to upper limit of a class are not included in that class
and instead they would be included in the next class. The upper limit is not at all taken into
consideration or in other words it is always excluded from the consideration. Hence this method
is called exclusive method.
Example

The marks scored by 50 students in an examination are given as follows:


23, 25, 36, 39, 37, 41, 42, 22, 26, 35, 34, 30, 29, 27, 47, 40, 31, 32, 43, 45, 34, 46, 23, 24, 27, 36,
41, 43, 39, 38, 28, 32, 42, 33, 46, 23, 34, 41, 40, 30, 45, 42, 39, 37, 38, 42, 44, 46, 29, 37.

It can be observed from this data set that the marks of 50 students vary from 22 to 47. If it is
decided to divide this group into 6 smaller groups, we can have the boundary lines fixed as 25,
30, 35, 40, 45 and 50 marks. Then, we form the six groups 21 - 25, 26 - 30, 31 - 35, 36 - 40,
41 - 45 and 46 - 50.
The continuous frequency distributions formed by the inclusive and exclusive methods are displayed
in tables, respectively.
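The two methods can be compared on the 50 marks with a short Python sketch. The inclusive classes are the six groups named above; the exclusive classes 20 - 25, 25 - 30, ..., 45 - 50 are an assumption made here for illustration, since the original tables are not reproduced in this text. The function names are hypothetical.

```python
# Marks of the 50 students, as listed in the example above.
marks = [23, 25, 36, 39, 37, 41, 42, 22, 26, 35, 34, 30, 29, 27, 47, 40, 31,
         32, 43, 45, 34, 46, 23, 24, 27, 36, 41, 43, 39, 38, 28, 32, 42, 33,
         46, 23, 34, 41, 40, 30, 45, 42, 39, 37, 38, 42, 44, 46, 29, 37]

inclusive_classes = [(21, 25), (26, 30), (31, 35), (36, 40), (41, 45), (46, 50)]
# Assumed exclusive classes: the upper limit belongs to the next class.
exclusive_classes = [(20, 25), (25, 30), (30, 35), (35, 40), (40, 45), (45, 50)]


def inclusive_frequencies(values, classes):
    """Count values with lower <= value <= upper for each class."""
    return [sum(lower <= v <= upper for v in values) for lower, upper in classes]


def exclusive_frequencies(values, classes):
    """Count values with lower <= value < upper for each class."""
    return [sum(lower <= v < upper for v in values) for lower, upper in classes]


inclusive = inclusive_frequencies(marks, inclusive_classes)
exclusive = exclusive_frequencies(marks, exclusive_classes)

for (lo, hi), f in zip(inclusive_classes, inclusive):
    print(f"inclusive {lo} - {hi}: {f}")
for (lo, hi), f in zip(exclusive_classes, exclusive):
    print(f"exclusive {lo} - {hi}: {f}")
```

Both distributions total 50. A mark of 25 is counted in the class 21 - 25 under the inclusive method but in the class 25 - 30 under the exclusive method, which is why the frequencies differ.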

True class intervals

In the case of continuous variables, we take the classes in such a way that there is no gap
between successive classes. The classes are defined in such a way that the upper limit of each
class is equal to the lower limit of the succeeding class. Such classes are known as true classes. The
classes formed by the inclusive method are known as not-true classes. We can convert
the not-true classes into true classes by subtracting 0.5 from the lower limit and
adding 0.5 to the upper limit of each class, giving 20.5 - 25.5, 25.5 - 30.5, 30.5 - 35.5,
35.5 - 40.5, 40.5 - 45.5 and 45.5 - 50.5.

Open End Classes

When a class limit is missing either at the lower end of the first class interval or at the upper end
of the last class interval, or when the limits are not specified at both the ends, the frequency
distribution is said to be a frequency distribution with open end classes.
Example

The salaries received by 113 workers in a factory are classified into 6 classes. The classes and their
frequencies are displayed in the table below. Since the lower limit of the first class and the upper
limit of the last class are not specified, they are open end classes.

Guidelines on Compilation of Continuous Frequency Distribution

The following guidelines may be followed for compiling the continuous frequency distribution.
 The values given in the data set must be contained within one (and only one) class and
overlapping classes must not occur.
 The classes must be arranged in the order of their magnitude.
 Normally a frequency distribution may have 8 to 10 classes. It is not desirable to have
fewer than 5 or more than 15 classes.
 Frequency distributions having equal class widths throughout are preferable. When this is
not possible, classes with smaller or larger widths can be used. Open ended classes are
acceptable but only in the first and the last classes of the distribution.
 It should be noted that in a frequency distribution, the first class should contain the
lowest value and the last class should contain the highest value.
 The number of classes may be determined by using Sturges' formula k = 1 + 3.322 log₁₀ N,
where N is the total frequency and k is the number of classes.
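As a quick numerical check of Sturges' formula, the sketch below evaluates k for a few values of N. The function name is hypothetical, and rounding the result up to a whole number of classes is one common convention assumed here, not something the guideline prescribes.

```python
import math


def sturges_classes(n):
    """Sturges' formula: k = 1 + 3.322 * log10(N)."""
    return 1 + 3.322 * math.log10(n)


for n in (50, 100, 1000):
    k = sturges_classes(n)
    # Rounding up is an assumed convention for turning k into a class count.
    print(f"N = {n:>4}: k = {k:.3f}, about {math.ceil(k)} classes")
```

For N = 50, as in the marks example, the formula gives k of about 6.6, so six or seven classes would be a reasonable choice.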
Cumulative Frequency Distribution
Cumulative frequency corresponding to a class interval is defined as the total frequency of all
values less than upper boundary of the class. A tabular arrangement of all cumulative frequencies
together with the corresponding classes is called a cumulative frequency distribution or
cumulative frequency table.
The main difference between a frequency distribution and a cumulative frequency distribution is
that the former describes how many items lie within a particular class interval,
whereas the latter describes the number of items that have values either above or below a
particular level.

There are two forms of cumulative frequency distributions, which are defined as follows:
(i) Less than Cumulative Frequency Distribution: In this type of cumulative frequency
distribution, the cumulative frequency for each class shows the number of elements in the data
whose magnitudes are less than the upper limit of the respective class.
(ii) More than Cumulative Frequency Distribution: In this type of cumulative frequency
distribution, the cumulative frequency for each class shows the number of elements in the data
whose magnitudes are larger than the lower limit of the class.
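Both forms can be computed as running totals, as in the Python sketch below. The frequencies are illustrative: they are the counts obtained when the 50 marks of the earlier example are grouped into the inclusive classes 21 - 25 to 46 - 50, not the orders data of the example that follows.

```python
from itertools import accumulate

classes = ["21 - 25", "26 - 30", "31 - 35", "36 - 40", "41 - 45", "46 - 50"]
frequencies = [6, 8, 8, 12, 12, 4]   # illustrative frequencies, total 50

# Less than type: running total from the first class downwards.
less_than = list(accumulate(frequencies))

# More than type: running total from the last class upwards.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for name, f, lt, mt in zip(classes, frequencies, less_than, more_than):
    print(f"{name}: frequency {f:>2}, less than {lt:>2}, more than {mt:>2}")
```

The last less than cumulative frequency and the first more than cumulative frequency both equal the total frequency, which serves as a check on the arithmetic.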
Example

Construct less than and more than cumulative frequency distribution tables for the following
frequency distribution of orders received by a business firm over a number of weeks during a
year.

Solution:

For the data related to the number of orders received per week by a business firm during a period
of one year, the less than and more than cumulative frequencies are computed and are given in
the table below.

Relative-Cumulative Frequency Distributions

The relative cumulative frequency is defined as the ratio of the cumulative frequency to the total
frequency. The relative cumulative frequency is usually expressed in terms of a percentage. The
arrangement of relative cumulative frequencies against the respective class boundaries is termed
as relative cumulative frequency distribution or percentage cumulative frequency distribution.
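The definition can be sketched in Python as follows. The cumulative frequencies are illustrative values (the less than cumulative frequencies of 50 marks grouped into six classes), not the orders data used in the example below, and the upper class boundaries are assumed accordingly.

```python
upper_boundaries = [25.5, 30.5, 35.5, 40.5, 45.5, 50.5]   # assumed class boundaries
less_than = [6, 14, 22, 34, 46, 50]                       # illustrative cumulative frequencies

total = less_than[-1]   # the last less than cumulative frequency is the total

# Relative cumulative frequency, expressed as a percentage of the total.
relative_percent = [100 * c / total for c in less_than]

for boundary, c, pct in zip(upper_boundaries, less_than, relative_percent):
    print(f"less than {boundary}: cumulative {c:>2}, relative {pct:.0f}%")
```

The final relative cumulative frequency is always 100%, since every observation lies below the upper boundary of the last class.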
Example

For the data given in Example, find the relative cumulative frequencies.
Solution:
For the data given in Example, the less-than and more-than cumulative frequencies are obtained
and given in the above table. The relative cumulative frequency is computed for each class by
dividing the respective class cumulative frequency by the total frequency and is expressed as a
percentage. The cumulative frequencies and relative cumulative frequencies are tabulated in the
table below.

Bivariate Frequency Distributions
It is known that the frequency distribution of a single variable is called univariate distribution.
When a data set consists of a large mass of observations, they may be summarized by using a
two-way table. A two-way table is associated with two variables, say X and Y. For each
variable, a number of classes can be defined keeping in view the same considerations as in the
univariate case. When there are m classes for X and n classes for Y, there will be m × n cells in
the two- way table. The classes of one variable may be arranged horizontally, and the classes of
another variable may be arranged vertically in the two way table. By going through the pairs of
values of X and Y, we can find the frequency for each cell. The whole set of cell frequencies will
then define a bivariate frequency distribution. In other words, a bivariate frequency distribution
is the frequency distribution of two variables.
The following table shows the frequency distribution of two variables, namely, age and marks
obtained by 50 students in an intelligence test. Classes defined for marks are arranged horizontally
(rows) and the classes defined for age are arranged vertically (columns). Each cell shows the
frequency of the corresponding row and column values. For instance, there are 5 students whose
ages fall in the class 20 – 22 years and whose marks lie in the group 30 – 40.
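The way cell frequencies are found by going through the pairs of values can be sketched in Python. Everything in this sketch is hypothetical: the eight (age, marks) pairs, the class limits and the function name are invented for illustration and are not the data of the 50 students in the table. Classes are treated by the exclusive method.

```python
from collections import Counter

# Hypothetical (age, marks) pairs for eight students.
students = [(19, 35), (21, 32), (21, 38), (20, 45),
            (23, 41), (18, 22), (22, 35), (21, 47)]

# Hypothetical classes, exclusive method: lower <= value < upper.
age_classes = [(18, 20), (20, 22), (22, 24)]
mark_classes = [(20, 30), (30, 40), (40, 50)]


def find_class(value, classes):
    """Return the class (lower, upper) that contains the value."""
    for lower, upper in classes:
        if lower <= value < upper:
            return (lower, upper)
    raise ValueError(f"{value} does not fall in any class")


# Each cell is identified by a (marks class, age class) pair.
cells = Counter(
    (find_class(m, mark_classes), find_class(a, age_classes))
    for a, m in students
)

# Rows are marks classes, columns are age classes.
print("marks \\ age " + "  ".join(f"{lo}-{hi}" for lo, hi in age_classes))
for m_class in mark_classes:
    row = [cells[(m_class, a_class)] for a_class in age_classes]
    print(f"{m_class[0]}-{m_class[1]:<8}" + "  ".join(f"{f:>5}" for f in row))
```

With 3 classes for each variable there are 3 × 3 = 9 cells, and the cell frequencies add up to the number of students.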
Multiple Choice Questions

1. Which of the following is a non-probability sampling technique?
(a) Quota sampling
(b) Simple random sampling
(c) Stratified sampling
(d) Systematic sampling
2. Which of the following is a probability sampling technique?
(a) Quota sampling
(b) Judgmental sampling
(c) Convenience sampling
(d) Systematic sampling
3. The roll number assigned to a student is an example of ______ type of data.
(a) Ordinal
(b) Nominal
(c) Interval
(d) Ratio
4. The first, second and fourth person in a race is an example of ______ type of data.
(a) Nominal
(b) Interval
(c) Ordinal
(d) Ratio
5. The temperature of a particular city is an example of ______ type of data.
(a) Nominal
(b) Interval
(c) Ordinal
(d) Ratio
6. The age of a person is an example of ______ type of data.
(a) Nominal
(b) Interval
(c) Ordinal
(d) Ratio
7. Which of the following types of data is used to categorize data and has no
order, distance or unique origin?
(a) Nominal
(b) Interval
(c) Ordinal
(d) Ratio
8. Which of the following types of data is used to rank the data values and has no distance
or unique origin?
(a) Nominal
(b) Interval
(c) Ordinal
(d) Ratio
9. Which of the following types of data has both order and distance but no unique origin?
(a) Nominal
(b) Interval
(c) Ordinal
(d) Ratio
10. Which of the following types of data has all of classification, order, distance and unique
origin?
(a) Nominal
(b) Interval
(c) Ratio
(d) Ordinal
11. Non-sampling errors occur because of
(a) errors in compilation
(b) application of a wrong statistical measure
(c) lapse of memory of the enumerator
(d) all of the above.
12. Which of the following is an example of a discrete variable?
(a) Number of accidents
(b) Height of student
(c) Statistics of rainfall
(d) Weight of student.
13. Which of the following is an example of a continuous variable?
(a) Number of accidents
(b) Height of student
(c) Number of rooms per house in a street
(d) Number of suicides
14. Tabulation is a classification of ______ data type.
(a) qualitative
(b) variable
(c) discrete
(d) quantitative
15. Frequency distribution is a classification of ______ data type.
(a) qualitative
(b) variable
(c) discrete
(d) quantitative
16. Anger and feeling are examples of
(a) continuous variable
(b) discrete variable
(c) qualitative data
(d) quantitative data.
17. To apply the box plot method, a person requires:
(a) Median
(b) Lower and Upper quartiles
(c) Smallest and Largest value
(d) all of the above.
18. In the box plot method, the edges of the box are called ______.
(a) hinges
(b) wings
(c) extremities
(d) outliers
19. Which among the following is the benefit of using simple random sampling?
(a) The results are always representative.
(b) Interviewers can choose respondents freely.
(c) Informants can refuse to participate.
(d) We can calculate the accuracy of the results.
20. Increasing the sample size has the following effect upon the sampling error.
(a) It increases the sampling error
(b) It reduces the sampling error
(c) It has no effect on the sampling error
(d) All of the above
21. The difference between a statistic and the parameter is called:
(a) Non random
(b) Probability
(c) Random
(d) Sampling error
22. In which classification does the upper limit of one class become the lower limit of the next class?
(a) inclusive classification
(b) exclusive classification
(c) open ended classification
(d) none of the above
23. In which classification are the upper limit and lower limit included in the interval itself?
(a) open ended classification
(b) exclusive classification
(c) inclusive classification
(d) none of the above
24. In which classification is the lower limit of the first class and/or the upper limit of the last class
not defined?
(a) open ended classification
(b) exclusive classification
(c) inclusive classification
(d) none of the above
25. The average of the class limits of a class is called ______.
(a) upper limit of the class
(b) lower limit of the class
(c) class length
(d) midpoint of the class

