Stat I Chapter 1 and 2
Introduction to Statistics
Introduction
Statistics is the science that deals with methods for the collection, organization, and analysis of data and
the interpretation of the results. The term statistics can also be defined in its plural sense. In the plural
sense, statistics are collections of numerical facts; in particular, values obtained from sample results are
called statistics. The science of statistics is essential for research and decision processes in all
aspects of human life.
Statistical analysis begins with data collection, and the analysis of the data is then undertaken for
one of the following purposes:
Each time that we record information about an object we observe a case. We might include several
different variables in the same case. For example, we might measure the height, weight, and hair
color of a group of people in an experiment. We would have one case for each person, and that case
would contain that person's height, weight, and hair color values. All of our cases put together are
called our data set. To some people, statistics means summarized data, such as unemployment
figures or the number of runs, hits, and errors in a baseball game. To others, it means a course of
study. Neither description is adequate. In presenting statistics as a method of getting
information from data to help managers make decisions, we will see that statistics comprises
various techniques with a wide range of applications to practical problems. In this section you will
be introduced to the definition of statistics and to the classification and applications of statistical methods.
Definition of Statistics
Statistics means different things to different people, and its definitions have changed from time to time.
We can say there are as many definitions as the number of people who have tried to define the term
Statistics. Some of these definitions are:
Statistics is a branch of mathematics that consists of a set of analytical techniques that can be
applied to data to help in making judgments and decisions in problems involving uncertainty.
Statistics is a scientific discipline consisting of procedures for collecting, describing, analyzing and
interpreting numerical data.
Statistics is a body of principles and methods concerned with extracting useful information from a
set of numerical data.
Statistics is a body of methods dealing with collection, description, analysis, and interpretation of
information that can be given in a numerical form.
Classification of Statistics
Generally statistics can be classified as descriptive and inferential statistics based on their scope of
coverage.
Descriptive Statistics deals with methods of organizing, summarizing, and presenting numerical
data in a convenient form through graphs, charts, tables, etc. It deals with description of the
characteristics of large masses of data. E.g. the computation of average weekly sales for a
business, average number of students in a class, the average mark for a section for Introductory
statistics course, etc. Most of the statistical information in newspapers, magazines, company
reports, and other publications consists of data that are summarized and presented in a form that
is easy for the reader to understand. Such summaries of data, which may be tabular, graphical, or
numerical, are referred to as descriptive statistics.
Inferential statistics consists of a set of procedures that help in making inferences and
predictions about a whole population (a collection of persons, objects, or items of interest)
based on information from a sample (a portion of the whole that, if properly taken, is
representative of the whole). It is a body of methods for drawing conclusions
(that is, making inferences) about characteristics of a population, based on information available
in a sample taken from the population. E.g. using the average mark of one section to estimate
the average mark for ten sections for a given course. Many situations require information about a
large group of elements (individuals, companies, voters, households, products, customers, and so
on). But, because of time, cost, and other considerations, data can be collected from only a small
portion of the group. The larger group of elements in a particular study is called the population,
and the smaller group is called the sample.
NB. While descriptive statistics describes the characteristics of the observed data and helps to
reach conclusions about that same group only, inferential statistics provides methods for making
generalizations about the whole population based on the sample of observed data.
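The distinction can be illustrated with a minimal Python sketch. The population of marks below is hypothetical (randomly generated for illustration only), and the sample size of 50 is an assumption; the point is only that the sample mean is computed from observed data and then used as an estimate of the population mean.

```python
import random
import statistics

# Hypothetical population: marks of 500 students across ten sections.
random.seed(1)  # fixed seed so the sketch is reproducible
population = [random.randint(30, 100) for _ in range(500)]

# Descriptive statistics: summarize the data that were actually observed.
sample = random.sample(population, 50)  # one section-sized random sample
sample_mean = statistics.mean(sample)

# Inferential statistics: use the sample mean as an estimate of the
# population mean, which in practice is usually unknown.
population_mean = statistics.mean(population)
estimation_error = abs(sample_mean - population_mean)

print(f"sample mean      = {sample_mean:.2f}")
print(f"population mean  = {population_mean:.2f}")
print(f"estimation error = {estimation_error:.2f}")
```

In a real study the population mean could not be computed directly; it is shown here only so the estimate can be compared with the value it is estimating.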
Population is the totality of items under observation. It consists of all those items falling into a
defined category; that is, it is a set or collection of all possible observations of some specific
characteristic (usually people, objects, transactions or events). It is frequently large and may
sometimes be indefinitely large. E.g. All students in Ethiopia, all students in Addis Ababa, all
students in Faculty of Business, all students of a given department.
Census is the gathering of information from all elements in a population. It is the study of each
and every element in the population. It is sometimes called complete enumeration.
Parameter is a numerical measure that describes a characteristic of a population. Examples of
parameters are population mean (µ), population variance (σ²), and population standard deviation
(σ).
Sampling is the process of selecting representative elements from a population. It
is gathering information from a part of a population.
Statistic is a numerical measure that describes a characteristic of a sample. Statistics are
usually denoted by lower case Roman letters. Examples of statistics are sample mean (x̄),
sample variance (s²), and sample standard deviation (s).
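As a minimal sketch of the difference between parameters and statistics, the following Python snippet computes both sets of measures with the standard library. The five population values and the three sampled values are hypothetical, chosen only to keep the arithmetic small.

```python
import statistics

# Hypothetical population of five values (illustrative only).
population = [2, 4, 4, 5, 5]

# Parameters: computed from every element of the population.
mu = statistics.mean(population)            # population mean
sigma2 = statistics.pvariance(population)   # population variance (divides by N)
sigma = statistics.pstdev(population)       # population standard deviation

# Statistics: computed from a sample taken from that population.
sample = [2, 4, 5]
x_bar = statistics.mean(sample)             # sample mean
s2 = statistics.variance(sample)            # sample variance (divides by n - 1)
s = statistics.stdev(sample)                # sample standard deviation

print(mu, sigma2, sigma)
print(x_bar, s2, s)
```

Note that the population and sample variances use different divisors, which is why Python provides separate functions for them.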
Variable
A variable is a property of an object or event that can take on different values. For example, college major
is a variable that takes on values like mathematics, computer science, English, psychology, etc.
Quantitative variables: are those for which the value has numerical meaning. The value
refers to a specific amount of some quantity. You can do mathematical operations on the
values of quantitative variables (like taking an average). A good example would be a
person's height.
Qualitative variables: are those for which the value indicates different groupings. Objects
that have the same value on the variable are the same with regard to some characteristic,
but you can't say that one group has "more" or "less" of some feature. It doesn't really
make sense to do math on categorical variables. A good example would be a person's
gender.
Applications of Statistics
1. Marketing: Statistical analysis is frequently used in providing information for making decisions
in the field of marketing. It is necessary first to find out what can be sold and then to evolve a
suitable strategy so that the goods reach the ultimate consumer. A skillful analysis of data
on production, purchasing power, manpower, habits of competitors, habits of consumers, and
transportation costs should be considered before any attempt to establish a new market.
2. Production: In the field of production, statistical data and methods play a very important role. The
decisions about what to produce, how to produce, when to produce, and for whom to produce are
based largely on statistical analysis.
3. Finance: Financial organizations, in discharging their finance function effectively, depend very
heavily on statistical analysis of facts and figures.
4. Banking: Banking institutions have found it increasingly necessary to establish research departments within
their organizations for the purpose of gathering and analyzing information, not only regarding
their own business but also regarding the general economic situation and every segment of
business in which they may have an interest.
5. Investment: Statistics greatly assists investors in making clear and valid judgments in their
investment decisions, in selecting securities which are safe and have the best prospects of
yielding a good income.
6. Purchase: The purchase department, in discharging its function, makes use of statistical data to
frame suitable purchase policies, such as what to buy, what quantity to buy, when to buy,
where to buy, and from whom to buy.
7. Accounting: Statistical data are also employed in accounting, particularly in the auditing function, where the
technique of sampling and estimation is frequently used.
8. Control: The management control process combines statistical and accounting methods in
making the overall budget for the coming year, including sales, materials, labor and other costs,
net profits, and capital requirements.
CHAPTER TWO
Statistical Data
Data are a set of values collected for some purpose. They are raw facts about a phenomenon
which, on their own, do not convey any message. Data are records of the actual state of some measurable
aspects of the universe at a particular point in a given time. They are not abstract but are
concrete, tangible and countable features of a particular aspect.
There are two types of data based on their source. These are primary and secondary data.
Primary data – These are data which are the measurements and records of an original study.
They are collected afresh and for the first time, and thus happen to be
original in character. They are directly measured and recorded from the
source and have not been collected by someone else before.
Secondary Data – In some situations it is not practical for the
principal investigator to start the study from the very beginning. In such a situation he or she may
use and take into consideration what has already been collected by others.
Secondary data are those which have already been collected by someone else and which
have already been passed through some statistical process. When an investigator uses the
data which have already been collected by others, such data are called secondary data.
Secondary data can be taken from journals, reports, periodicals, publications, etc.
Secondary data should be used with greater care. The investigator, before using these data,
must observe that they possess the following characteristics.
1. Reliability of Data: The data collected from other source should be reliable enough
to be used by the investigator. Determining and testing the reliability of secondary
data is the most important as well as difficult task. Reliability can be tested by
answering questions like:
Who collected them?
What were the sources of data?
What methods were used to collect them?
At what time were they collected?
2. Suitability of Data: Before secondary data are used, they must be evaluated to see whether they
can serve a purpose other than the one for which they were collected. The suitability of
data can be evaluated from the point of view of the nature and scope of the investigation.
3. Adequacy of Data: Reliability and suitability of secondary data may not be sufficient
for the investigator to use these data for analysis. Besides these, they should be tested
for adequacy. Adequacy can be tested by evaluating the data in terms of area
coverage, level of accuracy, number of respondents who participated, and so on.
Scales of Measurement
The four generally used scales of measurement are listed here from weakest to strongest.
A. Nominal Scale. In the nominal scale of measurement, numbers are used simply as labels
for groups or classes. If our data set consists of blue, green, and red items, we may
designate blue as 1, green as 2, and red as 3. In this case, the numbers 1, 2, and 3 stand
only for the category to which a data point belongs. "Nominal" stands for "name" of
category. The nominal scale of measurement is used for qualitative rather than
quantitative data: blue, green, red; male, female; professional classification; geographic
classification; and so on.
B. Ordinal Scale. In the ordinal scale of measurement, data elements may be ordered
according to their relative size or quality. Four products ranked by a consumer may be
ranked as 1, 2, 3, and 4, where 4 is the best and 1 is the worst. In this scale of
measurement we do not know how much better one product is than others, only that it is
better.
C. Interval Scale. In the interval scale of measurement the value of zero is assigned
arbitrarily and therefore we cannot take ratios of two measurements. But we can take
ratios of intervals. A good example is how we measure time of day, which is in an
interval scale. We cannot say 10:00 A.M. is twice as long as 5:00 A.M. But we can say
that the interval between 0:00 A.M. (midnight) and 10:00 A.M., which is duration of 10
hours, is twice as long as the interval between 0:00 A.M. and 5:00 A.M., which is
duration of 5 hours. This is because 0:00 A.M. does not mean absence of any time.
Another example is temperature. When we say 0°F, we do not mean zero heat. A
temperature of 100°F is not twice as hot as 50°F.
D. Ratio Scale. If two measurements are in ratio scale, then we can take ratios of those
measurements. The zero in this scale is an absolute zero. Money, for example, is
measured in a ratio scale. A sum of $100 is twice as large as $50. A sum of $0 means
absence of any money and is thus an absolute zero. We have already seen that
measurement of duration (but not time of day) is in a ratio scale. In general, the interval
between two interval scale measurements will be in ratio scale. Other examples of the
ratio scale are measurements of weight, volume, area, or length.
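The difference between the interval and ratio scales can be checked numerically. The sketch below is illustrative only: it converts the Fahrenheit temperatures from the example to Celsius (the conversion function name is our own) and shows that the ratio of two temperatures changes with the unit, while the ratio of two temperature intervals does not.

```python
def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (f - 32) * 5 / 9

# Ratio of two interval-scale measurements: depends on the unit chosen,
# so "100°F is twice as hot as 50°F" has no real meaning.
ratio_f = 100 / 50
ratio_c = fahrenheit_to_celsius(100) / fahrenheit_to_celsius(50)

# Ratio of two intervals (differences measured from 0°F): the same in either unit.
interval_ratio_f = (100 - 0) / (50 - 0)
interval_ratio_c = (
    (fahrenheit_to_celsius(100) - fahrenheit_to_celsius(0))
    / (fahrenheit_to_celsius(50) - fahrenheit_to_celsius(0))
)

print(ratio_f, round(ratio_c, 2))                    # the two ratios disagree
print(interval_ratio_f, round(interval_ratio_c, 2))  # the two ratios agree
```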
Data can also be classified as either qualitative or quantitative. Qualitative data include labels
or names used to identify an attribute of each element. Qualitative data use either the nominal or
ordinal scale of measurement and may be nonnumeric or numeric. Quantitative data require
numeric values that indicate how much or how many. Quantitative data are obtained using either
the interval or ratio scale of measurement.
Methods of Data Collection
Data are records of the actual state of some measurable aspect of the universe at a particular
point in time. Data are not abstract; they are concrete, they are measurements or the tangible and
countable features of the world. In general, data could be quantitative (expressed in numerical
form) or qualitative (expressed in the form of verbal descriptions rather than numbers).
Primary data are those which are collected afresh and for the first time, and thus happen to be
original in character. Its advantage is its relevance to the user, but it is also likely to be expensive
in time and money terms to collect. The primary data can be collected using the following
methods
A. OBSERVATION
Observation is the most commonly used method of data collection especially, in behavioral
studies. This method could be used both for cross checking information obtained using other
methods and for understanding processes which are difficult to grasp in an interview context.
This method is useful when studying subjects who are not capable of giving verbal reports of
their feelings for one reason or another.
1. expensive;
B. Interview
The interview method of collecting data involves presentation of oral-verbal stimuli and reply in
terms of oral-verbal responses. This method can be used through personal interviews and, if
possible, through telephone interviews.
Personal interviews: This method requires a person (interviewer) asking questions in a face-to-
face contact to the interviewee.
If the interview is carried out in a structured way, it is called structured interview. This involves
the use of a set of predetermined questions and highly standardized techniques of recording. The
interviewer in a structured interview follows a rigid procedure laid down, asking questions in a
form and order prescribed. As against it, the unstructured interviews are characterized by a
flexibility of approach to questioning. In unstructured interview, the interviewer is allowed much
greater freedom to ask, in case of need, supplementary questions or at times he may omit certain
questions if the situation so requires. He may even change the sequence of questions. But this
sort of flexibility results in lack of comparability of one interview with another and the analysis
of unstructured responses becomes much more difficult and time consuming than that of the
structured responses obtained in case of structured interviews.
Some of the weaknesses of the personal interview method:
1. It is very expensive, especially when large and widely spread geographical sample is taken
2. The possibility of the bias of interviewer as well as that of the respondent
3. Certain types of respondents may not be easily approachable (eg. Important officials or
executives, people in high income groups)
4. It is relatively more time consuming
5.
Telephone interviews: This method of collecting information consists in contacting respondents
on telephone itself. It is not a very widely used method, but plays important part in industrial
surveys, particularly in developed countries.
This method is quite popular, particularly in case of big inquiries. Service evaluations of hotels,
restaurants, transportation providers, and other service providers are good examples of self-
administered questionnaire. Often a short questionnaire is left to be completed by the respondent
in a convenient location. In a mail survey, a questionnaire can also be sent (usually by post) to
the persons concerned with a request to answer the questions and return the questionnaire.
A questionnaire consists of a number of questions printed or typed in a definite order on a form
or set of forms. The questionnaire is mailed to respondents who are expected to read and
understand the questions and write down the reply in the space meant for the purpose in the
questionnaire itself.
1. it is free from the bias of the interviewer; answers are in respondents’ own words
2. respondents have adequate time to give well thought out answers
3. respondents who are not easily approachable can also be reached conveniently
The main demerits of this system can be:
The use of existing data (secondary data) in a research activity is termed desk research simply
because the person carrying it out can usually gather such data without leaving his/her desk. In
any type of study, it is advisable to assess the availability of secondary data before embarking
upon a primary data collection exercise, since the latter is expensive in terms of time, money and
manpower.
Data Presentation
After data have been collected, the next step is to present them in some convenient way. The
logic behind data presentation is that statistical data in their raw form are difficult to
understand and summarize. When data are presented, the user can understand them in a
meaningful form within a short period of time. Therefore, data presentation is the process of
re-organization, classification, compilation and summarization of data to present them in a
meaningful form.
It is the process of organization of raw data in a table form using classes and frequencies.
There are two types of frequency distributions; these are categorical and grouped
frequency distribution.
The categorical frequency distribution is used for data which are qualitatively described.
The important requirement is that the data can be classified into complete and
non-overlapping categories.
Example: The following are data of employees of organization X by level of education (LOE)
No   Name      LOE
1    Abebe     Diploma
2    Hordofa   B.Sc
3    Toga      M.Sc
4    Kahsay    Ph.D
5    Ahmed     Diploma
6    Hirut     B.Sc
...  ...       ...
50   Kassech   Ph.D
There are 15 workers having a diploma, 20 workers having a B.Sc, 10 workers having an M.Sc and 5
workers having a Ph.D.
LOE NO Percentage
Diploma 15 30%
Bachelor 20 40%
Master 10 20%
Ph.D 5 10%
Total 50 100%
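The categorical frequency distribution above can be reproduced with a short Python sketch. The list of 50 levels is rebuilt from the stated counts (the individual employee records are not all listed), and the variable names are our own.

```python
from collections import Counter

# Levels of education for the 50 employees, rebuilt from the stated counts.
levels = ["Diploma"] * 15 + ["B.Sc"] * 20 + ["M.Sc"] * 10 + ["Ph.D"] * 5

# Frequency of each category and its share of the total.
frequency = Counter(levels)
n = sum(frequency.values())
percentage = {level: 100 * count / n for level, count in frequency.items()}

for level, count in frequency.items():
    print(f"{level:<8} {count:>3} {percentage[level]:>5.0f}%")
print(f"{'Total':<8} {n:>3} {100:>5}%")
```

Because the categories are complete and non-overlapping, the frequencies add up to the number of employees and the percentages add up to 100.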
B. Grouped Frequency Distribution
This is a method of presenting data which is quantitatively measured and when a variable
contains a large volume of raw data. It contains several important concepts such as class
limits, class width, class interval and frequencies. Class limits are classified as lower class
limit and upper class limit.
The steps necessary to define the classes for a frequency distribution with quantitative data are:
Let us demonstrate steps by developing a frequency distribution for the audit time data
A. Number of Classes: Classes are formed by specifying ranges that will be used to group
the data. As a general guideline, we recommend using between 5 and 20 classes. For a
small number of data items, as few as five or six classes may be used to summarize the
data. For a larger number of data items, a larger number of classes is usually required.
The goal is to use enough classes to show the variation in the data, but not so many
classes that some contain only a few data items.
There is also a formula which helps to determine K.
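$K = 1 + 3.322 \log_{10} n$, where K is the number of classes and n is the number of observations.
This formula generally does not give a whole number, so the result is rounded to a whole number.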
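A minimal sketch of this formula in Python is given below. The function name is our own, and rounding to the nearest whole number is an assumption; some texts round up instead.

```python
import math

def number_of_classes(n):
    """Return (raw K, rounded K) for n observations using K = 1 + 3.322 log10(n)."""
    raw_k = 1 + 3.322 * math.log10(n)
    return raw_k, round(raw_k)  # rounding to nearest is an assumed convention

raw_k, k = number_of_classes(20)
print(f"raw K = {raw_k:.3f}, rounded K = {k}")
```

For 20 data items the raw value is about 5.3, which is consistent with the general guideline of using at least five classes.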
B. Width of the Classes: The second step in constructing a frequency distribution for
quantitative data is to choose a width for the classes. As a general guideline, we
recommend that the width be the same for each class. Thus the choices of the number of
classes and the width of classes are not independent decisions. A larger number of classes
means a smaller class width, and vice versa. To determine an approximate class width,
we begin by identifying the largest and smallest data values. Then, with the desired
number of classes specified, we can use the following expression to determine the
approximate class width.
$$\text{Approximate class width} = \frac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}}$$
The approximate class width given by equation can be rounded to a more convenient
value based on the preference of the person developing the frequency distribution. For
example, an approximate class width of 9.28 might be rounded to 10 simply because 10
is a more convenient class width to use in presenting a frequency distribution.
In practice, the number of classes and the appropriate class width are determined by trial
and error. Once a possible number of classes is chosen, the equation is used to find the
approximate class width. The process can be repeated for a different number of classes.
Ultimately, the analyst uses judgment to determine the combination of the number of
classes and class width that provides the best frequency distribution for summarizing the
data. For the audit time data in Table after deciding to use five classes, each with a
width of five days, the next task is to specify the class limits for each of the classes.
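The class width calculation can be sketched as follows. The largest and smallest audit times (33 and 12 days) and the choice of five classes come from the text; the function name is our own, and rounding up to the next whole number is only one possible way of choosing a "more convenient" width.

```python
import math

def approximate_class_width(largest, smallest, number_of_classes):
    """(Largest data value - smallest data value) / number of classes."""
    return (largest - smallest) / number_of_classes

width = approximate_class_width(33, 12, 5)   # audit time data: 12 to 33 days
convenient_width = math.ceil(width)          # assumed rounding rule: round up

print(width, convenient_width)
```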
C. Class Limits: Class limits must be chosen so that each data item belongs to one and only
one class. The lower class limit identifies the smallest possible data value assigned to the
class. The upper class limit identifies the largest possible data value assigned to the class.
In developing frequency distributions for qualitative data, we did not need to specify
class limits because each data item naturally fell into a separate class. But with
quantitative data, such as the audit times in the table, class limits are necessary to
determine where each data value belongs. Using the audit time data in the table, we
selected 10 days as the lower class limit and 14 days as the upper class limit for the first
class.
The smallest data value, 12, is included in the 10–14 class. We then selected 15 days as the
lower class limit and 19 days as the upper class limit of the next class. We continued defining the
lower and upper class limits to obtain a total of five classes: 10–14, 15–19, 20–24, 25–29, and
30–34. The largest data value, 33, is included in the 30–34 class. The difference between the
lower class limits of adjacent classes is the class width. Using the first two lower class limits of
10 and 15, we see that the class width is 15 - 10 = 5. With the number of classes, class width, and
class limits determined, a frequency distribution can be obtained by counting the number of data
values belonging to each class.
The most frequently occurring audit times are in the class of 15–19 days. Eight of the 20 audit
times belong to this class. Only one audit required 30 or more days. Other conclusions are
possible, depending on the interests of the person viewing the frequency distribution. The value
of a frequency distribution is that it provides insights about the data that are not easily obtained
by viewing the data in their original unorganized form.
D. Class Midpoint: In some applications, we want to know the midpoints of the classes in a
frequency distribution for quantitative data. The class midpoint is the value halfway
between the lower and upper class limits. For the audit time data, the five class midpoints
are 12, 17, 22, 27, and 32.
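The class limits and midpoints for the audit time data can be generated with a short sketch. The function names are our own; the first lower limit (10), the class width (5), and the number of classes (5) are taken from the text, and the data are assumed to be whole days.

```python
def build_classes(first_lower_limit, width, number_of_classes):
    """Return a list of (lower limit, upper limit) pairs for whole-number data."""
    classes = []
    for i in range(number_of_classes):
        lower = first_lower_limit + i * width
        upper = lower + width - 1  # whole days, so upper limit = next lower limit - 1
        classes.append((lower, upper))
    return classes

def class_of(value, classes):
    """Return the one class whose limits contain the value, or None."""
    for lower, upper in classes:
        if lower <= value <= upper:
            return (lower, upper)
    return None

classes = build_classes(10, 5, 5)
midpoints = [(lower + upper) / 2 for lower, upper in classes]

print(classes)
print(midpoints)
print(class_of(12, classes), class_of(33, classes))
```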
Exercises
1. The dean of the college of Business and Economics wishes to determine the amount
of studying business and economics students do. He selects a random sample of 30
students and determines the number of hours each student studies per week: 15.0,
23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8, 13.5, 20.7, 17.4, 18.6, 12.9, 20.3, 13.7, 21.4,
18.3, 29.8, 17.1, 18.9, 10.3, 26.1, 15.7, 14.0, 17.8, 33.8, 23.2, 12.9, 27.1, 16.6. Organize
the data into a frequency distribution.
2. Consider the following classes and their frequencies:
Class    Frequency
10–15    10
16–21    20
22–27    30
28–33    25
Total    85
Required:
i. Determine LCL, LCB, UCL, and UCB for each class.
ii. Develop additional columns for class marks, relative frequencies, less than
cumulative frequencies, and more than cumulative frequencies.
Cumulative Frequency
Example
The marks of 30 students of a class, obtained in a test out of 75, are given below:
42, 21, 50, 37, 42, 37, 38, 42, 49, 52, 38, 53, 57, 47, 29, 59, 61, 33, 17, 17, 39, 44, 42, 39, 14, 7,
27, 19, 54, 51.
A frequency table and a cumulative frequency table with equal class interval is formed.
The Greater than Type cumulative frequency: The cumulative frequency of a particular class
and all the classes after that class is called the "greater than" type cumulative frequency.
The greater than cumulative frequencies are related to the lower class limits and form a
decreasing sequence. The marks of 30 students of a class, obtained in a test out of 75, are given
below:
42, 21, 50, 37, 42, 37, 38, 42, 49, 52, 38, 53, 57, 47, 29, 59, 61, 33, 17, 17, 39, 44, 42, 39, 14, 7,
27, 19, 54, 51.
On the basis of the given data, a frequency table and a "greater than" type cumulative frequency
table with equal class intervals would look like this.
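As a sketch of how such tables can be built, the following Python code computes the frequencies and both types of cumulative frequency for the 30 marks. The class intervals (width 10, starting at 0) are an assumption made for illustration and may differ from those used in the original table.

```python
from itertools import accumulate

marks = [42, 21, 50, 37, 42, 37, 38, 42, 49, 52, 38, 53, 57, 47, 29,
         59, 61, 33, 17, 17, 39, 44, 42, 39, 14, 7, 27, 19, 54, 51]

# Assumed classes: 0-9, 10-19, ..., 60-69 (equal class width of 10).
width = 10
lower_limits = list(range(0, 70, width))
frequencies = [sum(1 for m in marks if lower <= m < lower + width)
               for lower in lower_limits]

# "Less than" type: running total from the first class downward (increasing).
less_than = list(accumulate(frequencies))

# "Greater than" type: the class itself plus all classes after it (decreasing).
greater_than = list(accumulate(reversed(frequencies)))[::-1]

for lower, f, lt, gt in zip(lower_limits, frequencies, less_than, greater_than):
    print(f"{lower:>2}-{lower + width - 1:<2}  f={f:>2}  less than={lt:>2}  greater than={gt:>2}")
```

The last "less than" value and the first "greater than" value both equal the total number of students, which is a quick check on the table.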
2.1.3 Diagrammatical presentation of data
Even though the tabular method of presentation yields good information for those who
can understand tables, it may not generate understandable information for
the general reader. For this reason we introduce other, more accessible means of data
presentation. Diagrammatic presentation of data
has the following advantages:
They help in drawing out the required information within a short period of time
and without any complexity.
They have greater attraction than figures.
They facilitate comparison.
Diagrammatic presentations have greater importance in the presentation of
categorical data. There are different types of diagrammatic presentation that are in
use these days. Some of these are discussed next.
A. Bar charts
Bar charts are one dimensional rectangular diagrams usually used to display
qualitative distributions. Bar charts have the following common characteristics:
a. The length or height of the bar associated with a category or a class interval
represents the corresponding frequency.
b. The bars are equally spaced. Equal space should be left between consecutive
bars.
c. Each bar has equal width.
There are different types of bar charts used for data presentation: vertical bar
graph, horizontal bar graph, grouped bar graph, stacked bar graph, etc.
Example: Consider the preceding illustration about the level of Education of employees of
certain organization.
LOE NO
Diploma 15
Bachelor 20
Master 10
Ph.D 5
Total 50
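The idea that the length of each bar represents the frequency can be shown with a simple text-based sketch. This is not a substitute for a plotting tool; it only prints a horizontal bar chart in which each symbol stands for one employee. The function name and the choice of symbol are our own.

```python
frequencies = {"Diploma": 15, "Bachelor": 20, "Master": 10, "Ph.D": 5}

def bar_chart_lines(data, symbol="#"):
    """Return one text line per category; the bar length equals the frequency."""
    label_width = max(len(label) for label in data)
    return [f"{label:<{label_width}} | {symbol * count} {count}"
            for label, count in data.items()]

lines = bar_chart_lines(frequencies)
for line in lines:
    print(line)
```

Each bar starts at the same position and the labels are padded to the same width, so the bars can be compared directly by length.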
[Figure: vertical and horizontal bar charts of the number of employees by level of education (Diploma, Bachelor, Master, Ph.D, Total)]
Example: The following describes types of clothes and area of sales of these clothes
for the year 2005 by a textile factory. Present it using a bar chart.
[Table: sales by type of clothes (Men's, Women's, Children's) and area of sales (Local, Export)]
Total sales for 2005 by the type of clothes manufactured and areas of sales
[Figure: bar charts of Local, Export, and Total sales by type of clothes (Men's, Women's, Children's, Total)]
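B. Pie Chart
LOE        fi    fi/n    Ai (degrees)
Diploma    15    0.30    108
Bachelor   20    0.40    144
Master     10    0.20     72
Ph.D        5    0.10     36
Total      50    1       360
Here Ai = (fi/n) × 360 is the angle of the sector for each component.
In a pie chart it is better to shade each component in a different color. It is also
better to show the percentage associated with each component for easy comparison.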
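The sector angles of a pie chart follow directly from the relative frequencies. The sketch below uses the level of education counts from the earlier example; the variable names are our own.

```python
frequencies = {"Diploma": 15, "Bachelor": 20, "Master": 10, "Ph.D": 5}
n = sum(frequencies.values())

# Relative frequency of each component and the angle of its sector in degrees.
relative = {level: f / n for level, f in frequencies.items()}
angles = {level: f * 360 / n for level, f in frequencies.items()}

for level in frequencies:
    print(f"{level:<9} {frequencies[level]:>3} {relative[level]:>5.2f} {angles[level]:>6.1f}")
print(f"{'Total':<9} {n:>3} {sum(relative.values()):>5.2f} {sum(angles.values()):>6.1f}")
```

The angles add up to 360 degrees, just as the relative frequencies add up to 1.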
[Figure: pie chart of employees by level of education (Diploma, Bachelor, Masters, Ph.D)]
Graphical presentation of data
Histogram
Parts of a Histogram
1. The title: The title describes the information included in the histogram.
2. X-axis: The X-axis shows the intervals, that is, the scale of values under which the
measurements fall.
3. Y-axis: The Y-axis shows the number of times that the values occurred within
the intervals set by the X-axis.
4. The bars: The height of the bar shows the number of times that the values
occurred within the interval, while the width of the bar shows the interval
that is covered. For a histogram with equal bins, the width should be the same
across all bars.
Frequency Polygon
Mark the class intervals for each class on the horizontal axis. We will plot the
frequency on the vertical axis.
Calculate the class mark for each class interval. The formula for class mark is:
Class mark = (Upper limit + Lower limit) / 2
Mark all the class marks on the horizontal axis. The class mark is also known as the
mid-value of every class.
Corresponding to each class mark, plot the frequency as given to you. The
height always depicts the frequency. Make sure that the frequency is plotted
against the class mark and not the upper or lower limit of any class.
Join all the plotted points using line segments. The curve obtained will be
kinked.
Note that the above method is used to draw a frequency polygon without drawing a
histogram. You can also draw a histogram first by drawing rectangular bars against
the given class intervals. After this, you must join the midpoints of the bars to obtain
the frequency polygon. Remember that the bars will have no spaces between them
in a histogram.
We now start by plotting the class marks such as 54.5, 64.5, 74.5 and so on till 94.5.
Note that we will also plot the previous and next class marks to start and end the
polygon, i.e. we plot 44.5 and 104.5 as well.
Then, the frequencies corresponding to the class marks are plotted against each
class mark. The frequencies for class
marks 44.5 and 104.5 are zero, so these points touch the x-axis. These points are used
only to give a closed shape to the polygon. The polygon looks like this:
Frequency Polygon
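The points of a frequency polygon can be computed as in the sketch below. The class limits are assumed so that the class marks match those mentioned above (54.5 to 94.5), and the frequencies are hypothetical, since the original table is not reproduced here.

```python
classes = [(50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]  # assumed class limits
frequencies = [4, 8, 12, 6, 2]                                 # hypothetical frequencies

# Class mark = (upper limit + lower limit) / 2
class_marks = [(lower + upper) / 2 for lower, upper in classes]
width = class_marks[1] - class_marks[0]

# Add one class mark before and one after, each with zero frequency,
# so that the polygon closes on the x-axis.
points = ([(class_marks[0] - width, 0)]
          + list(zip(class_marks, frequencies))
          + [(class_marks[-1] + width, 0)])

for x, y in points:
    print(x, y)
```

Joining these points in order with line segments gives the frequency polygon.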
Cumulative frequency curve or Ogive
The graphs of the frequency distribution are frequency graphs that are used to
exhibit the characteristics of discrete and continuous data. Such figures are more
appealing to the eye than the tabulated data. It helps us to facilitate the
comparative study of two or more frequency distributions. We can relate the
shape and pattern of the two frequency distributions.
The graph given above represents less than and the greater than Ogive curve.
The rising curve (Brown Curve) represents the less than Ogive, and the falling
curve (Green Curve) represents the greater than Ogive.
The frequencies of all preceding classes are added to the frequency of a class.
This series is called the less than cumulative series. It is constructed by adding
the first-class frequency to the second-class frequency and then to the third class
frequency and so on. The downward accumulation results in the less than
cumulative series.
The frequencies of the succeeding classes are added to the frequency of a class.
This series is called the more than or greater than cumulative series. It is
constructed by subtracting the first class frequency from the total, the second class
frequency from that result, the third class frequency from that, and so on. The upward
accumulation results in the greater than (more than) cumulative series.