Topic 4
Introduction
Descriptive Statistics
The Normal Distribution
Hypothesis Testing
Correlation and Regression
Chi-Square
Conducting a Study
4.1 Introduction
Mathematics provides tools for understanding and dealing with various aspects of present-day living. One
such tool is Statistics. This chapter presents a variety of statistical tools to process, manage, or use numerical
data in order to describe a phenomenon and predict values.
Recall that Statistics is a field or discipline of study pertaining to the collection, organization, summary,
presentation, and analysis of data to ultimately make wise or sound decisions. In Statistics, we often deal
with numerical facts known as statistics. Examples of statistics are numbers that represent the monthly
income of an employee, age of a college student, rate of passing in a Board Examination, COVID-19 death
counts, and others.
The two branches of Statistics, Descriptive and Inferential, were introduced in High School. We have
learned that Statistics is important and that knowledge of Statistics could be of great use, especially in the
conduct of quantitative research. Let us deepen our understanding of these branches of Statistics
so that we can critically examine information from various sources such as newspapers, magazines,
journals, and the web.
4.2 Descriptive Statistics
Descriptive Statistics consists of methods for organizing, displaying, and describing data by using tables,
graphs, and summary descriptive measures. In research, descriptive statistics usually takes the
form of graphs, charts, diagrams, and basic mathematics used to describe data.
The descriptive measures we obtain are often numerical in nature. For instance, when we talk about
measures of central tendency, we refer to the values of the mean, median, and mode as summary measures.
Similarly, when we talk about measures of dispersion, we refer to the numerical values of the range,
variance, and standard deviation as most common summary measures.
The measures of central tendency are used to describe the “average”, “middle”, or “central” point of a set
of data. Each measure can serve as a reference point against which individual scores are compared.
The mean is the arithmetic average, the median is the middle score after arranging the scores in
ascending or descending order, and the mode is the most frequent score.
The measures of central tendency alone are certainly not sufficient to give an adequate description
of the data. We also need to know the degree or extent to which numerical values are dispersed or spread
out about the average value in the distribution. Hence, we compute the measures of dispersion.
Shown below are the formulas for the Mean and Standard Deviation of ungrouped data:
Note: Formulas intentionally omitted for you to recall past lessons or do some readings and subsequently
communicate with your MMWN01A professor when necessary.
Example:
Compute the mean, median, mode, range, variance, and standard deviation of the following scores of nine
(9) students in a 30-item test:
14, 15, 21, 20, 14, 15, 30, 6, 9
Solution:
By using the formula or a scientific calculator, we can easily verify that the mean is 144/9 = 16.
Arranging the scores from lowest to highest (6, 9, 14, 14, 15, 15, 20, 21, 30), it becomes clear that the
median is 15. By inspection, we see that two values, 14 and 15, each appear twice while the other values
appear only once. Thus, the modes are 14 and 15 and we say that the set of data is bimodal.
To compute the range, we simply find the difference between the highest and the lowest scores. So, the
range of the set of scores is 30 – 6 = 24. To find the variance and the standard deviation, we may use the
formula or a scientific calculator. For the given data, treated as a sample (divisor n – 1), the variance is
49.50 and the standard deviation (which is the square root of the variance) is 7.04.
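As a quick check of the computations in this example, here is a minimal sketch that uses Python's built-in
statistics module in place of a scientific calculator. The variable names are our own, and the sketch assumes
the nine scores are treated as a sample, so the variance divides by n – 1.

```python
import statistics

# Scores of the nine students in the example
scores = [14, 15, 21, 20, 14, 15, 30, 6, 9]

mean = statistics.mean(scores)            # arithmetic average
median = statistics.median(scores)        # middle score of the sorted data
modes = statistics.multimode(scores)      # every value tied for most frequent
score_range = max(scores) - min(scores)   # highest score minus lowest score
variance = statistics.variance(scores)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)        # square root of the sample variance

print("mean:", mean)
print("median:", median)
print("modes:", modes)
print("range:", score_range)
print("variance:", variance)
print("standard deviation:", round(std_dev, 2))
```

If the scores were regarded as an entire population instead, statistics.pvariance and statistics.pstdev
(which divide by n) would be the functions to use.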
4.3 The Normal Distribution
The Normal Distribution is the most widely used probability distribution. Two mathematicians are
credited with the study of this particular distribution: Abraham de Moivre and Johann Carl
Friedrich Gauss.
Many phenomena in real life follow the normal distribution. Variables such as the weight of mangoes in a crate,
the height of teachers in a big private university, the scores of students in a quiz, and the intelligence
quotient (IQ) of applicants for a government position, among others, tend to approximately follow a normal
distribution. The graphical representation of a normal distribution is a normal curve (as shown below):
Note: Graphical representation intentionally omitted for you to recall past lessons or do some readings
and subsequently communicate with your MMWN01A professor when necessary.
Recall that a normal curve is bell-shaped and symmetric about the vertical line through its mean. Its tails extend indefinitely in both
directions and are asymptotic to the baseline. The mean, median, and mode are all equal and are located at
the center of the distribution. Moreover, recall that the mean and the standard deviation are two important
parameters that are used to describe it. We can come up with different normal curves, depending on the
values of the two parameters. If in a normal distribution the mean is 0 and the standard deviation is 1, then
we have a standard normal distribution.
The areas under the normal curve indicate probability. The total area under the normal curve is 100 % or
1.0. This important characteristic of the normal curve allows us to compute for the indicated area under the
normal curve. Areas under the normal curve from a standard score of zero, or z = 0, up to an indicated value
of z, could be found in Appendix A. By making use of MS Excel, we can also find areas under the normal
curve and solve problems involving the applications of areas under the normal curve.
Example:
The assembly time for a particular toy car follows a normal distribution with a mean of 55 minutes and a
standard deviation of 4 minutes. If the company closes at 5:00 P.M., what is the probability that a worker
will be able to finish such task if he starts to assemble a toy car at 4:10 P.M.?
Solution:
To solve this problem, we let x = time to assemble a toy car. It is given that x is normally distributed with
mean = 55 minutes and standard deviation = 4 minutes. Since there are 50 minutes between 4:10 P.M. and
5:00 P.M., we want to find the probability that the worker will be able to do the task in 50 minutes or less.
We first convert 50 minutes (the raw score) into a standard score using the formula z = (x – mean)/s, where
s is the standard deviation. Hence, z = (50 – 55)/4 = -5/4 = -1.25. The probability or the area we want to
obtain is shown below:
Note: Illustration intentionally omitted for you to recall past lessons or do some readings and
subsequently communicate with your MMWN01A professor when necessary.
Using the table of areas under the normal curve, we obtain the area from z = 0 to z = 1.25, which is 0.3944.
By symmetry, this is also the area from z = -1.25 to z = 0.
Since what we want is the area to the left of z = -1.25, we subtract 0.3944 from 0.5 (which is the area of the
left half of the normal curve) to get the answer. Thus, A = 0.5 – 0.3944 = 0.1056. This tells us that there is
a 10.56 % probability that the worker will be able to finish the task before the company closes at 5:00 P.M.
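The same area can be obtained with a computer program instead of the table. The sketch below is one possible
way to do it in Python, using NormalDist from the built-in statistics module. The variable names are our own
choices for illustration.

```python
from statistics import NormalDist

mean_minutes = 55        # mean assembly time given in the example
sd_minutes = 4           # standard deviation given in the example
available_minutes = 50   # time from 4:10 P.M. to 5:00 P.M.

# Standard score of the raw score x = 50
z = (available_minutes - mean_minutes) / sd_minutes

# Area under the normal curve to the left of x = 50, i.e. P(x <= 50)
assembly_time = NormalDist(mu=mean_minutes, sigma=sd_minutes)
probability = assembly_time.cdf(available_minutes)

print(f"z = {z:.2f}")
print(f"P(x <= 50) = {probability:.4f}")
```

The cdf method returns the area to the left of the given value directly, so there is no need to subtract a
table value from 0.5 as in the manual solution.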
4.4 Hypothesis Testing
Hypothesis testing is a topic that falls under Inferential Statistics. In a test of hypothesis, we test a certain
theory about a population parameter by means of some information from a sample. It is made up of several
steps and procedures whose end goal is to accept or reject the null hypothesis in favor of the
alternative hypothesis by comparing the computed value with the tabular or critical value of a test statistic.
The choice of a test statistic will depend on the assumed probability model as well as the hypothesis under
consideration. Some examples of elementary tests are the z-test, t-test, and chi-square test.
Typically, a hypothesis testing procedure will consist of the following steps:
1. Formulate the null hypothesis and the corresponding alternative hypothesis.
2. Specify the significance level and decide if a one-tailed or a two-tailed test should be used.
3. Decide on the appropriate test statistic and compute its value.
4. Look for the value of the test statistic from the appropriate table.
5. Make a decision by way of comparing the computed value and the tabular value of the test statistic.
The rule is to reject the null hypothesis if the computed value is greater than the tabular value.
Otherwise, do not reject (and therefore accept) the null hypothesis.
6. Arrive at a conclusion.
It is advisable that we follow the six-step procedure in testing a hypothesis. But nowadays, by means of
technology or a computer program, we may readily accept or reject the null hypothesis by simply obtaining
the p-value. We compare the level of significance with this p-value. If the level of significance is greater
than the p-value, we reject the null hypothesis. If the level of significance is less than the p-value, we do
not reject the null hypothesis.
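The two decision rules described in this section can be written as short Python functions. This is only a
sketch: the function names are hypothetical, the p-value of 0.03 is a made-up number for illustration, and
the case where the two compared values are exactly equal (which the text does not address) is treated here
as "do not reject".

```python
def decide_by_critical_value(computed: float, tabular: float) -> str:
    """Step 5: reject H0 when the computed value exceeds the tabular value."""
    return "reject H0" if computed > tabular else "do not reject H0"


def decide_by_p_value(p_value: float, alpha: float) -> str:
    """Reject H0 when the level of significance is greater than the p-value."""
    return "reject H0" if alpha > p_value else "do not reject H0"


# Computed and tabular chi-square values taken from the cola example in Section 4.6
print(decide_by_critical_value(computed=1.90, tabular=13.2767))

# Hypothetical p-value, used only to illustrate the rule
print(decide_by_p_value(p_value=0.03, alpha=0.05))
```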
4.5 Correlation and Regression
The methods of correlation and regression are important in the analysis of data. We can use them to predict
the value of a variable given certain conditions.
Correlation is a way of measuring the strength of the relationship between two variables. We may first construct
a scatter diagram to see if a certain relationship exists. If, for instance, a linear relationship exists between a
pair of ratio-level variables, we may proceed by computing the Pearson Product-Moment Coefficient of
Correlation, r. Recall also that when data are ranked or are at the ordinal level, we may obtain the strength of
relationship of the variables by means of the Spearman Rank-Order Correlation Coefficient. The coefficient
of correlation always ranges from -1 to +1. The following table can serve as a guide in interpreting
linear correlation:
Note: Table intentionally omitted for you to recall past lessons and do some readings and subsequently
communicate with your MMWN01A professor when necessary.
If two variables have a strong linear relationship, we may be interested in finding the best-fitting
straight line that describes or summarizes them. This line is the regression line, and obtaining
its equation could help us predict additional values of the variables under consideration. This method of
predicting values is known as the Linear Regression method.
The formulas for Pearson r, Spearman rho, and Regression Equation are shown below:
Note: Formulas intentionally omitted for you to recall past lessons or do some readings and subsequently
communicate with your MMWN01A professor when necessary.
Example:
A high school student was curious to find out if there is a relationship between age and height. He randomly
asked 12 students about their respective ages and heights and obtained the data as shown in the table below.
Student          1    2    3    4    5    6    7    8    9    10   11   12
Age (in years)   16   14   20   14   17   12   19   17   16   15   18   15
Height (in cm)   156  152  165  153  162  150  170  160  157  155  165  152
(1) Draw the Scatter Diagram.
(2) Compute for Pearson r.
(3) Obtain the regression equation.
(4) Draw the regression line.
(5) Predict the height of a student who is 13 years old.
(6) Predict the age of a student whose height is 181 cm.
Solutions:
Note: Solutions intentionally omitted for you to solve and communicate with your professor.
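After solving the problem by hand, you may check your answers to parts (2), (3), and (5) with a short program.
The sketch below assumes the usual computational formulas for Pearson r and for the least-squares line
y = a + bx, with age as x and height as y. The function names are our own, and no expected output is shown so
that you can compare the results with your own solution.

```python
import math

ages = [16, 14, 20, 14, 17, 12, 19, 17, 16, 15, 18, 15]
heights = [156, 152, 165, 153, 162, 150, 170, 160, 157, 155, 165, 152]


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of paired data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


r = pearson_r(ages, heights)
intercept, slope = regression_line(ages, heights)
predicted_height_at_13 = intercept + slope * 13

print(f"r = {r:.4f}")
print(f"height = {intercept:.4f} + {slope:.4f} * age")
print(f"predicted height at age 13 = {predicted_height_at_13:.2f} cm")
```

For part (6), note that 181 cm lies outside the range of the observed heights, so any predicted age should be
interpreted with caution.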
4.6 Chi-Square Tests
The statistical test most commonly used for qualitative data is the chi-square test.
The chi-square test is a non-parametric test, that is, a test that does not strictly require a normal distribution.
Instead of strictly using parameters such as the mean and the standard deviation, the chi-square test uses counts
or frequencies. It compares observed and expected frequencies, and it utilizes nominal
and ordinal data.
Recall that qualitative data are obtained from a qualitative variable. Also known as a categorical
variable, it is a variable that can be classified into two or more nonnumeric categories. Nominal data on
nationality, religious affiliation, and gender are examples of qualitative data. Ordinal data, such as the data
obtained from a Likert Scale which makes use of ordering or ranking, can also be qualitative in nature.
The chi-square test may function as a goodness-of-fit test, a test of independence, or a test of homogeneity.
In a goodness-of-fit test, we test the null hypothesis that the observed frequencies follow a certain pattern
or theoretical distribution. We want to find out how good or how well the observed frequencies fit the
pattern.
Some useful formulas related to chi-square are as follows:
Note: Formulas intentionally omitted for you to recall past lessons or do some readings and subsequently
communicate with your MMWN01A professor when necessary.
Example:
A store stocks five brands of cola. The choices made by a sample of 100 customers in that store are shown
in the table below. Conduct a chi-square goodness-of-fit test with a 1 % level of significance to determine
if the consumers in this store differ significantly from the population of cola consumers at large.
Cola Brand                                       A    B    C    D    E
Number of customers who prefer the cola brand    18   16   23   20   23
Solution:
We follow the 6-step procedure in hypothesis testing:
1. Formulate the null hypothesis as well as the corresponding alternative hypothesis.
Null Hypothesis, Ho: There is no significant difference between the cola preferences.
Alternative hypothesis, Ha: There is a significant difference between the cola preferences.
2. Specify the significance level and decide if a one-tailed or a two-tailed test should be used.
For this problem, the significance level is set at 1 % or 0.01.
3. Decide on the appropriate test statistic and compute its value.
The chi-square goodness-of-fit test will be used. To compute its value, it is helpful to construct the
table below:
Note: Table intentionally omitted for you to solve and show, and subsequently communicate with your
MMWN01A professor when necessary.
Observe that the expected frequencies for all of the categories, as shown in Column 3, are equal to
each other: with 100 customers and 5 brands, each expected frequency is 100/5 = 20.
The value of chi-square is obtained by getting the sum of the entries in the last column. That is,
χ² = Σ (fo – fe)²/fe = 1.90
4. Look for the value of the test statistic from the appropriate table.
We first compute for the degrees of freedom, df. For chi-square goodness-of-fit test, df = k – 1,
where k is the number of categories. Since there are 5 cola brands, df = 5 – 1 = 4.
We may now use the Table of Critical Values of Chi-Square. For df = 4 and α = 0.01, the critical
or tabular value of chi-square is 13.2767.
5. Make a decision by way of comparing the computed value and the tabular value of the test statistic.
The computed value of chi-square is 1.90 while the tabular value is 13.2767.
Since the computed value is less than the tabular value, we do not reject the null hypothesis.
6. Arrive at a conclusion.
We now conclude that, based on the available data, there is no significant difference between cola
preferences. This means that the consumers in this store do not differ significantly from the
population at large.
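The computation in Steps 3 to 5 can also be sketched in Python. The tabular value 13.2767 is copied from the
text rather than computed, the variable names are our own, and the expected frequencies are assumed to be
equal for all five brands, as stated in Step 3.

```python
# Observed number of customers preferring each cola brand
observed = {"A": 18, "B": 16, "C": 23, "D": 20, "E": 23}

total = sum(observed.values())
k = len(observed)              # number of categories
expected = total / k           # equal expected frequency for every brand

# Chi-square statistic: sum of (fo - fe)^2 / fe over all categories
chi_square = sum((fo - expected) ** 2 / expected for fo in observed.values())

df = k - 1                     # degrees of freedom for goodness-of-fit
tabular_value = 13.2767        # from the table of critical values, df = 4, alpha = 0.01

decision = "reject H0" if chi_square > tabular_value else "do not reject H0"

print(f"chi-square = {chi_square:.2f}, df = {df}, decision: {decision}")
```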
Example:
A certain machine is supposed to mix peanuts, green peas, and corn in the ratio 1:2:3. A pack containing
600 pieces of these was found to have 102 peanuts, 201 green peas, and 297 corn. At the 0.05 significance
level, test the hypothesis that the machine is still functioning properly.
Solution:
The machine is still functioning properly if it still mixes peanuts, green peas, and corn in the ratio 1:2:3.
We now construct the table as follows:
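Item         Ratio   Observed, fo   Expected, fe        fo – fe   (fo – fe)²   (fo – fe)²/fe
peanuts      1       102            (600/6)*1 = 100     2         4            0.04
green peas   2       201            (600/6)*2 = 200     1         1            0.005
corn         3       297            (600/6)*3 = 300     -3        9            0.03
Sum                                                                            0.075
The computed value of chi-square is 0.075. With k = 3 categories, df = 3 – 1 = 2, and the tabular value of
chi-square for df = 2 and α = 0.05 is 5.99146. Since the computed value is less than the tabular value, we do
not reject the null hypothesis. Based on the available data, the machine is still functioning properly, that
is, it still mixes peanuts, green peas, and corn in the ratio 1:2:3.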
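When the expected frequencies come from a ratio rather than being equal, they can be computed as in the
sketch below. The variable names are our own, and the tabular value 5.99146 is the df = 2, α = 0.05 value
quoted in this section rather than one computed by the program.

```python
# Required mixing ratio and the observed counts in the 600-piece pack
ratio = {"peanuts": 1, "green peas": 2, "corn": 3}
observed = {"peanuts": 102, "green peas": 201, "corn": 297}

total = sum(observed.values())      # 600 pieces
ratio_sum = sum(ratio.values())     # 1 + 2 + 3 = 6 parts

# Expected frequency of each item: (total / ratio_sum) * its share of the ratio
expected = {item: total * part / ratio_sum for item, part in ratio.items()}

# Chi-square statistic: sum of (fo - fe)^2 / fe over all items
chi_square = sum(
    (observed[item] - expected[item]) ** 2 / expected[item] for item in observed
)

df = len(observed) - 1              # k - 1 = 2
tabular_value = 5.99146             # df = 2, alpha = 0.05 (taken from the text)
decision = "reject H0" if chi_square > tabular_value else "do not reject H0"

print(expected)
print(f"chi-square = {chi_square:.3f}, df = {df}, decision: {decision}")
```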
This time, let us conduct a chi-square test of independence. Here, we test the null hypothesis that two
attributes or characteristics of the elements of a given population are independent (not related) against
the alternative that they are dependent (related).
Example:
A random sample of 4,000 adult men and women was asked about their opinion on a particular issue. The
results are shown below:
Is opinion independent of gender with regard to the issue? Test at the 5 % significance level.
Solution:
The test of independence may be carried out in the same manner as that of the goodness-of-fit test with
some additional preliminary steps.
We start by considering the given table as the table of observed frequencies. However, we need to include
the total for each row and column as follows:
Observed Frequencies
We then proceed by constructing the table for expected frequencies using the formula
E = (R * C)/G, where E is the expected frequency, R is the row total, C is the column total, and G is the grand total.
For instance, the entry for expected frequency under “In favor”, “Men” is
E = R * C / G = (2,000)(1,900)/4,000 = 950.
Using the formula, other values could be obtained in order to come up with the table of Expected
Frequencies as shown below:
Expected Frequencies
The chi-square computed value is the sum of the entries in the last column. Thus,
Chi-square = 238.60
The degrees of freedom for a chi-square test of independence are given by df = (r – 1)(c – 1), where r and c
refer to the number of rows and the number of columns, respectively.
In our example, df = (2 – 1)(3 – 1) = (1)(2) = 2.
Based on the table, the chi-square tabular value at a significance level of 0.05 and df = 2 is 5.99146.
By comparison, the computed value, 238.60, is greater than the tabular value, 5.99146.
With this, we need to reject the null hypothesis. We therefore conclude that opinion and gender are
dependent. This simply means that gender and opinion on a particular issue are related.
Again, it is important to note that we can perform any statistical test, such as the chi-square test, not only by
manual computation but also by way of computer programs.
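As an illustration, the expected-frequency formula and the degrees of freedom of the test of independence can
be sketched in Python as follows. The function names are hypothetical, and the general function assumes that
the observed frequencies are supplied as a list of rows without the totals. Only the single cell worked out in
the text is used as a demonstration here.

```python
def expected_frequency(row_total, column_total, grand_total):
    """E = (R * C) / G for one cell of the table."""
    return row_total * column_total / grand_total


def chi_square_independence(observed):
    """Return (chi-square value, df) for a table of observed frequencies."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)

    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            fe = expected_frequency(row_totals[i], column_totals[j], grand_total)
            chi_square += (fo - fe) ** 2 / fe

    # df = (r - 1)(c - 1)
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi_square, df


# Expected frequency for the cell under "In favor", "Men" from the text
print(expected_frequency(2000, 1900, 4000))
```

Passing the full table of observed frequencies to chi_square_independence would give the computed chi-square
value and the degrees of freedom, which can then be compared with the tabular value as in the manual solution.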
4.7 Conducting a Study
Conducting a Study or Research brings a lot of benefits. In order to come up with a credible output, data
collection, organization, summary, presentation, and analysis should be done carefully. Especially for
quantitative research and mixed research which utilize quantitative data, a researcher must have a good
command or a deep understanding of Statistics, be it descriptive or inferential. We should always
remember that Statistics is a very important tool of Research, so a working knowledge of statistics will
surely help us a lot in our research endeavors.
Research as we know is a broad field. We must be equipped with the necessary skills, values, and attitude
to come up with a commendable research output. In order for us to be guided accordingly, we may consult
individuals who have already established themselves in the field of Research. Reading recommended books
in research and conducting research with the help of a mentor until such time that we can do it by ourselves
will also be of great help. We must properly observe research ethics and protocol in conducting our research
studies.
After conducting a Study or Research, it is equally important to utilize and disseminate our findings. We
may present our findings in proper venues such as seminars and conferences. Finally, we may even have them
published, especially if they really offer a solution to a particular problem or contribute to national
development.