Allama Iqbal Open University Islamabad: Muhammad Ashraf
Assignment No. 2
Registration No 17-PMI-02128
Common Uses: The Independent Samples t-Test is commonly used to test the following:
Statistical differences between the means of two groups
Statistical differences between the means of two interventions
Statistical differences between the means of two change scores
Data Requirements: Your data must meet the following requirements:
i. Dependent variable that is continuous (i.e., interval or ratio level)
ii. Independent variable that is categorical (i.e., two or more groups)
iii. Cases that have values on both the dependent and independent variables
iv. Independent samples/groups (i.e., independence of observations)
There is no relationship between the subjects in each sample. This means that:
Subjects in the first group cannot also be in the second group
No subject in either group can influence subjects in the other group
No group can influence the other group
Violation of this assumption will yield an inaccurate p value
v. Random sample of data from the population
vi. Normal distribution (approximately) of the dependent variable for each group
Non-normal population distributions, especially those that are thick-tailed or heavily
skewed, considerably reduce the power of the test
Among moderate or large samples, a violation of normality may still yield accurate p
values
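As an illustration of how such a test might be carried out in practice, here is a minimal sketch in Python using scipy.stats.ttest_ind; the two groups' scores are hypothetical values invented for demonstration:

    from scipy import stats

    # Hypothetical exam scores for two independent groups of students
    group_a = [72, 78, 85, 69, 74, 81, 77, 70]
    group_b = [65, 71, 68, 74, 62, 69, 73, 66]

    # Independent samples t-test (equal variances assumed by default;
    # pass equal_var=False for Welch's t-test when variances differ)
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

A p-value below the chosen cutoff (commonly .05) would indicate a statistically significant difference between the two group means.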
Paired Sample t-Test: A paired sample t-test is used to determine whether the mean difference between two
sets of observations is zero. In a paired sample t-test, each subject or entity is measured
twice, resulting in pairs of observations. Common applications of the paired sample t-test
include case-control studies or repeated-measures designs. Suppose you are interested in
evaluating the effectiveness of a company training program. One approach you might
consider would be to measure the performance of a sample of employees before and after
completing the program, and analyze the differences using a paired sample t-test.
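A minimal sketch of that before-and-after analysis in Python, using scipy.stats.ttest_rel; the performance scores below are hypothetical numbers invented for illustration:

    from scipy import stats

    # Hypothetical performance scores for the same employees, measured
    # before and after completing the training program (paired by position)
    before = [64, 70, 58, 75, 68, 62, 71, 66]
    after  = [69, 74, 63, 78, 72, 61, 77, 70]

    # Paired sample t-test on the differences (after - before)
    t_stat, p_value = stats.ttest_rel(after, before)
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")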
Hypotheses: Like many statistical procedures, the paired sample t-test has two competing
hypotheses, the null hypothesis and the alternative hypothesis. The null hypothesis assumes
that the true mean difference between the paired samples is zero. Under this model, all
observable differences are explained by random variation. Conversely, the alternative
hypothesis assumes that the true mean difference between the paired samples is not equal to
zero. The alternative hypothesis can take one of several forms depending on the expected
outcome. If the direction of the difference does not matter, a two-tailed hypothesis is used.
Otherwise, an upper-tailed or lower-tailed hypothesis can be used to increase the power of
the test. The null hypothesis remains the same for each type of alternative hypothesis. The
paired sample t-test hypotheses are formally defined below:
The null hypothesis (\(H_0\)) assumes that the true mean difference (\(\mu_d\)) is
equal to zero.
The two-tailed alternative hypothesis (\(H_1\)) assumes that \(\mu_d\) is not equal to
zero.
The upper-tailed alternative hypothesis (\(H_1\)) assumes that \(\mu_d\) is greater
than zero.
The lower-tailed alternative hypothesis (\(H_1\)) assumes that \(\mu_d\) is less than
zero.
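In software, the form of the alternative hypothesis is usually a simple option. As a sketch, recent versions of SciPy expose it through the alternative argument of ttest_rel (the paired data below are hypothetical):

    from scipy import stats

    before = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5]
    after  = [12.9, 12.2, 13.4, 13.1, 12.0, 13.2]

    # Two-tailed test: H1 is that the mean difference is not zero
    print(stats.ttest_rel(after, before, alternative="two-sided"))
    # Upper-tailed test: H1 is that the mean difference is greater than zero
    print(stats.ttest_rel(after, before, alternative="greater"))
    # Lower-tailed test: H1 is that the mean difference is less than zero
    print(stats.ttest_rel(after, before, alternative="less"))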
Assumptions: As a parametric procedure (a procedure which estimates unknown parameters),
the paired sample t-test makes several assumptions. Although t-tests are quite robust, it is good
practice to evaluate the degree of deviation from these assumptions in order to assess the quality
of the results. In a paired sample t-test, the observations are defined as the differences between
two sets of values, and each assumption refers to these differences, not the original data values.
The paired sample t-test has four main assumptions:
The dependent variable must be continuous (interval/ratio).
The observations are independent of one another.
The dependent variable should be approximately normally distributed.
The dependent variable should not contain any outliers.
Level of Measurement: The paired sample t-test requires the sample data to be numeric and
continuous, as it is based on the normal distribution. Continuous data can take on any value
within a range (income, height, weight, etc.). The opposite of continuous data is discrete data,
which can only take on a few values (Low, Medium, High, etc.). Occasionally, discrete data can
be used to approximate a continuous scale, such as with Likert-type scales.
Normality: To test the assumption of normality, a variety of methods are available, but the
simplest is to inspect the data visually using a tool like a histogram (Figure 1). Real-world data
are almost never perfectly normal, so this assumption can be considered reasonably met if the
shape looks approximately symmetric and bell-shaped, as illustrated by the example histogram in Figure 1.
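A quick visual check of this kind can be sketched in Python with a matplotlib histogram; the difference scores below are randomly generated stand-ins, not real data:

    import matplotlib.pyplot as plt
    import numpy as np

    # Hypothetical difference scores (after - before) for 200 subjects
    rng = np.random.default_rng(0)
    differences = rng.normal(loc=2.0, scale=1.5, size=200)

    # A roughly symmetric, bell-shaped histogram suggests the
    # normality assumption is reasonably met
    plt.hist(differences, bins=20, edgecolor="black")
    plt.xlabel("Difference score")
    plt.ylabel("Frequency")
    plt.title("Distribution of paired differences")
    plt.show()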
However, the possibility that the null hypothesis is true and that we simply obtained a very rare
result can never be ruled out completely. The cutoff value for determining statistical significance
is ultimately decided on by the researcher, but usually a value of .05 or less is chosen. This
corresponds to a 5% (or less) chance of obtaining a result like the one that was observed if the
null hypothesis was true.
REFERENCES
1. Educational Statistics (8614), Department of Early Childhood Education and
Elementary Teacher Education by Allama Iqbal Open University, Islamabad.
2. https://www.statisticshowto.com/probability-and-statistics/t-distribution/independent-samples-t-test/
3. https://www.statisticssolutions.com/manova-analysis-paired-sample-t-test
Question 2: Why do we use regression analysis? Write down the types of
regression.
Answer: Regression: A regression finds the best line that predicts the dependent variable
from the independent variable. The decision of which variable is called dependent and which is
called independent is an important matter in regression, as a different best-fit line will result
if we exchange the two variables, i.e. dependent to independent and independent to
dependent. The line that best predicts the independent variable from the dependent variable is not
the same as the line that predicts the dependent variable from the independent variable.
Let us start with the simple case of studying the relationship between two variables X and Y.
The variable Y is the dependent variable and the variable X is the independent variable. We are
interested in seeing how various values of the independent variable X predict corresponding
values of dependent Y. This statistical technique is called regression analysis. We can say
that regression analysis is a technique that is used to model the dependency of one dependent
variable upon one independent variable. Merriam-Webster online dictionary defines
regression as a functional relationship between two or more correlated variables that is often
empirically determined from data and is used especially to predict values of one variable
when given values of the others. According to Gravetter & Wallnau (2002), regression is the
statistical technique for finding the best-fitting straight line for a set of data, and the
resulting straight line is called the regression line.
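As a sketch of finding such a best-fitting straight line in practice, the following uses scipy.stats.linregress on a small hypothetical data set:

    from scipy import stats

    # Hypothetical data: X = hours studied, Y = exam score
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [52, 55, 61, 64, 70, 72, 75, 81]

    # Fit the best-fitting straight line Y = slope * X + intercept
    result = stats.linregress(x, y)
    print(f"Y = {result.slope:.2f} * X + {result.intercept:.2f}")
    print(f"r = {result.rvalue:.3f}, p = {result.pvalue:.4f}")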
Why we use Regression Analysis:
viii. Business analysts use the data points of previous sales data as well as current sales data in an
organisation to understand and predict future success.
ix. Organisations use regression analysis in order to predict future events. In this process, the
business analysts predict the mean of the dependent variable for given specific values of
the independent variables. Multivariate linear regression is used for various important
purposes, such as forecasting sales volumes or creating growth plans.
x. Organisations, in order to run smoothly as well as efficiently, need better decisions and
must understand the effects of the decision taken. Organisations collect data about sales,
investments, expenditures and other parameters and analyse it for improvement. The
regression analysis helps the organisations to make sense of the data which is then used
for gaining insights into an organisation.
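As a sketch of the forecasting use mentioned in point ix, the following fits a multivariate linear regression with scikit-learn's LinearRegression; all of the sales figures are hypothetical numbers invented for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical predictors: advertising spend and number of sales staff
    X = np.array([[10, 3], [15, 4], [20, 4], [25, 5], [30, 6], [35, 6]])
    # Hypothetical sales volumes for those six periods
    y = np.array([110, 135, 160, 190, 215, 240])

    model = LinearRegression().fit(X, y)
    # Forecast sales for a planned spend of 40 with 7 staff
    forecast = model.predict(np.array([[40, 7]]))
    print(f"Forecast sales volume: {forecast[0]:.1f}")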
Types of regression: The different types of regression in machine learning techniques are
explained below in detail:
iv. Lasso Regression: Lasso Regression is one of the types of regression in machine learning
that performs regularization along with feature selection. It prohibits the absolute size of
the regression coefficient. As a result, the coefficient value gets nearer to zero, which
does not happen in the case of Ridge Regression.
Due to this, feature selection gets used in Lasso Regression, which allows selecting a set
of features from the dataset to build the model. In the case of Lasso Regression, only the
required features are used, and the other ones are made zero. This helps in avoiding
overfitting in the model. If the independent variables are highly collinear, then
Lasso regression picks only one variable and shrinks the other variables to zero.
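This shrinkage behaviour can be seen in a short sketch with scikit-learn's Lasso, on synthetic data in which only the first two of five features actually matter:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    # Synthetic data: y depends only on the first two of five features
    X = rng.normal(size=(100, 5))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

    # The L1 penalty drives the coefficients of irrelevant features
    # toward (and often exactly to) zero
    lasso = Lasso(alpha=0.1).fit(X, y)
    print(lasso.coef_)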
v. Polynomial Regression: Polynomial Regression is another one of the types of regression
analysis techniques in machine learning, which is the same as Multiple Linear Regression
with a little modification. In Polynomial Regression, the relationship between the
independent and dependent variables, that is X and Y, is modelled by an n-th degree polynomial.
It is fitted as a linear model, and the least squares method is used in Polynomial
Regression as well. The best-fit line in Polynomial Regression that follows the
data points is not a straight line but a curved line, which depends upon the power of X, or
the value of n.
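A minimal polynomial-regression sketch using numpy.polyfit on hypothetical curved data (here the degree n is taken to be 2):

    import numpy as np

    # Hypothetical data that follows a curved (roughly quadratic) trend
    x = np.array([0, 1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([1.1, 2.9, 7.2, 13.1, 20.8, 31.2, 43.0])

    # Fit a degree-2 polynomial (n = 2) by least squares
    coeffs = np.polyfit(x, y, deg=2)
    print("Fitted coefficients (highest power first):", coeffs)

    # Predicted value along the fitted curve
    print("Prediction at x = 7:", np.polyval(coeffs, 7.0))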
vi. Bayesian Linear Regression: Bayesian Regression is one of the types of regression in
machine learning that uses Bayes' theorem to find out the values of the regression
coefficients. In this method of regression, the posterior distribution of the coefficients is
determined instead of finding single least-squares estimates. Bayesian Linear Regression is like both
Linear Regression and Ridge Regression but is more stable than simple Linear
Regression.
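One readily available implementation of this idea is scikit-learn's BayesianRidge; a brief sketch on synthetic data:

    import numpy as np
    from sklearn.linear_model import BayesianRidge

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 3))
    y = 1.5 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=50)

    # Bayesian linear regression: the coefficients get a posterior
    # distribution rather than single least-squares point estimates,
    # so predictions come with an uncertainty estimate
    model = BayesianRidge().fit(X, y)
    mean_pred, std_pred = model.predict(X[:1], return_std=True)
    print(f"Prediction: {mean_pred[0]:.2f} +/- {std_pred[0]:.2f}")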
REFERENCES
1. Educational Statistics (8614), Department of Early Childhood Education and
Elementary Teacher Education by Allama Iqbal Open University, Islamabad.
2. https://analyticsindiamag.com/why-regression-analysis-is-the-backbone-for-enterprises/
3. https://www.upgrad.com/blog/types-of-regression-models-in-machine-learning/
Question 3: Write a short note on one-way ANOVA. Write down main
assumptions underlying one-way ANOVA.
Answer: One-way ANOVA: ANOVA is short for ANalysis Of VAriance. The main purpose of
an ANOVA is to test if two or more groups differ from each other significantly in one or more
characteristics.
The one-way analysis of variance (ANOVA) is used to determine whether there are any
statistically significant differences between the means of two or more independent (unrelated)
groups (although you tend to only see it used when there are a minimum of three, rather than two
groups). For example, you could use a one-way ANOVA to understand whether exam
performance differed based on test anxiety levels amongst students, dividing students into three
independent groups (e.g., low, medium and high-stressed students). Also, it is important to
realize that the one-way ANOVA is an omnibus test statistic and cannot tell you which specific
groups were statistically significantly different from each other; it only tells you that at least two
groups were different. Since you may have three, four, five or more groups in your study design,
determining which of these groups differ from each other is important. You can do this using a
post hoc test.
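A minimal sketch of the exam-anxiety example in Python using scipy.stats.f_oneway; the scores for the three anxiety groups are hypothetical:

    from scipy import stats

    # Hypothetical exam scores for three independent anxiety groups
    low    = [78, 84, 81, 90, 86, 82]
    medium = [72, 75, 79, 70, 74, 77]
    high   = [60, 66, 58, 64, 62, 67]

    # Omnibus one-way ANOVA: tests whether at least two group means differ
    f_stat, p_value = stats.f_oneway(low, medium, high)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")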
Post Hoc Tests: Post hoc tests are useful if your independent variable includes more than two
groups. In the exam-anxiety example above, the independent variable has three factor levels (low,
medium and high anxiety). If more than two factor levels are given, it might be useful to run
pairwise tests to test which differences between groups are significant. Because executing
several pairwise tests in one analysis inflates the chance of a Type I error, the Bonferroni
adjustment should be selected, which corrects for multiple pairwise comparisons. Another method
commonly employed is the Student-Newman-Keuls test (S-N-K for short), which pools the
groups that do not differ significantly from each other. This improves the reliability of
the post hoc comparison because it increases the sample size used in the comparison.
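A sketch of Bonferroni-corrected pairwise comparisons, using scipy for the individual t-tests and the multipletests helper from statsmodels for the adjustment (hypothetical group scores):

    from itertools import combinations
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    groups = {
        "low":    [78, 84, 81, 90, 86, 82],
        "medium": [72, 75, 79, 70, 74, 77],
        "high":   [60, 66, 58, 64, 62, 67],
    }

    # All pairwise independent t-tests between the three groups
    pairs = list(combinations(groups, 2))
    pvals = [stats.ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]

    # Bonferroni adjustment for the three simultaneous comparisons
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
    for (a, b), p, r in zip(pairs, p_adj, reject):
        print(f"{a} vs {b}: adjusted p = {p:.4f}, significant = {r}")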
The One-Way ANOVA is often used to analyze data from the following types of studies:
Field studies
Experiments
Quasi-experiments
The One-Way ANOVA is commonly used to test the following:
Statistical differences among the means of two or more groups
Statistical differences among the means of two or more interventions
Statistical differences among the means of two or more change scores
Assumption-1: Your dependent variable should be measured at the interval or ratio level (i.e., it
is continuous).
Assumption-2: Your independent variable should consist of two or more categorical, independent
groups.
Assumption-3: You should have independence of observations, which means that there is no
relationship between the observations in each group or between the groups themselves. For
example, there must be different participants in each group with no participant being in more
than one group. This is more of a study design issue than something you can test for, but it is an
important assumption of the one-way ANOVA. If your study fails this assumption, you will need
to use another statistical test instead of the one-way ANOVA (e.g., a repeated measures design).
Assumption-4: There should be no significant outliers. Outliers are simply single data points
within your data that do not follow the usual pattern (e.g., in a study of 100 students' IQ scores,
where the mean score was 108 with only a small variation between students, one student had a
score of 156, which is very unusual, and may even put her in the top 1% of IQ scores globally).
The problem with outliers is that they can have a negative effect on the one-way ANOVA,
reducing the validity of your results. Fortunately, when using SPSS Statistics to run a one-way
ANOVA on your data, you can easily detect possible outliers. In our enhanced one-way ANOVA
guide, we: (a) show you how to detect outliers using SPSS Statistics; and (b) discuss some of the
options you have in order to deal with outliers.
Assumption-5: Your dependent variable should be approximately normally distributed for each
category of the independent variable. We talk about the one-way ANOVA only requiring
approximately normal data because it is quite "robust" to violations of normality, meaning that
assumption can be a little violated and still provide valid results. You can test for normality using
the Shapiro-Wilk test of normality, which is easily tested for using SPSS Statistics. In addition to
showing you how to do this in our enhanced one-way ANOVA guide, we also explain what you
can do if your data fails this assumption (i.e., if it fails it more than a little bit).
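Outside SPSS, the Shapiro-Wilk test is also available in SciPy; a minimal sketch on hypothetical scores for one group:

    from scipy import stats

    # Hypothetical scores for one category of the independent variable
    scores = [78, 84, 81, 90, 86, 82, 79, 85, 88, 83]

    # Shapiro-Wilk test: a small p-value (e.g. p < .05) suggests the
    # data deviate significantly from a normal distribution
    w_stat, p_value = stats.shapiro(scores)
    print(f"W = {w_stat:.3f}, p = {p_value:.3f}")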
Assumption-6: There needs to be homogeneity of variances. You can test this assumption in
SPSS Statistics using Levene's test for homogeneity of variances. If your data fails this
assumption, you will need to not only carry out a Welch ANOVA instead of a one-way ANOVA,
which you can do using SPSS Statistics, but also use a different post hoc test. In our enhanced
one-way ANOVA guide, we (a) show you how to perform Levene’s test for homogeneity of
variances in SPSS Statistics, (b) explain some of the things you will need to consider when
interpreting your data, and (c) present possible ways to continue with your analysis if your data
fails to meet this assumption, including running a Welch ANOVA in SPSS Statistics instead of a
one-way ANOVA, and a Games-Howell test instead of a Tukey post hoc test.
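Levene's test can likewise be run outside SPSS; a brief SciPy sketch on hypothetical groups (the Welch ANOVA itself would then be run in SPSS as described, or in another statistical package):

    from scipy import stats

    low    = [78, 84, 81, 90, 86, 82]
    medium = [72, 75, 79, 70, 74, 77]
    high   = [60, 66, 58, 64, 62, 67]

    # Levene's test for homogeneity of variances: a small p-value
    # suggests the variances differ, pointing toward a Welch ANOVA
    stat, p_value = stats.levene(low, medium, high)
    print(f"W = {stat:.3f}, p = {p_value:.3f}")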
REFERENCES
1. Educational Statistics (8614), Department of Early Childhood Education and
Elementary Teacher Education by Allama Iqbal Open University, Islamabad.
2. https://www.statisticssolutions.com/Conduct-and-Interpret-One-Way-ANOVA/
3. https://statistics.laerd.com/spss-tutorials/one-way-anova-using-spss-statistics.php
4. https://libguides.library.kent.edu/spss/onewayanova
Question 4: What do you know about chi-square (χ2) goodness of fit test?
Write down the procedure for goodness of fit test.
Answer: Chi-square (χ2) goodness of fit test: The Chi-Square goodness of fit test is a non-
parametric test that is used to find out how the observed value of a given phenomenon
differs significantly from the expected value. In the Chi-Square goodness of fit test, the term
goodness of fit is used to compare the observed sample distribution with the expected probability
distribution. The Chi-Square goodness of fit test determines how well a theoretical distribution (such
as normal, binomial, or Poisson) fits the empirical distribution. In the Chi-Square goodness of fit
test, sample data is divided into intervals. Then the numbers of points that fall into each interval
are compared with the expected numbers of points in each interval.
The Chi-square goodness of fit test checks whether your sample data is likely to be from a
specific theoretical distribution. We have a set of data values, and an idea about how the data
values are distributed. The test gives us a way to decide if the data values have a “good enough”
fit to our idea, or if our idea is questionable.
When an analyst attempts to fit a statistical model to observed data, he or she may wonder how
well the model actually reflects the data. How "close" are the observed values to those which
would be expected under the fitted model? One statistical test that addresses this issue is the chi-
square goodness of fit test. This test is commonly used to test association of variables in two-way
tables (see "Two-Way Tables and the Chi-Square Test"), where the assumed model of
independence is evaluated against the observed data.
In the chi-square goodness of fit test, sample data is divided into intervals. Then, the numbers of
points that fall into the intervals are compared with the expected numbers of points in
each interval. The null hypothesis for the chi-square goodness of fit test is that the data
come from the specified distribution. The alternate hypothesis is that the data do not come
from the specified distribution. The formula for the chi-square goodness of fit test is:
\[ \chi^2 = \sum \frac{(Observed - Expected)^2}{Expected} \]
If the computed test statistic is large, then the observed and expected values are not close and the
model is a poor fit to the data.
A test based upon the Chi-squared distribution is a nonparametric test. Nonparametric tests
determine the probability that an observed distribution of data, based upon rankings or
distribution into categories of a qualitative nature, is due to chance (sampling error) alone. If you
have numbers that appear to follow a normal or t-distribution, then you would want to use a
parametric test such as the Student's t-test to address your question. The chi-square test is very
useful, especially when data are not quantitative.
Hypothesis Testing: We use the chi-square test to test the validity of a distribution assumed for
a random phenomenon. The test evaluates the null hypotheses H0 (that the data are governed by
the assumed distribution) against the alternative (that the data are not drawn from the assumed
distribution).
Estimating Parameters: Often, the null hypothesis involves fitting a model with parameters
estimated from the observed data. Suppose, for instance, that a gambler's dice are suspected of
being loaded; we might wish to fit a binomial model to evaluate the probability of rolling a six
with the loaded dice.
We know that this probability is not equal to 1/6, so we might estimate this value by calculating
the probability from the data. By estimating a parameter, we lose a degree of freedom in the chi-
square test statistic. In general, if we estimate d parameters under the null hypothesis with k
possible counts the degrees of freedom for the associated chi-square distribution will be k - 1 - d.
Two potential disadvantages of chi-square are:
i. The chi-square test can only be used on data that have been put into classes. If there are
data that have not been put into classes, then it is necessary to make a frequency table or
histogram before performing the test.
ii. It requires sufficient sample size in order for chi-square approximation to be valid.
Example: A small community gym might be operating under the assumption that it has its
highest attendance on Mondays, Tuesdays and Saturdays, average attendance on Wednesdays,
and Thursdays, and lowest attendance on Fridays and Sundays. Based on these assumptions, the
gym employs a certain number of staff members each day to check in members, clean facilities,
offer training services, and teach classes.
However, the gym is not performing well financially and the owner wants to know if these
attendance assumptions and staffing levels are correct. The owner decides to count the number of
gym attendees each day for six weeks. He can then compare the gym's assumed attendance with
its observed attendance using a chi-square goodness-of-fit test for example. With the new data,
he can determine how to best manage the gym and improve profitability.
Procedure for goodness of fit test: The procedure for carrying out a goodness of fit test is as
follows:
i. State the null hypothesis (H0): It might take the form: the data are consistent with a
specified distribution.
ii. State the alternate hypothesis (Ha): This is an opposite statement to the null
hypothesis: the data are not consistent with the specified distribution.
iii. Calculate the Test Statistic: The test statistic is calculated using the formula:
\[ \chi^2 = \sum \frac{(Observed - Expected)^2}{Expected} \]
iv. Find the p-value: The range of the p-value can be found by comparing the test statistic to
table values.
v. Reach a conclusion: We need a p-value less than the significance level, generally less than
5% (p < .05), to reject the null hypothesis. It is suitable to write a sentence in the context of
the question, e.g. “the data appear to follow a normal distribution”.
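To make the procedure concrete, here is a sketch of the earlier gym example using scipy.stats.chisquare; both the owner's assumed attendance proportions and the observed counts are hypothetical numbers invented for illustration:

    from scipy import stats

    # Observed attendance over six weeks (Mon..Sun), hypothetical counts
    observed = [530, 510, 380, 390, 260, 540, 250]

    # Expected counts under the owner's assumed pattern (high Mon/Tue/Sat,
    # average Wed/Thu, low Fri/Sun), scaled so the totals match
    total = sum(observed)
    assumed_props = [0.20, 0.20, 0.11, 0.11, 0.09, 0.20, 0.09]
    expected = [p * total for p in assumed_props]

    # Steps iii-v: compute the statistic, get the p-value, conclude
    chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
    print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
    print("Reject H0" if p_value < 0.05 else "Fail to reject H0")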
REFERENCES
1. Educational Statistics (8614), Department of Early Childhood Education and
Elementary Teacher Education by Allama Iqbal Open University, Islamabad.
2. https://www.statisticssolutions.com/chi-square-goodness-of-fit-test/
3. https://www.ruf.rice.edu/~bioslabs/tools/stats/chisquare.html
Question 5: What is chi-square (χ2) independence test? Explain in detail.
Answer: Chi-square (χ2) independence test: A chi-square (χ2) test of independence is the
second important form of chi-square tests. It is used to explore the relationship between two
categorical variables. Each of these variables can have two or more categories.
The Chi-square test of independence (also known as the Pearson Chi-square test, or simply the
Chi-square) is one of the most useful statistics for testing hypotheses when the variables are
nominal, as often happens in clinical research. Unlike most statistics, the Chi-square (χ2) can
provide information not only on the significance of any observed differences, but also provides
detailed information on exactly which categories account for any differences found. Thus, the
amount and detail of information this statistic can provide renders it one of the most useful tools
in the researcher’s array of available analysis tools. As with any statistic, there are requirements
for its appropriate use, which are called “assumptions” of the statistic. Additionally, the χ2 is a
significance test, and should always be coupled with an appropriate test of strength.
It determines if there is a significant relationship between two nominal (categorical) variables.
The frequency of one nominal variable is compared with different values of the second nominal
variable. For example, the researcher wants to examine the relationship between gender (male
and female) and empathy (high vs. low). The researcher will use chi-square test of independence.
If the null hypothesis is accepted, there would be no relationship between gender and empathy. If
the null hypothesis is rejected, then the conclusion will be that there is a relationship between
gender and empathy (e.g. females tend to score higher on empathy and males tend to score lower on
empathy). The chi-square test for independence compares two sets of data to see if there is a
relationship.
The Chi-square test of independence checks whether two variables are likely to be related or not.
We have counts for two categorical or nominal variables. We also have an idea that the two
variables are not related. The test gives us a way to decide if our idea is plausible or not.
The chi-square test of independence, being a non-parametric technique, follows less strict
assumptions; still, there are some general assumptions which should be taken care of:
i. Random Sample - Sample should be selected using simple random sampling method.
ii. Variables - Both variables under study should be categorical.
iii. Independent Observations – Each person or case should be counted only once, and none
should appear in more than one category or group. The data from one subject should not
influence the data from another subject.
iv. If the data are displayed in a contingency table, the expected frequency count for each cell
of the table is at least 5.
The Chi-Square Test of Independence determines whether there is an association between
categorical variables (i.e., whether the variables are independent or related). It is a nonparametric
test.
This test utilizes a contingency table to analyze the data. A contingency table (also known as a
cross-tabulation, crosstab, or two-way table) is an arrangement in which data is classified
according to two categorical variables. The categories for one variable appear in the rows, and
the categories for the other variable appear in columns. Each variable must have two or more
categories. Each cell reflects the total count of cases for a specific pair of categories.
For the Chi-square test of independence, we need two variables. Our idea is that the variables are
not related.
Data values that are a simple random sample from the population of interest.
Two categorical or nominal variables. Don't use the independence test with continuous
variables that define the category combinations. However, the counts for the combinations
of the two categorical variables will be continuous.
For each combination of the levels of the two variables, we need at least five expected
values. When we have fewer than five for any one combination, the test results are not
reliable.
Formula: The formula is given as:
\[ \chi^2 = \sum \frac{(Observed - Expected)^2}{Expected} \]
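A minimal sketch of the gender-and-empathy example from above, using scipy.stats.chi2_contingency on a hypothetical 2x2 contingency table of counts:

    from scipy import stats

    # Hypothetical contingency table of counts:
    #              high empathy   low empathy
    # female            45             25
    # male              30             50
    observed = [[45, 25], [30, 50]]

    # Expected counts are computed from the row and column totals
    chi2, p_value, dof, expected = stats.chi2_contingency(observed)
    print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
    print("Expected counts:", expected)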
The non-parametric tests, including the χ2 assume the data were obtained through random
selection. However, it is not uncommon to find inferential statistics used when data are from
convenience samples rather than random samples. (To have confidence in the results when the
random sampling assumption is violated, several replication studies should be performed with
essentially the same result obtained). Each non-parametric test has its own specific assumptions
as well.
i. The data in the cells should be frequencies, or counts of cases rather than percentages or
some other transformation of the data.
ii. The levels (or categories) of the variables are mutually exclusive. That is, a particular
subject fits into one and only one level of each of the variables.
iii. Each subject may contribute data to one and only one cell in the χ2. If, for example, the
same subjects are tested over time such that the comparisons are of the same subjects at
Time 1, Time 2, Time 3, etc., then χ2 may not be used.
iv. The study groups must be independent. This means that a different test must be used if
the two groups are related. For example, a different test must be used if the researcher’s
data consists of paired samples, such as in studies in which a parent is paired with his or
her child.
v. There are 2 variables, and both are measured as categories, usually at the nominal level.
However, data may be ordinal data. Interval or ratio data that have been collapsed into
ordinal categories may also be used. While Chi-square has no rule about limiting the
number of cells (by limiting the number of categories for each variable), a very large
number of cells (over 20) can make it difficult to meet assumption number vi below, and
to interpret the meaning of the results.
vi. The value of the cell expected should be 5 or more in at least 80% of the cells, and no cell
should have an expected frequency of less than one (3). This assumption is most likely to be
met if the sample size equals at least the number of cells multiplied by 5. Essentially, this
assumption specifies the number of cases (sample size) needed to use the χ2 for any
number of cells in that χ2.
Uses: It should be used when any one of the following conditions pertains to the data:
i. The level of measurement of all the variables is nominal or ordinal.
ii. The sample sizes of the study groups are unequal; for the χ2 the groups may be of equal
size or unequal size whereas some parametric tests require groups of equal or
approximately equal size.
iii. The original data were measured at an interval or ratio level, but violate one of the
following assumptions of a parametric test:
The distribution of the data was seriously skewed or kurtotic (parametric tests assume
approximately normal distribution of the dependent variable), and thus the researcher
must use a distribution free statistic rather than a parametric statistic.
The data violate the assumptions of equal variance or homoscedasticity.
For any of a number of reasons, the continuous data were collapsed into a small
number of categories, and thus the data are no longer interval or ratio.
Advantages: Advantages of the Chi-square include its robustness with respect to distribution of
the data, its ease of computation, the detailed information that can be derived from the test, its
use in studies for which parametric assumptions cannot be met, and its flexibility in handling
data from both two group and multiple group studies.
Limitations: Limitations include its sample size requirements, difficulty of interpretation when
there are large numbers of categories (20 or more) in the independent or dependent variables, and
a tendency to produce relatively low correlation measures, even for highly significant results.
REFERENCES
1. Educational Statistics (8614), Department of Early Childhood Education and
Elementary Teacher Education by Allama Iqbal Open University, Islamabad.
2. https://www.jmp.com/en_ca/statistics-knowledge-portal/chi-square-test/chi-square-test-of-independence.html
3. https://www.investopedia.com/terms/c/chi-square-statistic.asp
4. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900058/
5. https://online.stat.psu.edu/stat500/lesson/8/8.1