Course Code: 8614
Assignment: 2
Program: B.Ed
Mode
Consider the following data set: 25, 43, 39, 25, 82, 77, 25, 47.
The score 25 appears most frequently, so it is the mode. Sometimes there is no mode at all, when no value appears more often than any other. There may be one mode (uni-modal), two modes (bi-modal), three modes (tri-modal), or more than three modes (multi-modal). The mode is most useful when scores reflect a nominal scale of measurement, but along with the mean and median it can also be used for ordinal, interval, or ratio data. It can be located in the following ways.
By hand
To find the mode manually, arrange the numbers in ascending or descending order, then
count how often each number appears. The number that appears most often is the mode.
Example: 55, 4, 28, 44, 32, 55, 32, 45, 48, 6, 44, 28, 14, 23, 12, 32, 44
In order, these numbers are: 4, 6, 12, 14, 23, 28, 28, 32, 32, 32, 44, 44, 44, 45, 48, 55, 55.
It is now easy to see which numbers appear most often. In this case, the data set is bimodal and has two modes: 32 and 44. The number 55 appears twice, but 32 and 44 both appear three times.
Using a calculator
Input all the numbers in your data set following the calculator's instructions. You may need to add a comma after each number.
Select "calculate."
With Excel
Enter each number of your data set into a separate cell in a single column.
If you wish, sort the column from largest to smallest; you can already determine the mode manually at this stage.
Select the cell where you want the mode to display. This cannot be a cell that contains a number from your data set.
In the function box, type "=MODE(FIRST CELL:LAST CELL)" using your specific cell references (for example, =MODE(A1:A17)) and press Enter.
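As a further illustration (Python is a tool choice of this note, not something the assignment requires), the same mode calculation can be sketched with the standard library; the scores are the ones from the example above.

# Minimal sketch: locating the mode(s) of the example data set in Python.
from statistics import multimode

scores = [55, 4, 28, 44, 32, 55, 32, 45, 48, 6, 44, 28, 14, 23, 12, 32, 44]

# multimode() returns every value that occurs with the highest frequency,
# so it handles uni-modal, bi-modal, and multi-modal data sets alike.
modes = multimode(scores)
print(modes)  # [44, 32] -> both appear three times, so the data set is bimodal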
Merits of Mode
iii) Even if the extreme values are not known, the mode can still be calculated.
Demerits of Mode
Reference
Anderson, T. W., & Sclove, S. L. (1974). Introductory Statistical Analysis. Boston: Houghton Mifflin Company.
T-Test
A t-test is a useful statistical technique used for comparing mean values of two data sets
obtained from two groups. The comparison tells us whether these data sets are different
from each other. It further tells us how significant the differences are and if these
differences could have happened by chance. The statistical significance of a t-test indicates whether or not the difference between the means of the two groups most likely reflects a real difference in the population from which the groups are selected. t-tests are used when there are two groups (e.g., male and female) or two sets of data (e.g., before and after), and we want to compare their mean scores on a continuous variable.
T-test applications
The t-test is used to compare the means of two samples, dependent or independent.
It can also be used to determine whether a sample mean differs from an assumed (hypothesized) mean.
The t-test also has an application in determining the confidence interval for a sample mean.
A flowchart can be used to determine which t-test to use based on the
characteristics of the sample sets. The key items to consider include the similarity of the
sample records, the number of data records in each sample set, and the variance of each
sample set.
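To make the procedure concrete, here is a hedged sketch of an independent-samples t-test in Python with SciPy; the two groups of scores are made-up values, not data from the assignment.

# Minimal sketch: independent-samples t-test on two hypothetical groups.
from scipy import stats

group_a = [72, 65, 80, 75, 68, 77]   # e.g., scores of one group (hypothetical)
group_b = [70, 60, 66, 71, 63, 69]   # e.g., scores of a second group (hypothetical)

# ttest_ind compares the means of two independent samples;
# equal_var=True assumes the two groups have equal variances.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(t_stat, p_value)  # a small p-value (e.g., < 0.05) suggests a real mean difference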
Reference
Gravetter, F. J., & Wallnau, L. B. (2002). Essentials of Statistics for the Behavioral Sciences.
Regression Analysis
Regression analysis is used when you want to predict a continuous dependent variable from a number of independent variables. If the dependent variable is dichotomous, then logistic
regression should be used. (If the split between the two levels of the dependent variable is
close to 50-50, then both logistic and linear regression will end up giving you similar results.)
The independent variables used in regression can be either continuous or dichotomous.
Independent variables with more than two levels can also be used in regression analyses,
but they first must be converted into variables that have only two levels. This is called
dummy coding and will be discussed later. Usually, regression analysis is used with naturally occurring variables, as opposed to experimentally manipulated variables, although
you can use regression with experimentally manipulated variables. One point to keep in
mind with regression analysis is that causal relationships among the variables cannot be
determined. While the terminology is such that we say that X "predicts" Y, we cannot say
that X "causes" Y.
Application of Regression
Number of cases
When doing regression, the cases-to-Independent Variables (IVs) ratio should ideally be
20:1; that is, 20 cases for every IV in the model. The lowest your ratio should be is 5:1 (i.e., 5 cases for every IV in the model).
Accuracy of data
If you have entered the data (rather than using an established dataset), it is a good idea to
check the accuracy of the data entry. If you don't want to re-check each data point, you
should at least check the minimum and maximum value for each variable to ensure that all
values for each variable are "valid." For example, a variable that is measured using a 1 to 5 scale should not have any values below 1 or above 5.
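A small sketch of this range check in Python with pandas follows; the variable names and the 1-to-5 scale are hypothetical, chosen only for illustration.

# Minimal sketch: checking that every value of a 1-to-5 survey item falls in range.
import pandas as pd

# Hypothetical data set; in practice this would be read from your own file.
df = pd.DataFrame({"satisfaction": [3, 5, 1, 4, 2, 7],   # 7 is a data-entry error
                   "age": [21, 34, 29, 45, 38, 26]})

# The min and max of each column reveal out-of-range ("invalid") values at a glance.
print(df.agg(["min", "max"]))
print(df[(df["satisfaction"] < 1) | (df["satisfaction"] > 5)])  # rows needing correction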
Missing data
You also want to look for missing data. If specific variables have a lot of missing values, you
may decide not to include those variables in your analyses. If only a few cases have any
missing values, then you might want to delete those cases. If there are missing values for
several cases on different variables, then you probably don't want to delete those cases
(because a lot of your data will be lost). If there is not too much missing data, and there
does not seem to be any pattern in terms of what is missing, then you don't really need to
worry. Just run your regression, and any cases that do not have values for the variables used
in that regression will not be included. Although tempting, do not assume that there is no
pattern; check for this. To do this, separate the dataset into two groups: those cases missing
values for a certain variable, and those not missing a value for that variable. Using t-tests,
you can determine if the two groups differ on other variables included in the sample. For
example, you might find that the cases that are missing values for the "salary" variable are
younger than those cases that have values for salary. You would want to do t-tests for each
variable with a lot of missing values. If there is a systematic difference between the two
groups (i.e., the group missing values vs. the group not missing values), then you would
need to keep this in mind when interpreting your findings and not overgeneralize.
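The comparison just described can be sketched in Python; the small data frame and the "salary" and "age" variable names below are only illustrative assumptions, not part of the original text.

# Minimal sketch: testing whether cases missing "salary" differ in "age"
# from cases that have a salary value.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({"salary": [52000, np.nan, 61000, np.nan, 45000, 58000, np.nan, 49000],
                   "age":    [41,    24,     38,    27,     45,    36,    22,     40]})

missing = df[df["salary"].isna()]["age"]     # cases missing a salary value
present = df[df["salary"].notna()]["age"]    # cases with a salary value

# Independent-samples t-test on "age" for the two groups; a small p-value
# would suggest the missingness is systematic rather than random.
t_stat, p_value = stats.ttest_ind(missing, present, equal_var=False)
print(t_stat, p_value)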
After examining your data, you may decide that you want to replace the missing values with
some other value. The easiest thing to use as the replacement value is the mean of this
variable. Some statistics programs have an option within regression where you can replace
the missing value with the mean. Alternatively, you may want to substitute a group mean
(e.g., the mean for females) rather than the overall mean.
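Both replacement strategies can be sketched in pandas; the variable and group names below are hypothetical.

# Minimal sketch: replacing missing values with the overall mean or a group mean.
import numpy as np
import pandas as pd

df = pd.DataFrame({"gender": ["F", "M", "F", "M", "F"],
                   "salary": [52000, np.nan, 61000, 45000, np.nan]})

# Option 1: substitute the overall mean of the variable.
df["salary_overall"] = df["salary"].fillna(df["salary"].mean())

# Option 2: substitute the group mean (e.g., the mean for females or males).
df["salary_group"] = df["salary"].fillna(
    df.groupby("gender")["salary"].transform("mean"))
print(df)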
The default option of statistics packages is to exclude cases that are missing values for any
variable that is included in regression. (But that case could be included in another
regression, as long as it was not missing values on any of the variables included in that
analysis.) You can change this option so that your regression analysis does not exclude cases
that are missing data for any variable included in the regression, but then you might have a different number of cases for each analysis.
Outliers
You also need to check your data for outliers (i.e., an extreme value on a particular item). An
outlier is often operationally defined as a value that is at least 3 standard deviations above
or below the mean. If you feel that the cases that produced the outliers are not part of the
same "population" as the other cases, then you might just want to delete those cases.
Alternatively, you might want to count those extreme values as "missing," but retain the
case for other variables. Alternatively, you could retain the outlier, but reduce how extreme
it is. Specifically, you might want to recode the value so that it is the highest (or lowest) non-
outlier value.
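The 3-standard-deviation rule and the recoding idea can be sketched in Python; the scores below are invented, with one deliberately extreme value.

# Minimal sketch: flagging values more than 3 SDs from the mean and recoding
# the (high-side) outlier to the highest non-outlier value.
import numpy as np

scores = np.array([14, 13, 15, 12, 16, 14, 15, 13, 14, 16,
                   12, 15, 14, 13, 16, 15, 14, 13, 12, 15, 90])  # 90 is extreme

z = (scores - scores.mean()) / scores.std()
outliers = np.abs(z) >= 3                      # here only the value 90 is flagged

highest_ok = scores[~outliers].max()           # highest non-outlier value (16)
cleaned = np.where(outliers, highest_ok, scores)
print(cleaned)                                 # 90 has been recoded to 16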
Normality
You also want to check that your data is normally distributed. To do this, you can construct
histograms and "look" at the data to see its distribution. Often the histogram will include a
line that depicts what the shape would look like if the distribution were truly normal (and
you can "eyeball" how much the actual distribution deviates from this line).
You can also construct a normal probability plot. In this plot, the actual scores are ranked and sorted, and an expected normal value is computed and compared with an actual normal value for each case. The expected normal value is the position a case with that rank holds in a normal distribution; the actual normal value is the position it holds in the actual distribution. Basically, you would like to see your actual values lining up along the diagonal that goes from lower left to upper right. Another way to check normality within the regression analysis is to look at a plot of the "residuals." Residuals are the difference between obtained and predicted DV scores.
(Residuals will be explained in more detail in a later section.) If the data are normally
distributed, then residuals should be normally distributed around each predicted DV score.
If the data (and the residuals) are normally distributed, the residuals scatterplot will show
the majority of residuals at the center of the plot for each value of the predicted score, with
some residuals trailing off symmetrically from the center. You might want to do the residual
plot before graphing each variable separately because if this residuals plot looks good, then
you don't need to do the separate plots. As an example, consider a residual plot of a regression where age of patient and time (in months since diagnosis) are used to predict breast tumor size.
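The following hedged sketch produces such a residual plot in Python with statsmodels; the data are simulated, and only the variable names follow the example above.

# Minimal sketch: fit a regression and plot residuals against predicted values.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
age = rng.uniform(30, 80, size=100)                      # hypothetical ages
months = rng.uniform(1, 36, size=100)                    # hypothetical months since diagnosis
tumor_size = 5 + 0.02 * age + 0.10 * months + rng.normal(0, 1, size=100)

X = sm.add_constant(np.column_stack([age, months]))      # add the intercept term
model = sm.OLS(tumor_size, X).fit()

# Residuals should scatter symmetrically around zero at every predicted value.
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0)
plt.xlabel("Predicted tumor size")
plt.ylabel("Residual")
plt.show()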
Although there is some debate about the origins of the name, the statistical technique
described above most likely was termed "regression" by Sir Francis Galton in the 19th
century to describe the statistical feature of biological data (such as heights of people in a
population) to regress to some mean level. In other words, while there are shorter and taller
people, only outliers are very tall or short, and most people cluster somewhere around (or
"regress" to) the average.
A regression model output may be in the form of Y = 1.0 + 3.2(X1) − 2.0(X2) + 0.21.
Here we have a multiple linear regression that relates a variable Y to two explanatory variables, X1 and X2. We would interpret the model as saying that the value of Y changes by 3.2 for every one-unit change in X1 (if X1 goes up by 2, Y goes up by 6.4, etc.), holding all else constant (all else equal). That means that, controlling for X2, X1 has this observed relationship with Y. Likewise, holding X1 constant, every one-unit increase in X2 is associated with a 2.0-unit decrease in Y. We can also note the y-intercept of 1.0, meaning that Y = 1 when X1 and X2 are both zero. The error term (residual) is 0.21.
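To make the interpretation concrete, suppose (purely for illustration) that X1 = 3 and X2 = 2. The model then predicts Y = 1.0 + 3.2(3) − 2.0(2) + 0.21 = 1.0 + 9.6 − 4.0 + 0.21 = 6.81.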
What Are the Assumptions That Must Hold for Regression Models?
In order to properly interpret the output of a regression model, the following main assumptions about the underlying data process of what you are analyzing must hold:
i) The relationship between the dependent variable and the independent variables is linear.
ii) The observations (and hence the errors) are independent of one another.
iii) The errors have constant variance across all levels of the predicted values (homoscedasticity).
iv) The errors are approximately normally distributed.
v) The independent variables are not highly correlated with one another (no severe multicollinearity).
Data scientists for professional sports teams often use linear regression to measure the
effect that different training regimens have on player performance.
For example, data scientists in the NBA might analyze how different amounts of weekly
yoga sessions and weightlifting sessions affect the number of points a player scores. They
might fit a multiple linear regression model using yoga sessions and weightlifting sessions as
the predictor variables and total points scored as the response variable. The regression
model would take the following form:
points scored = β0 + β1(yoga sessions) + β2(weightlifting sessions)
The coefficient β0 would represent the expected points scored for a player who participates
in zero yoga sessions and zero weightlifting sessions.
The coefficient β1 would represent the average change in points scored when weekly yoga
sessions is increased by one, assuming the number of weekly weightlifting sessions remains
unchanged.
The coefficient β2 would represent the average change in points scored when weekly
weightlifting sessions is increased by one, assuming the number of weekly yoga sessions
remains unchanged.
Depending on the values of β1 and β2, the data scientists may recommend that a player participate in more or fewer weekly yoga and weightlifting sessions in order to maximize points scored.
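A hedged sketch of how such a model could be estimated in Python follows; the player data are invented, and the coefficient values used to simulate them are arbitrary.

# Minimal sketch: estimating β0, β1 and β2 for the points-scored model from made-up data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
yoga = rng.integers(0, 5, size=60)        # weekly yoga sessions (hypothetical)
lifting = rng.integers(0, 6, size=60)     # weekly weightlifting sessions (hypothetical)
points = 8 + 1.5 * yoga + 2.0 * lifting + rng.normal(0, 2, size=60)

X = sm.add_constant(np.column_stack([yoga, lifting]))
model = sm.OLS(points, X).fit()
print(model.params)   # [β0, β1, β2]: expected points at zero sessions, then the
                      # average change in points per extra yoga / weightlifting session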
Reference
Argyrous, G. (2012). Statistics for Research, with a guide to SPSS. India: SAGE Publications.
ANOVA
The t-tests have one very serious limitation – they are restricted to tests of the significance
of the difference between only two groups. There are many times when we would like to see if there are significant differences among three, four, or even more groups. For example, we may want to investigate which of three teaching methods is best for teaching ninth-class algebra. In such a case, we cannot use a t-test because more than two groups are involved. To deal with such cases, one of the most useful techniques in statistics is analysis of variance (abbreviated as ANOVA). This technique was developed by the British statistician Ronald A. Fisher. ANOVA is a hypothesis-testing procedure that is used to evaluate mean differences between two or more treatments (or populations). Like all other inferential procedures, ANOVA uses sample data as a basis for drawing general conclusions about populations. Sometimes it may appear that ANOVA and the t-test are two different ways of doing exactly the same thing: testing for mean differences. In some cases this is true – both tests use sample data to test hypotheses about population means. However, ANOVA has important advantages over the t-test. t-tests are used when we have to compare only two groups or variables (one independent and one dependent). On the other hand, ANOVA is used when we have more than two groups or treatment conditions. Suppose we want to study the effects of three different models of teaching on the achievement of students. In this case we have three different samples to be treated using three different treatments, so ANOVA is the appropriate technique. As another example, suppose a researcher measures learning performance under three temperature conditions. The scores are variable and we want to
measure the amount of variability (i.e. the size of difference) to explain where it comes
from. To measure the total variability, we will combine all the scores from all the separate
samples into one group and then obtain one general measure of variability for the complete
experiment. Once we have measured the total variability, we can begin to break it into
separate components. The word analysis means breaking into smaller parts. Because we are
going to analyze the variability, the process is called analysis of variance (ANOVA). This
analysis process divides the total variability into two basic components:
i) Between-Treatment Variance
Variance simply means difference, and calculating the variance is a process of measuring how big the differences are for a set of numbers. The between-treatment variance is
measuring how much difference exists between the treatment conditions. In addition to
measuring differences between treatments, the overall goal of ANOVA is to evaluate the
differences between treatments. Specifically, the purpose of the analysis is to distinguish between two alternative explanations:
a) The differences between the treatments have been caused by the treatment effects.
b) The differences between the treatments are simply due to chance.
Thus, there are always two possible explanations for the variance (difference) that exists
between treatments
1) Treatment Effect:
The differences are caused by the treatments. For example, the scores in sample 1 are obtained at a room temperature of 50° and those of sample 2 at 70°. It is possible that the difference between the samples is caused by the difference in room temperature.
2) Chance:
The differences are simply due to chance. If there is no treatment effect, even then we can
expect some difference between samples. The chance differences are unplanned and
unpredictable differences that are not caused or explained by any action of the researcher.
Individual Differences
Each participant of the study has his or her own individual characteristics. Although it is reasonable to expect that different subjects will produce different scores, it is impossible to predict exactly how large the differences will be.
Experimental Error
Whenever a researcher measures the same individuals twice under the same conditions, there is a possibility of obtaining two different measurements. Often these differences are unplanned and uncontrolled, and they are classified as experimental error. Taken together, individual differences and experimental error mean that the between-treatment variance reflects differences that could be caused either by a treatment effect or simply by chance. In order to
demonstrate that the difference is really a treatment effect, we must establish that the
differences between treatments are bigger than would be expected by chance alone. To
accomplish this goal, we will determine how big the differences are when there is no treatment effect involved. That is, we will measure how much difference (variance) occurs within each treatment condition, where the treatment effect cannot be responsible; this is the within-treatment variance.
Assumptions of ANOVA
i) Assumption of Independence
According to this assumption the observations are random and independent samples from
the populations. The null hypothesis actually states that the samples come from populations
that have the same mean. The samples must be random and independent if they are to be
representative of the populations. The value of one observation is not related to any other
observation. In other words, one individual’s score should not provide any clue as to how
any of the other individual should score. That is, one event does not depend on another.
ii) Assumption of Normality
The distributions of the populations from which the samples are selected are normal. This
assumption implies that the dependent variable is normally distributed in each of the
groups. One way ANOVA is considered a robust test against the assumption of normality
and tolerates the violation of this assumption. As regards the normality of grouped data, the
one-way ANOVA can tolerate data that are non-normal (skewed or kurtotic distributions) with only a small effect on the Type I error rate. However, platykurtosis can have a profound effect when group sizes are small. In that situation there are two options:
i) Transform the data using various algorithms so that the shape of the distribution becomes normally distributed, or
ii) Choose the nonparametric Kruskal-Wallis H test, which does not require the assumption of normality.
iii) Assumption of Homogeneity of Variance
The variances of the distributions in the populations are equal. This assumption implies that the distributions in the populations have the same shapes, means, and variances; that is, they are really the same populations. In other words, the variances on the dependent variable are equal across the groups. If the assumption of homogeneity of variances has been violated, adjusted tests such as the Welch test or the Brown-Forsythe test can be run instead. Alternatively, the Kruskal-Wallis H test can also be used. All these tests are available in SPSS.
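Beyond SPSS, the same analysis can be sketched in Python with SciPy; the achievement scores for the three teaching methods below are invented for illustration.

# Minimal sketch: one-way ANOVA for three hypothetical teaching-method groups,
# together with the checks discussed above.
from scipy import stats

method_a = [78, 82, 75, 80, 79, 77]
method_b = [72, 70, 74, 69, 73, 71]
method_c = [85, 88, 84, 86, 90, 83]

# Levene's test checks the homogeneity-of-variance assumption.
print(stats.levene(method_a, method_b, method_c))

# One-way ANOVA: a small p-value indicates at least one group mean differs.
print(stats.f_oneway(method_a, method_b, method_c))

# Nonparametric alternative (Kruskal-Wallis H) when normality cannot be assumed.
print(stats.kruskal(method_a, method_b, method_c))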
Reference
Fraenkel, J. R., Wallen, N. E., & Hyun, H. H. (2012). How to Design and Evaluate Research in Education. New York: McGraw-Hill.
Pallant, J. (2005). SPSS Survival Manual: A Step by Step Guide to Data Analysis Using SPSS.
What is the Chi-Square Goodness of Fit Test?
The chi-square (χ2) goodness of fit test (commonly referred to as one-sample chi-square) is
the most commonly used goodness of fit test. It explores the proportion of cases that fall
into the various categories of a single variable, and compares these with hypothesized
values. In simple words, we can say that it is used to find out whether the observed value of a given phenomenon is significantly different from the expected value. Or we can also say that it is used to test if sample data fit a distribution from a certain population. In other words, the chi-square goodness of fit test tells us if the sample data represent the data we expect to find in the actual population; it tells us whether the sample data are consistent with a hypothesized distribution. It is a non-parametric test. The setting for this test is a single categorical variable that can have many levels.
In the chi-square goodness of fit test, the sample data are divided into intervals. Then, the numbers of points that fall into the intervals are compared with the expected numbers of points in each interval. The null hypothesis for the chi-square goodness of fit test is that the data come from the specified distribution; the alternate hypothesis is that the data do not come from the specified distribution. The formula for the chi-square goodness of fit test is:
χ2 = Σ [(O − E)2 / E], where O is the observed frequency and E is the expected frequency in each category.
For using the chi-square (χ2) goodness of fit test we have to set up null and alternate hypotheses. The null hypothesis is that there is no significant difference between the observed and expected values. The alternate hypothesis then becomes that there is a significant difference between the observed and the expected values. Now compute the value of chi-square and compare it with the critical value to decide whether to reject the null hypothesis.
a) The chi-square test can only be used on data that have been put into classes (categories). If there are data that have not been put into classes, then it is necessary to make a frequency table or histogram before performing the test.
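The whole procedure can be sketched in Python with SciPy; the counts below are made-up frequencies for 120 rolls of a die, tested against the uniform distribution expected under the null hypothesis.

# Minimal sketch: chi-square goodness of fit on hypothetical category counts.
from scipy import stats

observed = [18, 24, 16, 21, 19, 22]     # observed frequency in each category
expected = [20, 20, 20, 20, 20, 20]     # expected frequency under the null hypothesis

# chisquare computes sum((O - E)**2 / E) and its p-value; a large p-value means
# the observed frequencies are consistent with the expected distribution.
chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p_value)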
Reference
Gay, L. R., Mills, G. E., & Airasian, P. W. (2010). Educational Research: Competencies for Analysis and Applications.