
Course Code : 8614

Course Name : Educational Statistics

Assignment : 2

Semester : Spring 2022

Program: B.Ed

Q. 1 How is mode calculated? Also discuss its merits and demerits.

The mode is the most frequently occurring score in the distribution.

Consider the following data set: 25, 43, 39, 25, 82, 77, 25, 47.

The score 25 occurs most frequently, so it is the mode. Sometimes there is no mode at all, because no value appears more often than any other. A distribution may have one mode (unimodal), two modes (bimodal), three modes (trimodal), or more than three modes (multimodal).

The mode is particularly useful when scores reflect a nominal scale of measurement, but, like the mean and median, it can also be used for ordinal, interval, or ratio data. It can be located graphically by drawing a histogram.

You can calculate the mode in several different ways, including:

By hand
To find the mode manually, arrange the numbers in ascending or descending order, then

count how often each number appears. The number that appears most often is the mode.

Example: 55, 4, 28, 44, 32, 55, 32, 45, 48, 6, 44, 28, 14, 23, 12, 32, 44

In order, these numbers are: 4, 6, 12, 14, 23, 28, 28, 32, 32, 32, 44, 44, 44, 45, 48, 55, 55

It’s now easy to see which numbers appear most often. In this case, the data set is bimodal,

and has two modes: 32 and 44. The number 55 appears twice, but 32 and 44 both appear

three times each.
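As a quick cross-check of the by-hand method, here is a minimal Python sketch (assuming Python 3.8+, whose statistics module provides multimode) that reproduces the result for the example data above.

```python
from statistics import multimode

# Example data set from the text
data = [55, 4, 28, 44, 32, 55, 32, 45, 48, 6, 44, 28, 14, 23, 12, 32, 44]

# multimode returns every value tied for the highest frequency,
# so a bimodal data set yields two values.
modes = multimode(data)
print(modes)  # [44, 32] -- the data set is bimodal: 44 and 32 each appear three times
```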

Using a calculator

You can also use an online calculator with a mode function:

Do an online search for "calculator with mode function."

Input all numbers in your data set following the instructions. You may need to add a comma

between numbers or put each number on a separate line.

Select "calculate."

With Excel

Here are the steps to calculate mode using Excel:

Enter each number of your data set into a separate cell in a single column.

Highlight the column.

Click the "Sort" button on the toolbar.


Select either "Sort A to Z" to sort your numbers from smallest to largest or "Sort Z to A" to

sort from largest to smallest. You can determine the mode manually at this stage.

Select the cell where you want the mode to display. This cannot be a cell that contains a number from your data set.

In the function box, type "=MODE(BEGINNING CELL:ENDING CELL)" using your specific beginning and ending cell labels, for example =MODE(B2:B32).

Press Enter. The mode should appear in the selected cell.

Merits of Mode

i) It is easy to understand and easy to compute.

ii) It is not affected by extreme values.

iii) It can be calculated even when the extreme values are not known, including for open-ended frequency distributions.

iv) In many cases it can be located just by inspection of the data, and it is easy to identify in a discrete frequency distribution.

v) It can be located graphically.

vi) It is applicable to both quantitative and qualitative data.

vii) It is useful for making quick forecasts.

Demerits of Mode

i) It is not rigidly defined.

ii) It is not based upon all values of the given data.

iii) It is not capable of further mathematical treatment.

iv) There will be no mode if no value occurs more than once in the data.

Reference

Anderson, T. W., & Sclove, S. L. (1974). Introductory Statistical Analysis, Finland: Houghton

Mifflin Company.

Q.2 Discuss t-test and its application in educational research?

T-Test

A t-test is a statistical technique used for comparing the mean values of two data sets obtained from two groups. The comparison tells us whether the two data sets differ from each other, how large the difference is, and whether the difference could have happened by chance. The statistical significance of a t-test indicates whether the difference between the means of the two groups most likely reflects a real difference in the populations from which the groups were selected. t-tests are used when there are two groups (e.g., male and female) or two sets of data (e.g., before and after), and the researcher wishes to compare the mean score on some continuous variable.

T-test applications

 The T-test is used to compare the mean of two samples, dependent or independent.

 It can also be used to determine if the sample mean is different from the assumed mean.

 T-test has an application in determining the confidence interval for a sample mean.

The choice among the different t-tests depends on the characteristics of the sample sets. The key items to consider include whether the samples are related or independent, the number of data records in each sample set, and the variance of each sample set.
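For illustration, here is a minimal Python sketch (assuming SciPy is installed; the score lists are hypothetical) of an independent-samples t-test comparing the mean scores of two groups.

```python
from scipy import stats

# Hypothetical continuous scores for two independent groups
group_a = [72, 85, 90, 68, 77, 81, 74, 88]
group_b = [65, 70, 62, 75, 68, 71, 64, 69]

# Independent-samples t-test (assumes roughly equal variances;
# set equal_var=False for Welch's t-test if the variances differ)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# A small p-value (e.g. < 0.05) suggests the difference in means
# is unlikely to have occurred by chance alone.
```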
Reference

Gravetter, F. J., & Wallnau, L. B. (2002). Essentials of Statistics for the Behavioral Sciences

(4th Ed.). Wadsworth, California, USA.

Q.3 Why do we use regression analysis? Where is it applied?

Regression Analysis

Regression analysis is used when you want to predict a continuous dependent variable from

a number of independent variables. If the dependent variable is dichotomous, then logistic

regression should be used. (If the split between the two levels of the dependent variable is

close to 50-50, then both logistic and linear regression will end up giving you similar results.)
The independent variables used in regression can be either continuous or dichotomous.

Independent variables with more than two levels can also be used in regression analyses,

but they first must be converted into variables that have only two levels. This is called

dummy coding and will be discussed later. Usually, regression analysis is used with

naturally-occurring variables, as opposed to experimentally manipulated variables, although

you can use regression with experimentally manipulated variables. One point to keep in

mind with regression analysis is that causal relationships among the variables cannot be

determined. While the terminology is such that we say that X "predicts" Y, we cannot say

that X "causes" Y.

Applying Regression

Number of cases

When doing regression, the cases-to-Independent Variables (IVs) ratio should ideally be

20:1; that is 20 cases for every IV in the model. The lowest your ratio should be is 5:1 (i.e., 5

cases for every IV in the model).

Accuracy of data

If you have entered the data (rather than using an established dataset), it is a good idea to

check the accuracy of the data entry. If you don't want to re-check each data point, you

should at least check the minimum and maximum value for each variable to ensure that all

values for each variable are "valid." For example, a variable that is measured using a 1 to 5

scale should not have a value of 8.
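A quick way to run this minimum/maximum check is sketched below in Python with pandas; the data frame and column names are hypothetical.

```python
import pandas as pd

# Hypothetical survey data; 'satisfaction' is meant to be on a 1-5 scale
df = pd.DataFrame({
    "satisfaction": [3, 5, 2, 8, 4, 1],   # note the out-of-range value 8
    "age": [21, 34, 29, 45, 38, 27],
})

# Minimum and maximum of every variable; out-of-range values stand out
print(df.agg(["min", "max"]))

# Flag rows whose satisfaction value falls outside the valid 1-5 range
invalid = df[(df["satisfaction"] < 1) | (df["satisfaction"] > 5)]
print(invalid)
```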

Missing data

You also want to look for missing data. If specific variables have a lot of missing values, you

may decide not to include those variables in your analyses. If only a few cases have any
missing values, then you might want to delete those cases. If there are missing values for

several cases on different variables, then you probably don't want to delete those cases

(because a lot of your data will be lost). If there is not too much missing data, and there

does not seem to be any pattern in terms of what is missing, then you don't really need to

worry. Just run your regression, and any cases that do not have values for the variables used

in that regression will not be included. Although tempting, do not assume that there is no

pattern; check for this. To do this, separate the dataset into two groups: those cases missing

values for a certain variable, and those not missing a value for that variable. Using t-tests,

you can determine if the two groups differ on other variables included in the sample. For

example, you might find that the cases that are missing values for the "salary" variable are

younger than those cases that have values for salary. You would want to do t-tests for each

variable with a lot of missing values. If there is a systematic difference between the two

groups (i.e., the group missing values vs. the group not missing values), then you would

need to keep this in mind when interpreting your findings and not overgeneralize.
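A minimal sketch of this check in Python (pandas and SciPy assumed; the salary and age values are hypothetical) could look like this: split the cases by whether salary is missing, then compare the two groups on age with a t-test.

```python
import pandas as pd
from scipy import stats

# Hypothetical data frame with some missing salary values
df = pd.DataFrame({
    "salary": [52000, None, 61000, None, 48000, 75000, None, 58000],
    "age":    [41,    24,   38,    27,   45,    52,    23,   36],
})

# Split cases into those missing salary and those with salary present
missing = df[df["salary"].isna()]
present = df[df["salary"].notna()]

# t-test on another variable (age) to see whether the two groups differ
t_stat, p_value = stats.ttest_ind(missing["age"], present["age"])
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A significant result would suggest the missingness is systematic.
```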

After examining your data, you may decide that you want to replace the missing values with

some other value. The easiest thing to use as the replacement value is the mean of this

variable. Some statistics programs have an option within regression where you can replace

the missing value with the mean. Alternatively, you may want to substitute a group mean

(e.g., the mean for females) rather than the overall mean.

The default option of statistics packages is to exclude cases that are missing values for any

variable that is included in regression. (But that case could be included in another

regression, as long as it was not missing values on any of the variables included in that

analysis.) You can change this option so that your regression analysis does not exclude cases
that are missing data for any variable included in the regression, but then you might have a

different number of cases for each variable.

Outliers

You also need to check your data for outliers (i.e., an extreme value on a particular item). An

outlier is often operationally defined as a value that is at least 3 standard deviations above

or below the mean. If you feel that the cases that produced the outliers are not part of the

same "population" as the other cases, then you might just want to delete those cases.

Alternatively, you might want to count those extreme values as "missing," but retain the

case for other variables. Alternatively, you could retain the outlier, but reduce how extreme

it is. Specifically, you might want to recode the value so that it is the highest (or lowest) non-

outlier value.
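A small Python sketch (NumPy assumed; the scores are hypothetical) of the 3-standard-deviation rule and of the recoding strategy described above:

```python
import numpy as np

# Hypothetical scores with one extreme value (95)
scores = np.array([23, 27, 25, 30, 22, 26, 24, 28, 21, 29,
                   25, 26, 23, 27, 24, 28, 26, 22, 25, 95])

mean, sd = scores.mean(), scores.std(ddof=1)

# Operational definition used in the text: more than 3 SD from the mean
is_outlier = np.abs(scores - mean) > 3 * sd
print(scores[is_outlier])  # [95]

# One option: recode the outlier to the highest non-outlier value
highest_valid = scores[~is_outlier].max()
cleaned = np.where(is_outlier, highest_valid, scores)
print(cleaned)
```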

Normality

You also want to check that your data is normally distributed. To do this, you can construct

histograms and "look" at the data to see its distribution. Often the histogram will include a

line that depicts what the shape would look like if the distribution were truly normal (and

you can "eyeball" how much the actual distribution deviates from this line).

You can also construct a normal probability plot. In this plot, the actual scores are ranked

and sorted, and an expected normal value is computed and compared with an actual normal

value for each case. The expected normal value is the position a case with that rank would hold in a normal distribution; the actual value is the position it holds in the observed distribution.

Basically, you would like to see your actual values lining up along the diagonal that goes

from lower left to upper right.
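A brief Python sketch of both checks (NumPy, Matplotlib, and SciPy assumed; the scores are simulated purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical sample of scores to check for normality
rng = np.random.default_rng(1)
scores = rng.normal(loc=50, scale=10, size=200)

# Histogram of the observed distribution
plt.hist(scores, bins=20)
plt.title("Histogram of scores")
plt.show()

# Normal probability (Q-Q) plot: points near the diagonal suggest normality
stats.probplot(scores, dist="norm", plot=plt)
plt.show()
```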


You can also test for normality within the regression analysis by looking at a plot of the

"residuals." Residuals are the difference between obtained and predicted DV scores.

(Residuals will be explained in more detail in a later section.) If the data are normally

distributed, then residuals should be normally distributed around each predicted DV score.

If the data (and the residuals) are normally distributed, the residuals scatterplot will show

the majority of residuals at the center of the plot for each value of the predicted score, with

some residuals trailing off symmetrically from the center. You might want to do the residual

plot before graphing each variable separately because if this residuals plot looks good, then

you don't need to do the separate plots. Consider, for example, a residual plot of a regression in which age of patient and time (in months since diagnosis) are used to predict breast tumor size.
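Since the plot itself is not reproduced here, the following hedged Python sketch (statsmodels and Matplotlib assumed; the patient data are simulated stand-ins, not real measurements) shows how such a residual plot could be produced.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated stand-ins for the predictors and outcome in the example
rng = np.random.default_rng(0)
age = rng.uniform(30, 80, 100)
months = rng.uniform(1, 60, 100)
tumor_size = 0.5 + 0.02 * age + 0.03 * months + rng.normal(0, 0.3, 100)

# Fit the multiple regression: tumor size predicted from age and time
X = sm.add_constant(np.column_stack([age, months]))
model = sm.OLS(tumor_size, X).fit()

# Residuals vs. predicted values: look for a symmetric band around zero
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Predicted tumor size")
plt.ylabel("Residual")
plt.show()
```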

Why Is It Called Regression?

Although there is some debate about the origins of the name, the statistical technique
described above most likely was termed "regression" by Sir Francis Galton in the 19th
century to describe the statistical feature of biological data (such as heights of people in a
population) to regress to some mean level. In other words, while there are shorter and taller
people, only outliers are very tall or short, and most people cluster somewhere around (or
"regress" to) the average.

What Is the Purpose of Regression?

In statistical analysis, regression is used to identify the associations between variables


occurring in some data. It can show both the magnitude of such an association and also
determine its statistical significance (i.e., whether or not the association is likely due to
chance). Regression is a powerful tool for statistical inference and has also been used to try
to predict future outcomes based on past observations.

How Do You Interpret a Regression Model?

A regression model output may take the form Y = 1.0 + 3.2(X1) - 2.0(X2) + 0.21.

Here we have a multiple linear regression that relates some variable Y to two explanatory variables, X1 and X2. We would interpret the model as follows: Y changes by 3.2 units for every one-unit change in X1 (if X1 goes up by 2, Y goes up by 6.4, and so on), holding all else constant (all else equal). That is, controlling for X2, X1 has this observed relationship with Y. Likewise, holding X1 constant, every one-unit increase in X2 is associated with a 2.0-unit decrease in Y. We can also note the y-intercept of 1.0, meaning that Y = 1 when X1 and X2 are both zero. The error term (residual) is 0.21.
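A tiny Python sketch of this arithmetic (purely illustrative, using the coefficients from the example equation above) makes the "holding all else constant" interpretation concrete.

```python
# Predicted Y for the example model Y = 1.0 + 3.2*X1 - 2.0*X2 + 0.21
def predict(x1, x2):
    return 1.0 + 3.2 * x1 - 2.0 * x2 + 0.21

# Holding X2 constant, a one-unit increase in X1 raises Y by 3.2
print(predict(1, 0) - predict(0, 0))  # 3.2 (up to floating-point rounding)

# Holding X1 constant, a one-unit increase in X2 lowers Y by 2.0
print(predict(0, 1) - predict(0, 0))  # -2.0
```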

What Are the Assumptions That Must Hold for Regression Models?

In order to properly interpret the output of a regression model, the following main assumptions about the underlying data-generating process must hold:

 The relationship between the variables is linear
 Homoscedasticity, i.e., the variance of the variables and of the error term remains constant
 All explanatory variables are independent of one another
 All variables are normally distributed

Data scientists for professional sports teams often use linear regression to measure the
effect that different training regimens have on player performance.
For example, data scientists in the NBA might analyze how different amounts of weekly
yoga sessions and weightlifting sessions affect the number of points a player scores. They
might fit a multiple linear regression model using yoga sessions and weightlifting sessions as
the predictor variables and total points scored as the response variable. The regression
model would take the following form:
points scored = β0 + β1(yoga sessions) + β2(weightlifting sessions)
The coefficient β0 would represent the expected points scored for a player who participates
in zero yoga sessions and zero weightlifting sessions.
The coefficient β1 would represent the average change in points scored when weekly yoga
sessions is increased by one, assuming the number of weekly weightlifting sessions remains
unchanged.
The coefficient β2 would represent the average change in points scored when weekly
weightlifting sessions is increased by one, assuming the number of weekly yoga sessions
remains unchanged.
Depending on the values of β1 and β2, the data scientists may recommend that a player
participates in more or less weekly yoga and weightlifting sessions in order to maximize
their points scored.
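A hedged sketch of how such a model could be fit in Python is shown below; it assumes the statsmodels package and uses made-up player data purely for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical player data standing in for the example above
df = pd.DataFrame({
    "points":        [18, 22, 15, 27, 20, 24, 16, 25],
    "yoga":          [1,  2,  0,  3,  2,  3,  1,  2],   # weekly yoga sessions
    "weightlifting": [2,  3,  1,  4,  2,  3,  2,  4],   # weekly lifting sessions
})

# points = beta0 + beta1*(yoga sessions) + beta2*(weightlifting sessions)
model = smf.ols("points ~ yoga + weightlifting", data=df).fit()
print(model.params)    # beta0 (Intercept), beta1 (yoga), beta2 (weightlifting)
print(model.summary())
```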


Reference

Argyrous, G. (2012). Statistics for Research, with a guide to SPSS. India: SAGE Publications.

Q.4 Explain assumptions of applying One-way ANOVA and its

procedure?

ANOVA

The t-test has one very serious limitation – it is restricted to testing the significance of the difference between only two groups. There are many times when we would like to see whether there are significant differences among three, four, or even more groups. For example, we may want to investigate which of three teaching methods is best for teaching ninth-class algebra. In such a case, we cannot use a t-test because more than two groups are involved. One of the most useful techniques in statistics for dealing with such cases is analysis of variance (abbreviated as ANOVA). This technique was developed by the British statistician Ronald A. Fisher.

Analysis of variance (ANOVA) is a hypothesis-testing procedure that is used to evaluate mean differences between two or more treatments (or populations). Like all other inferential procedures, ANOVA uses sample data as a basis for drawing general conclusions about populations. Sometimes it may appear that ANOVA and the t-test are two different ways of doing exactly the same thing: testing for mean differences. In some cases this is true – both tests use sample data to test hypotheses about population means. However, ANOVA has important advantages over the t-test. t-tests are used when we have to compare only two groups (one independent and one dependent variable). ANOVA, on the other hand, is used when we have two or more than two treatment conditions. Suppose we want to study the effects of three different methods of teaching on the achievement of students. In this case we have three different samples to be treated using three different treatments, so ANOVA is the suitable technique to evaluate the differences.

Consider, for example, data from an independent-measures experiment comparing learning performance under three temperature conditions. The scores vary, and we want to measure the amount of variability (i.e., the size of the differences) in order to explain where it comes from. To measure the total variability, we combine all the scores from all the separate samples into one group and then obtain one general measure of variability for the complete experiment. Once we have measured the total variability, we can begin to break it into separate components. The word analysis means breaking something into smaller parts. Because we are going to analyze the variability, the process is called analysis of variance (ANOVA). This analysis divides the total variability into two basic components:
i) Between-Treatment Variance

Variance simply means difference, and calculating the variance is a process of measuring how big the differences are for a set of numbers. The between-treatment variance measures how much difference exists between the treatment conditions. Beyond simply measuring these differences, the overall goal of ANOVA is to evaluate them. Specifically, the purpose of the analysis is to distinguish between two alternative explanations:

a) The differences between the treatments have been caused by the treatment effects.

b) The differences between the treatments are simply due to chance.

Thus, there are always two possible explanations for the variance (difference) that exists between treatments:

1) Treatment Effect:

The differences are caused by the treatments. For example, the scores in sample 1 are obtained at a room temperature of 50° and those of sample 2 at 70°. It is possible that the difference between the samples is caused by the difference in room temperature.

2) Chance:

The differences are simply due to chance. Even if there is no treatment effect, we can still expect some difference between samples. Chance differences are unplanned and unpredictable differences that are not caused or explained by any action of the researcher. Researchers commonly identify two primary sources of chance differences.

Individual Differences

Each participant in the study has his or her own individual characteristics. Although it is reasonable to expect that different subjects will produce different scores, it is impossible to predict exactly what the differences will be.

Experimental Error

In any measurement there is a chance of some degree of error. Thus, if a researcher measures the same individuals twice under the same conditions, there is a good chance of obtaining two different measurements. These differences are unplanned and unpredictable, so they are attributed to chance.

Thus, when we calculate the between-treatment variance, we are measuring differences that could be due either to a treatment effect or simply to chance. In order to demonstrate that the difference is really a treatment effect, we must establish that the differences between treatments are bigger than would be expected by chance alone. To accomplish this, we determine how big the differences are when there is no treatment effect involved; that is, we measure how much difference (variance) occurs by chance. To measure chance differences, we compute the variance within treatments.
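As an illustrative sketch (hypothetical scores, assuming SciPy is installed), a one-way ANOVA for three temperature conditions could be run as follows.

```python
from scipy import stats

# Hypothetical learning scores under three temperature conditions
temp_50 = [4, 3, 6, 3, 4]
temp_70 = [9, 8, 7, 9, 8]
temp_90 = [1, 2, 2, 0, 1]

# One-way ANOVA: F compares between-treatment variance to within-treatment variance
f_stat, p_value = stats.f_oneway(temp_50, temp_70, temp_90)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A large F (small p) suggests the differences are not due to chance alone.
```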

Assumptions Underlying the One Way ANOVA

There are three main assumptions

i) Assumption of Independence

According to this assumption the observations are random and independent samples from

the populations. The null hypothesis actually states that the samples come from populations

that have the same mean. The samples must be random and independent if they are to be

representative of the populations. The value of one observation is not related to any other
observation. In other words, one individual's score should not provide any clue as to how any other individual will score. That is, one event does not depend on another.

Violation of the independence assumption has the most serious consequences: if this assumption is violated, one-way ANOVA is an inappropriate statistic.

ii) Assumption of Normality

The distributions of the populations from which the samples are selected are normal. This assumption implies that the dependent variable is normally distributed in each of the groups. One-way ANOVA is considered robust against violations of the normality assumption and tolerates them fairly well. As regards the normality of grouped data, one-way ANOVA can tolerate data that are non-normal (skewed or kurtotic distributions) with only a small effect on the Type I error rate. However, platykurtosis can have a profound effect when group sizes are small. This leaves a researcher with two options:

i) Transform the data using various algorithms so that the shape of the distribution becomes normal, or

ii) Choose the nonparametric Kruskal-Wallis H test, which does not require the assumption of normality. (This test is available in SPSS.)

iii) Assumption of Homogeneity of Variance

The variances of the distributions in the populations are equal. This assumption implies that the distributions in the populations have the same shapes, means, and variances; that is, they are the same populations. In other words, the variances on the dependent variable are equal across the groups. If the assumption of homogeneity of variances has been violated, then two possible tests can be run:

i) the Welch test, or

ii) the Brown-Forsythe test.

Alternatively, the Kruskal-Wallis H test can also be used. All of these tests are available in SPSS.

Reference

Fraenkel, J. R., Wallen, N. E., & Hyun, H. H. (2012). How to Design and Evaluate Research in Education (8th Ed.). McGraw-Hill, New York.

Pallant, J. (2005). SPSS Survival Manual – A Step by Step Guide to Data Analysis Using SPSS

for Windows (Version 12). Australia: Allen & Unwin.

Q.5 Explain rationale and procedure of Chi-Square Goodness-of-Fit

Test?

Chi-Square (χ2) Goodness-of-Fit Test

The chi-square (χ2) goodness-of-fit test (commonly referred to as the one-sample chi-square) is the most commonly used goodness-of-fit test. It explores the proportion of cases that fall into the various categories of a single variable and compares these with hypothesized values. In simple words, it is used to find out whether the observed values of a given phenomenon differ significantly from the expected values. It can also be described as a test of whether sample data fit a distribution from a certain population: the chi-square goodness-of-fit test tells us whether the sample data represent the data we would expect to find in the actual population, i.e., whether the sample data are consistent with a hypothesized distribution. It is a variation of the more general chi-square test. The setting for this test is a single categorical variable that can have many levels.

In the chi-square goodness-of-fit test, the sample data are divided into categories (or intervals). Then the numbers of observations that fall into each category are compared with the expected numbers of observations in that category. The null hypothesis for the chi-square goodness-of-fit test is that the data come from the specified distribution; the alternative hypothesis is that the data do not come from the specified distribution. The formula for the chi-square goodness-of-fit test is:

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is the observed frequency and Eᵢ is the expected frequency in category i.

Procedure for Chi-Square (χ2) Goodness of Fit Test

To use the chi-square (χ2) goodness-of-fit test, we first have to set up the null and alternative hypotheses. The null hypothesis assumes that there is no significant difference between the observed and the expected values; the alternative hypothesis then becomes that there is a significant difference between the observed and the expected values. Next, compute the value of the chi-square statistic using the formula given above, and compare it with the critical value of χ2 for the appropriate degrees of freedom.

Two potential disadvantages of chi-square are:

a) The chi-square test can only be used on data that have been put into classes. If there are data that have not been put into classes, it is necessary to make a frequency table or histogram before performing the test.

b) It requires a sufficient sample size in order for the chi-square approximation to be valid.
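The following short Python sketch (assuming SciPy; the category counts are hypothetical) illustrates the computation for a goodness-of-fit test against a uniform distribution.

```python
from scipy import stats

# Hypothetical example: are the four answer categories equally popular?
observed = [43, 52, 54, 51]    # observed counts in each category
expected = [50, 50, 50, 50]    # counts expected under the null hypothesis

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")

# A large p-value means we fail to reject the null hypothesis that the
# sample follows the hypothesized (here, uniform) distribution.
```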

Reference

Gay, L. R., Mills, G. E., & Airasian, P. W. (2010). Educational Research: Competencies for

Analysis and Application, 10th Edition. Pearson, New York USA
