
Statistics notes

Central tendency:

Central tendency is the tendency of the scores in a distribution to
gravitate towards a central point. That point may be considered a
representative point, representing all the scores in the distribution. The
scores in a distribution mostly concentrate around the central point, and
this tendency to concentrate is called the central tendency. There are
three measures of central tendency, i.e. mean, median and mode.

Mean:
The mean is a magnitude-based descriptive statistic which serves as a
representative score for the whole distribution. Two distinct properties of
the mean are:
○ Representativeness: It should represent all the scores in a
distribution. Thus it must be efficient and reliable, and should
take every magnitude into account.
○ Minimum variability: The representative score, or mean, should
lie at the position of minimum variability.

Calculation of mean

Long method: X̄ = Σfx/n
(x = midpoint of the class interval)

Short method: X̄ = AM + Ci
(C = Σfx'/n) (AM = assumed mean) (i = class-interval size)
(x' = rank, or assumed deviation, from the AM class)
(n = total no. of frequencies)
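As a minimal sketch of both methods in Python (the three classes, their midpoints and frequencies below are made-up illustrative data):

```python
# Illustrative distribution: classes 10-19, 20-29, 30-39 with
# midpoints 14.5, 24.5, 34.5 and frequencies 2, 5, 3.

def grouped_mean(midpoints, freqs):
    # Long method: X-bar = Sigma(f*x) / n, x = class midpoint.
    n = sum(freqs)
    return sum(f * x for x, f in zip(midpoints, freqs)) / n

def grouped_mean_short(deviations, freqs, assumed_mean, interval):
    # Short method: X-bar = AM + C*i, C = Sigma(f*x') / n, where x' is
    # the assumed deviation (rank) of each class from the AM class.
    n = sum(freqs)
    c = sum(f * xp for xp, f in zip(deviations, freqs)) / n
    return assumed_mean + c * interval
```

Both methods give the same mean; the short method only shifts the arithmetic to small deviations around the assumed mean.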

10/08
Median:
When ungrouped scores or other measures are arranged in order, the
median is the midpoint of the series. When scores in a continuous series
are grouped into a frequency distribution, the median is by definition the
50% point of the distribution.

What are the uses of median?

1. When the exact midpoint of the distribution is wanted.
2. When there are extreme scores which would markedly affect the
mean.
3. When it is desired that certain scores should influence the central
tendency, but all that is known about them is whether they lie above or
below the median.

Calculation of median:

Ungrouped data, odd N: the ((N+1)/2)th score in the ordered series.

Ungrouped data, even N: the average of the (N/2)th and ((N/2)+1)th
scores.

Grouped data: Median = L + {(N/2 - F)/fm} × i

L = exact lower limit of the class interval that contains the median.
N = total number of scores (frequencies).
F = cumulative frequency up to (but not including) the class interval
that contains the median.
fm = frequency of the class interval in which the median lies.
i = size of the class interval.
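The grouped-data formula can be sketched in Python; the class limits and frequencies in the test are made-up illustrative data:

```python
def grouped_median(lower_limits, freqs, interval):
    # Median = L + ((N/2 - F) / fm) * i, walking the cumulative
    # frequency until the class containing the N/2 point is reached.
    n = sum(freqs)
    cum = 0
    for low, fm in zip(lower_limits, freqs):
        if cum + fm >= n / 2:
            return low + ((n / 2 - cum) / fm) * interval
        cum += fm
```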

Mode:

In a simple ungrouped series, the crude or empirical mode is the single
measure which occurs most frequently. When data are grouped into a
frequency distribution, the mode is taken to be the midpoint of the
interval which contains the largest frequency.
Uses of mode:

1. When a quick and approximate measure of central tendency is all
that is wanted.
2. When the measure of central tendency should be the most typical
value.
3. When the description of the position of the magnitude can't fulfil the
demand of the research.

Calculation of mode:

Ungrouped mode = the single measure which occurs most often.

Grouped mode = 3 × median - 2 × mean
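Both rules can be coded in a few lines (illustrative data only):

```python
from collections import Counter

def crude_mode(scores):
    # Ungrouped: the single measure that occurs most frequently.
    counts = Counter(scores)
    return max(counts, key=counts.get)

def grouped_mode(median, mean):
    # Empirical grouped mode: 3*median - 2*mean.
    return 3 * median - 2 * mean
```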

Variability:

Variability indicates the scatter or spread of the separate scores around
their central tendency, i.e. the extent of individual differences among the
scores. It states how the different scores vary from each other. The
statistical measures of variability estimate and express numerically the
deviations of the individual scores of a sample from a given central value
such as the mean or median.

Measures of variability.
There are four measures which indicate variability or dispersion within the
set of measures
1. Range
2. Q or Quartile deviation
3. Average deviation or AD
4. Standard deviation or SD

Range:
Range may be simply defined as the interval between the highest and
lowest score. It is the most general measure of spread or scatter. Range is
used
1. When the data are too scattered to justify the computation of a more
precise measure of variability.
2. When knowledge of the extreme scores or of the total spread is all
that is wanted.
Range = Highest score - Lowest score

Quartile deviation or Q:

Quartile deviation or Q is the semi-interquartile range. The interquartile
range indicates the spread of the middle 50% of the scores, which lie
between the first and the third quartiles. Q is one half of the range of the
middle 50% of the scores.

Q1 = L + {(N/4 - F)/fm} × i
L = exact lower limit of the class interval that contains the N/4 point.
N = total number of scores (frequencies).
F = cumulative frequency up to (but not including) the class interval
that contains the N/4 point.
fm = frequency of the class interval in which the N/4 point lies.
i = size of the class interval.

Q3 = L + {(3N/4 - F)/fm} × i
L = exact lower limit of the class interval that contains the 3N/4 point.
N = total number of scores (frequencies).
F = cumulative frequency up to (but not including) the class interval
that contains the 3N/4 point.
fm = frequency of the class interval in which the 3N/4 point lies.
i = size of the class interval.

Q = (Q3 - Q1)/2
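A sketch of Q1, Q3 and Q from a grouped distribution (the class limits and frequencies in the test are illustrative):

```python
def quartile_point(lower_limits, freqs, interval, k):
    # Qk = L + ((k*N/4 - F) / fm) * i  (k = 1 for Q1, k = 3 for Q3).
    n = sum(freqs)
    target = k * n / 4
    cum = 0
    for low, fm in zip(lower_limits, freqs):
        if cum + fm >= target:
            return low + ((target - cum) / fm) * interval
        cum += fm

def quartile_deviation(lower_limits, freqs, interval):
    # Q = (Q3 - Q1) / 2, the semi-interquartile range.
    q1 = quartile_point(lower_limits, freqs, interval, 1)
    q3 = quartile_point(lower_limits, freqs, interval, 3)
    return (q3 - q1) / 2
```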

Average deviation (AD):

Average deviation is the mean of the deviations of all the scores in a
series taken from the mean. In averaging the deviations no account is
taken of signs, and all deviations, whether positive or negative, are
treated as positive.

Use:
1. When it is desired to weight all deviations from the mean equally.
2. When extreme deviations would influence the SD unduly.

Calculation of AD

AD = Σ|x|/N  (x = X - mean)
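The same in code, for a small illustrative series:

```python
def average_deviation(scores):
    # AD = Sigma|X - mean| / N, all deviations treated as positive.
    n = len(scores)
    m = sum(scores) / n
    return sum(abs(x - m) for x in scores) / n
```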

Standard deviation (SD):

Standard deviation or SD is the most stable measure of variability and is
the one employed in experimental and research work. The SD differs from
the AD in several respects. In computing the AD we disregard the signs,
whereas in finding the SD we avoid the difficulty of signs by squaring the
separate deviations. Again, the squared deviations used in computing the
SD are always taken from the mean, never from the median or mode.

Uses:

1. When the statistic having the greatest stability is wanted.
2. When extreme deviations should exercise a proportionately greater
effect upon the variability.
3. When coefficients of correlation and other higher statistics are
subsequently to be computed.

Standard deviation calculation:

SD = √(ΣX²/N - (Mean)²)

Standard deviation, grouped-data calculation:

SD = i × √(Σf(x')²/N - C²)

i = class-interval size
f = frequency
x' = assumed deviation
C = Σfx'/N
N = total no. of frequencies
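Both SD formulas can be sketched as follows (the scores, deviations and frequencies in the tests are illustrative):

```python
import math

def sd_ungrouped(scores):
    # SD = sqrt(Sigma(X^2)/N - mean^2).
    n = len(scores)
    m = sum(scores) / n
    return math.sqrt(sum(x * x for x in scores) / n - m * m)

def sd_grouped(deviations, freqs, interval):
    # SD = i * sqrt(Sigma(f*x'^2)/N - C^2), C = Sigma(f*x')/N.
    n = sum(freqs)
    c = sum(f * xp for xp, f in zip(deviations, freqs)) / n
    mean_sq = sum(f * xp * xp for xp, f in zip(deviations, freqs)) / n
    return interval * math.sqrt(mean_sq - c * c)
```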

Correlation:

Correlation is a kind of descriptive statistics used for comparing the
covariation of variables. Two lines of movement can be compared with
the help of correlation. It involves more than one distribution, i.e. a
bivariate distribution. Correlation determines the relationship between
the two lines of movement of the two variables in a bivariate
distribution.

The coefficient of correlation indicates the magnitude as well as the
direction of the association between the two lines of movement of the
variables. It should be noted that correlation never indicates a cause-and-
effect relationship; it only states whether two variables are related or not.
It indicates the relationship basically to help the researcher infer the
values of one variable from another. The coefficient of correlation is a
unit-free statistic. If the significance of its magnitude is to be checked, it
should be checked against the tables prepared by statisticians.

Properties of correlation:

1. The coefficient of correlation indicates the extent of the relationship.
Its magnitude may vary from 0 to 1: 1 indicates a perfect
relationship, whereas 0 indicates no relationship.
2. The coefficient of correlation also indicates whether the relationship
is positive or negative. Thus the second property is direction.
3. The third property is that the coefficient is a unit-free statistic.

Product moment correlation:

It is a simple linear correlation which was formulated by the British
statistician Karl Pearson. This correlation is a measure of the magnitude
and direction of the relation between two variables when their
relationship can be described by a straight line.

Assumptions of product moment correlation:

1. Product moment correlation assumes that the population distributions
of the variables in hand are normal. It further assumes that the
variables are continuous.
2. The scores or items in the sample are selected at random.
3. This correlation assumes homoscedasticity (the homogeneity of the
variances of the rows and columns of the bivariate distribution).

Formula:

r = {NΣXY - (ΣX)(ΣY)} / √[{NΣX² - (ΣX)²} × {NΣY² - (ΣY)²}]
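The raw-score formula, r = (NΣXY − ΣXΣY)/√[(NΣX² − (ΣX)²)(NΣY² − (ΣY)²)], translates directly into code:

```python
def pearson_r(xs, ys):
    # r = (N*SigmaXY - SigmaX*SigmaY) /
    #     sqrt((N*SigmaX^2 - (SigmaX)^2) * (N*SigmaY^2 - (SigmaY)^2))
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = ((n * sxx - sx * sx) * (n * syy - sy * sy)) ** 0.5
    return num / den
```

A perfectly linear increasing pair gives r = +1; a perfectly reversed pair gives r = −1.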

Degrees of freedom

In statistics the degrees of freedom (df) is a measure of the no. of
independent pieces of information on which the precision of a parameter
estimate is based. It can also be thought of as the no. of observations
which are free to vary. It is an estimate of the no. of independent
categories in a particular statistical test or experiment. For correlation,

df = N - 2

Rank order correlation (Spearman's rho):

When the researcher uses non-parametric statistics, the researcher is not
interested in making inferences about the population. Rank order
correlation is a non-parametric statistic; thus its power function is lower
than that of parametric statistics, as it does not use the parameters. When
N is small, the rank-difference method will give as adequate a result as
that obtained by finding r. We will use rho when

1. The assumptions regarding normality are doubtful.
2. N is very small.
3. Only ranks are available.

Rules for ranking:

1. Rank 1 is generally given to the highest score, and the last rank
to the lowest score.
2. If two values in one set of scores are the same, the two
consecutive ranks they would occupy are added and then
divided by 2, and the obtained value is placed beside both
scores as their rank.

Calculation:

Formula: ρ = 1 - 6ΣD² / {N(N² - 1)}

D = R1 - R2
(R = rank of the score on each variable)

N = total no. of pairs
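A sketch of the rank-difference method, including the tie rule above (tied scores share the mean of the consecutive ranks they would occupy):

```python
def rank_scores(scores):
    # Rank 1 goes to the highest score; ties share the average of
    # the consecutive ranks they would occupy.
    ordered = sorted(scores, reverse=True)
    result = []
    for s in scores:
        positions = [i + 1 for i, v in enumerate(ordered) if v == s]
        result.append(sum(positions) / len(positions))
    return result

def spearman_rho(xs, ys):
    # rho = 1 - 6*Sigma(D^2) / (N*(N^2 - 1)), D = difference of paired ranks.
    rx, ry = rank_scores(xs), rank_scores(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```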

Biserial Correlation:

Biserial correlation is a specialised form of product moment correlation (r)
used for simple linear correlation between a continuous measurement
variable and either an apparently dichotomous variable or an artificially
dichotomised variable, formed by dissecting the experimentally obtained
data of a continuous variable at a point near the median of the latter.
Biserial correlation may be computed for correlating either of the above
two types of dichotomous variable with a continuous measurement
variable.

The assumptions for Biserial Correlation are

1. One of the variables is a continuous measurement variable.


2. The other variable is apparently or artificially dichotomised.
3. If more information would be available the dichotomised variable
would be continuous and normally distributed.
4. The continuous measurement variable involves normal or near
normal populations without much skewness.
5. The dichotomised variable has been dichotomised at a point not far
from the median of the continuous distribution of its scores.
6. In each class of the dichotomised variable every score of the
continuous variable occurs at random and independent of all other
scores.
7. There is a linear relationship between the variables.

Calculation:

r_bis = {(Mp - Mq)/St} × (pq/y)

Mp = mean of the higher group on the dichotomised variable (the
positive-character group)

Mq = mean of the lower group on the dichotomised variable

p = proportion of cases in the higher group

q = proportion of cases in the lower group

y = ordinate (height) of the normal curve at the point dividing the two
parts p and q

St = standard deviation of the total group

Point biserial:
The point biserial correlation coefficient, rpbi, is a special case of Pearson's
correlation coefficient. It measures the relationship between two variables:
one continuous measurement variable and
one naturally dichotomous variable. If a variable is artificially dichotomised,
the new dichotomous variable may be conceptualised as having an
underlying continuity. If this is the case, a biserial correlation would be the
more appropriate calculation.

Assumptions:

● One of the variables is a continuous measurement variable, while the
other is a genuinely dichotomous variable which cannot yield a
continuous or normal distribution even on further exploration.
● The continuous measurement variable involved has a normal or near-
normal distribution in the population, without much skewness.
● The scores of each variable occur at random and independent of all
other scores.
● There is a linear relationship between the variables.

Calculation:

r_pbis = {(Mp - Mq)/St} × √(pq)
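A sketch of r_pbis = ((Mp − Mq)/St)·√(pq), using the same made-up groups as illustration:

```python
def point_biserial_r(high_scores, low_scores):
    # r_pbis = ((Mp - Mq)/St) * sqrt(p*q); St is the whole-group SD.
    total = high_scores + low_scores
    n = len(total)
    p = len(high_scores) / n
    q = 1 - p
    m = sum(total) / n
    st = (sum(x * x for x in total) / n - m * m) ** 0.5
    mp = sum(high_scores) / len(high_scores)
    mq = sum(low_scores) / len(low_scores)
    return ((mp - mq) / st) * (p * q) ** 0.5
```

Note that for the same data r_pbis is smaller in magnitude than r_bis, since √(pq) < pq/y.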

Tetrachoric correlation:

tetra choric correlation is used when both the variables are dichotomous.
Tetrachoric correlation is especially useful when we wish to find the relation
between two characters of attributes neither of which are measurable in
scores but both of which are capable of being separated into two
categories. Assumptions of tetrachoric correlation are
1. The variables either have continuous metric data which have been
artificially dichotomized and may yield continuous score on further
exploration.
2. Such continuous series of scores of the dichotomous variable forms
unimodal and normal distribution in the population.
3. The point of dichotomy is close to the median of each variable so that
neither of the proportions of each classes is far from .80
4. There exists a linear relationship between the continuous scores of
the variables.

Calculation:

The fourfold table, with X as the top variable and Y as the left-hand
variable:

            X
          +     -
Y   +     A     B
    -     C     D

IF,

AD > BC (positive correlation):

rt = cos(180° × √BC / (√AD + √BC))

AD < BC (negative correlation):

rt = cos(180° × √AD / (√AD + √BC)), the coefficient taking a negative sign

Check values of cos in Table J (Garrett)

OR
IF,

AD > BC: compute AD/BC, then check the value in Table K for rt

AD < BC: compute BC/AD, then check the value in Table K for rt

Here, rt = tetrachoric coefficient value
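A sketch of the cosine approximation rt = cos(180°·√BC/(√AD + √BC)). Written this way, a single formula suffices: when AD < BC the angle exceeds 90° and the cosine comes out negative automatically, matching the separate negative-case formula:

```python
import math

def tetrachoric_r(a, b, c, d):
    # Cosine approximation: rt = cos(180 deg * sqrt(BC) / (sqrt(AD) + sqrt(BC))).
    # AD > BC gives an angle below 90 deg (rt positive); AD < BC gives an
    # angle above 90 deg (rt negative).
    sad = math.sqrt(a * d)
    sbc = math.sqrt(b * c)
    angle = math.pi * sbc / (sad + sbc)
    return math.cos(angle)
```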

Assumptions of the phi coefficient:

1. Both variables are genuinely dichotomous.
2. Each of the variables has two classes with a genuine intervening
gap.
3. The genuinely dichotomous variables cannot be thought of as
representing underlying normal distributions.

Calculation:

Φ = (AD - BC) / √{(A+B)(C+D)(B+D)(A+C)}

Check significance:

χ² = NΦ²; check the value against df in Table E (Garrett)

df = (r-1)(c-1)
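The phi formula and its chi-square significance check, Φ = (AD − BC)/√[(A+B)(C+D)(B+D)(A+C)] and χ² = NΦ², can be sketched as:

```python
def phi_coefficient(a, b, c, d):
    # Phi = (AD - BC) / sqrt((A+B)(C+D)(B+D)(A+C)) for a fourfold table.
    num = a * d - b * c
    den = ((a + b) * (c + d) * (b + d) * (a + c)) ** 0.5
    return num / den

def phi_chi_square(phi, n):
    # Significance check: chi^2 = N * Phi^2, df = (r-1)(c-1) = 1 for 2x2.
    return n * phi * phi
```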

Contingency Correlation:

Assumptions of contingency coefficient:


1. Each individual or case occurs in the sample at random and
independent of all others.
2. Both the variables under study can be classified into many
categories.
3. There are no assumptions about the genuineness of the discrete or
dichotomous nature of the distributions of the variables.

Calculation:

C = √{(S - N)/S}

S = Σ(fo²/fe)

fe = (row total × column total)/N, for the particular cell

fe = expected frequency
fo = observed frequency
N = total sample
C = contingency coefficient

Check significance:

χ² = NC²/(1 - C²); check the value against df in Table E (Garrett)

df = (r-1)(c-1)
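A sketch of C = √((S − N)/S) with S = Σfo²/fe and fe taken from the row and column totals; the tables in the tests are illustrative:

```python
def contingency_c(table):
    # C = sqrt((S - N)/S), S = Sigma(fo^2/fe),
    # fe = row_total * col_total / N for each cell.
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    s = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_tot[i] * col_tot[j] / n
            s += fo * fo / fe
    return ((s - n) / s) ** 0.5
```

A table of fully independent frequencies gives C = 0.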

T-test:

Assumptions of the t-test:

1. The two population distributions from which the samples are
drawn are assumed to be normal.
2. The sampling distributions of the statistics M1 and M2, and of
their difference, are assumed to be normal.
3. It is assumed that the two N's are not significantly different.
4. It is assumed that the variances of the two distributions are
homogeneous.

Formulas:
Trick to identify whether groups are correlated:

1. If a correlation value is given, the groups are correlated.
2. Large, different groups (equal or unequal N) = uncorrelated.
3. Large, same group under different tests = correlated.
4. Small, equal, paired = correlated.
5. Small, different groups (equal or unequal N) = uncorrelated.

Large: N >= 25

1. Large, unequal, uncorrelated:

t = (M1 - M2) / √(σ1²/N1 + σ2²/N2)

M1 = mean of the first group
M2 = mean of the second group
σ1 = standard deviation of the first group
σ2 = standard deviation of the second group
N1 = total sample of the first group
N2 = total sample of the second group
df = N1 + N2 - 2

2. Large, equal, uncorrelated:

t = (M1 - M2) / √{(σ1² + σ2²)/N}

M1 = mean of the first group
M2 = mean of the second group
σ1 = standard deviation of the first group
σ2 = standard deviation of the second group
N = total sample of either group
df = 2(N - 1)

3. Large, equal, correlated (sample size is always the same):

t = (M1 - M2) / √(σM1² + σM2² - 2 × r12 × σM1 × σM2)

or, since σM = σ/√N,

t = (M1 - M2) / √(σ1²/N + σ2²/N - 2 × r12 × (σ1/√N)(σ2/√N))

M1 = mean of the first group
M2 = mean of the second group
σ1 = standard deviation of the first group
σ2 = standard deviation of the second group
σM1, σM2 = standard errors of the two means
r12 = correlation between the two sets of scores
df = N - 1

Small: N < 25

4. Small, equal, uncorrelated:

t = (M1 - M2) / √{(Σx1² + Σx2²)/N(N-1)}

M1 = mean of the first group
M2 = mean of the second group
x1 = score - mean, in the first group
x2 = score - mean, in the second group
df = 2(N - 1)

5. Small, unequal, uncorrelated:

t = (M1 - M2) / √{((Σx1² + Σx2²)/(N1 + N2 - 2)) × (1/N1 + 1/N2)}

M1 = mean of the first group
M2 = mean of the second group
x1 = score - mean, in the first group
x2 = score - mean, in the second group
df = N1 + N2 - 2

6. Small, equal, paired (correlated):

t = D̄ / S_D̄

S_D̄ = √{Σd²/N(N-1)}

d = D - D̄

D̄ = ΣD / total no. of pairs

D = difference between the paired scores of the two groups

df = no. of pairs - 1
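Cases 5 and 6 can be sketched as follows; the score lists in the tests are made-up illustrative data:

```python
def t_independent_small(g1, g2):
    # Small, unequal, uncorrelated:
    # t = (M1 - M2) / sqrt(((Sx1^2 + Sx2^2)/(N1+N2-2)) * (1/N1 + 1/N2)),
    # x = deviation of each score from its group mean; df = N1+N2-2.
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss1 = sum((x - m1) ** 2 for x in g1)
    ss2 = sum((x - m2) ** 2 for x in g2)
    se = (((ss1 + ss2) / (n1 + n2 - 2)) * (1 / n1 + 1 / n2)) ** 0.5
    return (m1 - m2) / se, n1 + n2 - 2

def t_paired(pre, post):
    # Small, equal, paired: t = D-bar / S_D-bar,
    # S_D-bar = sqrt(Sigma(d^2)/(N(N-1))), d = D - D-bar; df = pairs - 1.
    n = len(pre)
    diffs = [a - b for a, b in zip(pre, post)]
    dbar = sum(diffs) / n
    sd = (sum((d - dbar) ** 2 for d in diffs) / (n * (n - 1))) ** 0.5
    return dbar / sd, n - 1
```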

Normal probability curve:

A normal distribution is an arrangement of scores in a definite fashion.
Probability is defined as the ratio between the desired no. of events and
the total no. of events; moreover, the probability of a given event can be
defined as the expected frequency of occurrence of this event among
events of a like sort. Thus the probability of an event may be stated
mathematically as:

Probability = desired no. of events / total no. of events

Thus, to sum up:

1. The probability of an event is the ratio of the no. of favourable cases
to the total no. of cases.
2. The probability of an event is a positive quantity.
3. The probability of an event which cannot occur is zero.
4. The sum of the probabilities of the happening and non-happening of
an event is always equal to 1.
5. The probability of an event which is sure to happen is 1, and therefore
the probability of the happening of an event ranges between 0 and 1.

The two main concepts behind the normal probability curve are:

1. The concept of probability
2. The concept of normal distribution

Therefore the normal probability distribution is a frequency distribution of
quantities whose magnitudes are continuously variable. The graphical
expression of the normal distribution is called the normal curve. It is a
bell-shaped, unimodal and bilaterally symmetrical curve.

In a NPC, "Z" is the unit of the baseline. There are different scores on the
baseline, and therefore the baseline must be framed in a definite score
which can reflect all the scores. If all the scores are linearly transformed
into a particular score, then all the scores can be kept in the distribution.
"Z" is the score obtained from this linear transformation. It is a reference
point for comparing different types of scores. Thus "Z" can be defined as a
standard score which provides a reference point for comparing different
scores coming from different origins.

Z = (Score - Mean) / SD
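The transformation Z = (X − M)/SD in code:

```python
def z_score(score, mean, sd):
    # Z = (X - M) / SD: a standard score for comparing values coming
    # from distributions with different origins and units.
    return (score - mean) / sd
```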

Properties of the NPC:

1. The NPC is determined by its population mean and population
SD.
2. The mean, median and mode of the NPC occur at the same point.
3. The NPC attains its highest ordinate (y = .3989) where X (score) = M
(mean); at that point Z = 0. As the curve moves away from the point
where X = M, in either direction, the height of the ordinate decreases.
4. The NPC is asymptotic, that is, the two tails are extended to infinity.
The NPC extends from negative infinity to positive infinity.
5. The skewness of the NPC is zero.
6. The NPC is mesokurtic (neither platykurtic nor leptokurtic).
7. At its two inflection points the NPC changes its curvature, giving it
the bell shape.
8. In a NPC, Q is generally called the probable error or PE. The
relationship between PE and SD in the NPC is fixed and constant: in
the NPC, PE = .6745 σ and σ = 1.4826 PE.
9. The areas under the NPC are fixed: M ± 1σ covers 68.26% of the
cases, M ± 2σ covers 95.44%, and M ± 3σ covers 99.73%.
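The fixed areas can be checked numerically from the normal CDF, Φ(z) = ½(1 + erf(z/√2)); a small sketch:

```python
import math

def area_between(z1, z2):
    # Proportion of the NPC area lying between two Z values.
    cdf = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return cdf(z2) - cdf(z1)
```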
Chi square:
A chi-square (χ²) statistic is a test that measures how a model compares
to actual observed data. The chi-square statistic compares the size of any
discrepancies between the expected results and the actual results, given
the size of the sample and the number of variables in the relationship.

Yates' Correction:
While calculating chi-square, if one or more of the expected frequencies
are found to be very small, a Yates' correction is to be done. This is a
correction for continuity. In a 2×2, that is fourfold, table a Yates'
correction is done if fe (expected frequency) is less than 10. However, if
the table is 2×3 or larger, we can tolerate fe down to 5.

Yates' correction formula:

Subtract .5 from |fo - fe| in each cell, if fe is less than 10 in a 2×2 table
or if fe is less than 5 in a 2×3 or larger table.
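A sketch of χ² for a fourfold table, with the correction applied by reducing each |fo − fe| by .5 (the cell counts in the tests are illustrative):

```python
def chi_square_2x2(a, b, c, d, yates=False):
    # chi^2 = Sigma((fo - fe)^2 / fe), fe from the marginal totals;
    # with Yates' correction, |fo - fe| is reduced by .5 in each cell.
    n = a + b + c + d
    obs = [a, b, c, d]
    exp = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
           (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    chi = 0.0
    for fo, fe in zip(obs, exp):
        diff = abs(fo - fe)
        if yates:
            diff = max(diff - 0.5, 0.0)
        chi += diff * diff / fe
    return chi
```

The correction always lowers χ², making the test more conservative.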

ANOVA
Analysis of variance is an extension of the t-test.
There are two types of ANOVA:
● One-way
● Two-way
There are two kinds of variance: within-group variance and between-group
variance.
a) Within-group: this is the average variance of the members of each
group around their respective group means, i.e. the mean value of the
scores in each sample.
b) Between-group: this represents the variance of the group means
around the total or grand mean of all groups, which is the best estimate
of the population mean.
One-way ANOVA: a one-way ANOVA is used to compare two or more
means from independent (unrelated) groups using the F-distribution. The
null hypothesis for the test is that the means are equal. Therefore, a
significant result means that the means are unequal.
Two-way ANOVA: with a two-way ANOVA there are two independent
variables. Use a two-way ANOVA when you have one measurement
variable (i.e. a quantitative variable) and two nominal variables. In other
words, if your experiment has a quantitative outcome and you have two
categorical explanatory variables, a two-way ANOVA is appropriate.
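A minimal one-way ANOVA sketch, forming the F ratio from between-group and within-group variance (the groups in the test are illustrative):

```python
def one_way_anova(groups):
    # F = between-group mean square / within-group mean square.
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    f_ratio = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f_ratio, (k - 1, n - k)
```

For exactly two groups the F ratio equals the square of the corresponding t.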
Standard error
What is the SE? Why do we use the SE in statistics?
The standard error is a statistical term that measures the accuracy with
which a sample distribution represents a population, by using the
standard deviation. The SE is considered part of descriptive statistics. It
represents the SD of the sampling distribution of the mean for a data set,
and serves as a measure of variation for a random variable, providing a
measurement of the spread. The smaller the spread, the more accurate
the data set. The SE is most useful as a means of calculating a confidence
interval. For a large sample, a 95% confidence interval is obtained as the
values 1.96×SE either side of the mean.
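A sketch of SE = SD/√N and the large-sample 95% interval, mean ± 1.96×SE (illustrative scores in the test):

```python
def standard_error(scores):
    # SE of the mean = SD / sqrt(N), with SD computed over the sample.
    n = len(scores)
    m = sum(scores) / n
    sd = (sum((x - m) ** 2 for x in scores) / n) ** 0.5
    return sd / n ** 0.5

def confidence_interval_95(scores):
    # Large-sample 95% CI: mean -/+ 1.96 * SE.
    m = sum(scores) / len(scores)
    se = standard_error(scores)
    return m - 1.96 * se, m + 1.96 * se
```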
ANOVA calculation:
Comparative table for ANOVA:
