Statistics Notes
Central tendency:
Mean:
Mean is a magnitude-based descriptive statistic that serves as a
representative score for the whole distribution. Two distinct properties of mean
are-
○ Representativeness: it should represent all the scores in a
distribution. Thus it must be efficient and reliable, and must
take every magnitude into account.
○ Minimum variability: the representative score, or mean,
should lie at the position of minimum variability (the sum of
squared deviations from it is least).
Calculation of mean
Long method: X̄ = Σfx/n
(x = midpoint of the class interval)
Short method: X̄ = AM + C·i
(C = Σfx′/n) (AM = assumed mean) (i = class-interval size)
(x′ = assumed deviation in interval units)
(n = total number of frequencies)
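As a rough sketch of the two methods (the class intervals, frequencies, and the choice of assumed mean below are invented example data):

```python
# Mean of a grouped frequency distribution by the long and short methods.
intervals = [(10, 19), (20, 29), (30, 39), (40, 49)]  # class intervals
freqs     = [4, 6, 8, 2]                              # f for each interval
mids      = [(lo + hi) / 2 for lo, hi in intervals]   # x = midpoint
n = sum(freqs)

# Long method: mean = Σfx / n
mean_long = sum(f * x for f, x in zip(freqs, mids)) / n

# Short method: mean = AM + C*i, with AM an assumed mean (midpoint of a
# chosen interval), x' the deviation in interval units, C = Σfx'/n,
# and i the class-interval size.
am_index = 2                    # assume the mean lies in the 3rd interval
am = mids[am_index]
i = 10                          # class-interval size
devs = [k - am_index for k in range(len(intervals))]  # x' per interval
c = sum(f * d for f, d in zip(freqs, devs)) / n
mean_short = am + c * i

print(mean_long, mean_short)    # both methods give the same mean
```

Both methods must agree; the short method only shifts the arithmetic to smaller numbers.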
10/08
Median:
When ungrouped scores or other measures are arranged in order, the
median is the midpoint of the series. When scores in a continuous series
are grouped into a frequency distribution, the median by definition is the
50% point in the distribution.
Calculation of median:
Mode:
Calculation of mode:
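The two calculation sections above are left blank in the notes. As a hedged sketch: the grouped-data median follows the usual N/2 formula (same symbols as the quartile formulas later in these notes), and the mode can be estimated by the common textbook approximation Mode = 3·Median − 2·Mean. All data below are invented:

```python
# Grouped-data median: Mdn = L + ((N/2 - F)/fm) * i, where L = exact lower
# limit of the median class, F = cumulative frequency below it, fm = its
# frequency, i = class-interval size.
freqs  = [4, 6, 8, 2]              # frequencies per class interval
lowers = [9.5, 19.5, 29.5, 39.5]   # exact lower limits of the intervals
i = 10
n = sum(freqs)

cum = 0
for idx, f in enumerate(freqs):
    if cum + f >= n / 2:           # the class interval containing N/2
        L, F, fm = lowers[idx], cum, f
        break
    cum += f
median = L + ((n / 2 - F) / fm) * i

# Empirical mode (textbook approximation): Mode = 3*Median - 2*Mean
mids = [lo + i / 2 for lo in lowers]
mean = sum(f * x for f, x in zip(freqs, mids)) / n
mode = 3 * median - 2 * mean
print(median, mode)
```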
Variability:
Measures of variability.
There are four measures which indicate variability or dispersion within a
set of measures:
1. Range
2. Q or Quartile deviation
3. Average deviation or AD
4. Standard deviation or SD
Range:
Range may be simply defined as the interval between the highest and
lowest score. It is the most general measure of spread or scatter. Range is
used
1. When the data are too scattered to justify the computation of a more
precise measure of variability.
2. When the knowledge of the extreme scores of the total spread is all
that is wanted.
Range = Highest - Lowest score
Quartile deviation or Q :
Quartile deviation, or Q, is the semi-interquartile range. The interquartile
range indicates the spread of the middle 50% of the scores, which lie between
the first and the third quartiles. Q is one half of the range of the middle
50% of the scores.
Q1 = L + {(N/4 − F)/fm} × i
L = exact lower limit of the class interval that contains N/4
N = total number of scores.
F = cumulative frequency below the class interval that contains N/4
fm = frequency of the class interval in which N/4 lies.
i = size of class interval.
Q3 = L + {(3N/4 − F)/fm} × i
L = exact lower limit of the class interval that contains 3N/4
N = total number of scores.
F = cumulative frequency below the class interval that contains 3N/4
fm = frequency of the class interval in which 3N/4 lies.
i = size of class interval.
Q = (Q3 − Q1)/2
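The two quartile formulas share the same structure, so they can be sketched with one helper (the frequency distribution below is invented example data):

```python
# Q1, Q3, and the quartile deviation Q for a grouped distribution,
# following the formulas above. n is the total number of scores.
freqs  = [4, 6, 8, 2]
lowers = [9.5, 19.5, 29.5, 39.5]   # exact lower limits
i = 10
n = sum(freqs)

def quartile(point):
    """point = N/4 for Q1, 3N/4 for Q3."""
    cum = 0
    for idx, f in enumerate(freqs):
        if cum + f >= point:       # class interval containing the point
            L, F, fm = lowers[idx], cum, f
            return L + ((point - F) / fm) * i
        cum += f

q1 = quartile(n / 4)
q3 = quartile(3 * n / 4)
q = (q3 - q1) / 2                  # quartile deviation
print(q1, q3, q)
```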
Average deviation(AD):
Average deviation is the mean of the deviations of all the scores in a series
taken from the mean. In averaging the deviations no account is taken of
signs, and all deviations, whether positive or negative, are treated as
positive.
Use:
1. When it is desired to weight all deviations from the mean.
2. When extreme deviations would influence SD unduly.
Calculation of AD
AD = Σ|x|/n   (x = deviation of each score from the mean)
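A minimal sketch of the AD formula (the scores are invented example data):

```python
# Average deviation: mean of the absolute deviations from the mean.
scores = [12, 15, 10, 18, 20]
n = len(scores)
mean = sum(scores) / n
ad = sum(abs(x - mean) for x in scores) / n   # Σ|x| / n, x = X - mean
print(ad)
```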
Calculation of SD:
SD = √(ΣX²/N − (Mean)²)
For grouped data (short method):
SD = i × √(Σf(x′)²/N − C²)
i = class-interval size
f = frequency
x′ = assumed deviation
C = Σfx′/N
N = total no. of frequencies.
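A sketch of the raw-score SD formula above (scores are invented example data):

```python
import math

# SD by the raw-score formula: SD = sqrt(ΣX²/N - Mean²)
scores = [12, 15, 10, 18, 20]
n = len(scores)
mean = sum(scores) / n
sd = math.sqrt(sum(x * x for x in scores) / n - mean ** 2)
print(sd)
```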
Correlation:
Properties of correlation:
Formula:
r = {NΣXY − ΣXΣY} / √[{NΣX² − (ΣX)²} × {NΣY² − (ΣY)²}]
Degrees of freedom
df = N - 2
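The raw-score formula for r, together with its degrees of freedom, can be sketched as follows (X and Y are invented example data):

```python
import math

# Pearson r from the raw-score formula above.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
N = len(X)
sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx = sum(x * x for x in X)
syy = sum(y * y for y in Y)
r = (N * sxy - sx * sy) / math.sqrt((N * sxx - sx ** 2) * (N * syy - sy ** 2))
df = N - 2          # degrees of freedom for testing r
print(r, df)
```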
When the researcher uses non-parametric statistics, the researcher is not
interested in making inferences about the population. Rank-order correlation
is a non-parametric statistic; its power is therefore lower than that of
parametric statistics, since it does not use population parameters. When N is
small, the rank-difference method gives as adequate a result as that obtained
by computing r. Rules for ranking when using rho:
1. Rank 1 is given to the highest score and the last rank to the
lowest score.
2. If two scores in one set are equal, the two consecutive ranks they
would occupy are added and divided by 2, and the obtained value is
placed beside both scores as their rank.
Calculation:
D = R2 − R1
(R = rank of the score on each variable; the sign of D does not matter,
since D is squared)
ρ = 1 − (6ΣD²)/{N(N² − 1)}
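Under the ranking rules above (rank 1 to the highest score, tied scores sharing the average rank), rho can be sketched as follows with invented scores:

```python
# Rank-difference (Spearman) correlation.
def ranks(scores):
    order = sorted(scores, reverse=True)       # highest score gets rank 1
    out = []
    for s in scores:
        first = order.index(s) + 1             # first rank occupied by s
        count = order.count(s)                 # number of tied scores
        out.append(first + (count - 1) / 2)    # average of the tied ranks
    return out

A = [85, 70, 70, 60, 50]
B = [80, 75, 65, 70, 40]
Ra, Rb = ranks(A), ranks(B)
n = len(A)
d2 = sum((ra - rb) ** 2 for ra, rb in zip(Ra, Rb))   # ΣD²
rho = 1 - (6 * d2) / (n * (n ** 2 - 1))
print(rho)
```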
Biserial Correlation:
Calculation:
r_bis = {(Mp − Mq)/σt} × (pq/y)
Mp = mean of the group in category p; Mq = mean of the group in category q;
σt = SD of the entire group; p and q = proportions of cases in the two
categories; y = height of the normal-curve ordinate separating p and q.
Point biserial:
The point biserial correlation coefficient, rpbi, is a special case of Pearson’s
correlation coefficient. It measures the relationship between two variables:
● One continuous measurement variable, and
● One naturally dichotomous variable.
If a variable is artificially dichotomized, the new dichotomous variable may
be conceptualized as having an underlying continuity. If this is the case, a
biserial correlation is the more appropriate calculation.
Assumptions:
Calculation:
r_pbi = {(Mp − Mq)/σt} × √(pq)
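The formula above can be sketched as follows; the scores and the dichotomous grouping are invented example data, and σt is taken as the SD of all scores (ΣX²/N − M² form):

```python
import math

# Point-biserial correlation from the formula above.
scores = [10, 12, 14, 8, 9, 11]
group  = [1, 1, 1, 0, 0, 0]     # naturally dichotomous variable
n = len(scores)
mean = sum(scores) / n
st = math.sqrt(sum(x * x for x in scores) / n - mean ** 2)  # SD of all scores

p_scores = [x for x, g in zip(scores, group) if g == 1]
q_scores = [x for x, g in zip(scores, group) if g == 0]
p = len(p_scores) / n           # proportion in category p
q = 1 - p
mp = sum(p_scores) / len(p_scores)
mq = sum(q_scores) / len(q_scores)

r_pbi = (mp - mq) / st * math.sqrt(p * q)
print(r_pbi)
```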
Tetrachoric correlation:
Tetrachoric correlation is used when both the variables are dichotomous.
It is especially useful when we wish to find the relation between two
characters or attributes, neither of which is measurable in scores but
both of which are capable of being separated into two categories.
Assumptions of tetrachoric correlation are
1. The variables either have continuous metric data which have been
artificially dichotomized and may yield continuous score on further
exploration.
2. Such continuous series of scores of the dichotomous variable forms
unimodal and normal distribution in the population.
3. The point of dichotomy is close to the median of each variable, so that
neither of the class proportions is far from .80
4. There exists a linear relationship between the continuous scores of
the variables.
Calculation:
X (top variable)
     +    −
A    B
C    D
If AD > BC, the correlation is positive; if AD < BC, it is negative.
φ = (AD − BC) / √[(A+B)(C+D)(B+D)(A+C)]
Check significance,
df=(r-1)(c-1)
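The fourfold-table coefficient above can be sketched as follows; the cell frequencies are invented, and the chi-square step (N × φ²) is a standard way to check significance at df = 1:

```python
import math

# Fourfold-table coefficient from the formula above, with cells
#   A B
#   C D
A, B, C, D = 30, 10, 10, 30
phi = (A * D - B * C) / math.sqrt((A + B) * (C + D) * (B + D) * (A + C))

# Significance: chi-square = N * phi**2, df = (2-1)(2-1) = 1
N = A + B + C + D
chi_sq = N * phi ** 2
print(phi, chi_sq)
```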
Contingency Correlation:
Calculation:
C = √{(S − N)/S}
S = Σ(fo²/fe)
Check significance,
df=(r-1)(c-1)
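A sketch of the contingency coefficient above; the observed 2×2 table is invented, and the expected frequencies fe are computed from row and column totals in the usual way:

```python
import math

# Contingency coefficient: C = sqrt((S - N)/S), S = Σ(fo²/fe).
obs = [[30, 10],
       [10, 30]]
N = sum(sum(row) for row in obs)
row_tot = [sum(row) for row in obs]
col_tot = [sum(col) for col in zip(*obs)]

S = sum(obs[i][j] ** 2 / (row_tot[i] * col_tot[j] / N)
        for i in range(2) for j in range(2))
C = math.sqrt((S - N) / S)
print(C)    # significance via chi-square = S - N at df = (r-1)(c-1)
```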
T-test:
Assumptions of the t-test.
Formulas:
Trick to identify which formula to use:
Large sample: N ≥ 25
t = (M1 ~ M2) / √(σ1²/N1 + σ2²/N2)
t = (M1 ~ M2) / √((σ1² + σ2²)/N)   (when N1 = N2 = N)
M1 = mean of the first group
M2 = mean of the second group
Small sample: N < 25
t = D̄ / S_D̄
S_D̄ = √(Σd²/(N(N − 1)))
d = D − D̄
(D = difference between paired scores; D̄ = mean of the differences)
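The small-sample (difference-method) formula above can be sketched as follows; the paired differences D are invented example data:

```python
import math

# Small-sample t by the difference method: t = D̄ / S_D̄.
D = [4, 2, 5, 1, 3]        # differences between paired scores
n = len(D)
d_bar = sum(D) / n         # mean difference D̄
sd_bar = math.sqrt(sum((x - d_bar) ** 2 for x in D) / (n * (n - 1)))
t = d_bar / sd_bar
print(t)                   # compare with the t table at df = n - 1
```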
Probability =
desired no. of events / total no. of events
Thus to sum up
In an NPC, "Z" is the unit of the baseline. There are different scores on the
baseline, and therefore the baseline must be framed in terms of a definite
score which can reflect all the scores. If all the scores are linearly
transformed into one particular type of score, then all the scores can be
kept in a single distribution.
"Z" is a score obtained from linear transformation. It is a reference point for
comparing the different types of scores. Thus "Z" can be defined as a
standard score which provides a reference point for comparing different
scores coming from different origins.
Z = (Score − Mean) / SD
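The linear transformation above can be sketched as follows (scores are invented example data); whatever the original units, the resulting Z scores have mean 0 and SD 1:

```python
import math

# Z = (score - mean) / SD for each raw score.
scores = [50, 60, 70, 80, 90]
n = len(scores)
mean = sum(scores) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)
z = [(x - mean) / sd for x in scores]
print(z)        # mean of z is 0, SD of z is 1
```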
Properties of NPC:
The curve is bilaterally symmetrical about the mean: at that point X = M,
Z = 0, and the ordinate y divides the curve into two equal halves. As the
curve moves away from the point where X = M, in either direction, the height
of the ordinate decreases.
Yates's Correction:
While calculating chi-square, if one or more of the expected frequencies
are found to be very small, Yates's correction should be applied. This is a
correction for continuity. In a 2×2, that is fourfold, table, Yates's
correction is done if fe (expected frequency) is less than 10. However, if
the table is 2×3 or larger, we can tolerate fe as low as 5.
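A sketch of chi-square on a fourfold table with the continuity correction described above: |fo − fe| is reduced by 0.5 before squaring. The observed frequencies are invented:

```python
# Chi-square on a 2x2 table with Yates's correction.
obs = [[8, 12],
       [14, 6]]
N = sum(sum(r) for r in obs)
row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]

chi_sq = 0.0
for i in range(2):
    for j in range(2):
        fe = row[i] * col[j] / N                  # expected frequency
        chi_sq += (abs(obs[i][j] - fe) - 0.5) ** 2 / fe
print(chi_sq)   # df = (2-1)(2-1) = 1
```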
ANOVA
Analysis of variance is an extension of the t-test.
There are two types of ANOVA-
● One way
● Two way
There are 2 kinds of variance: within-group variance and between-group
variance.
a) Within-group: this is the average variance of the members of each
group around the respective group means, that is, the mean value of the
scores in a sample.
b) Between-group: this represents the variance of the group means around
the total or grand mean of all groups, that is, the best estimate of the
population mean.
One-way ANOVA- A one-way ANOVA is used to compare the means of two or more
independent (unrelated) groups using the F-distribution. The null hypothesis
for the test is that all the group means are equal. Therefore, a significant
result means that at least two of the means are unequal.
Two-way ANOVA- With a two-way ANOVA, there are two independent variables.
Use a two-way ANOVA when you have one measurement variable (i.e. a
quantitative variable) and two nominal variables. In other words, if your
experiment has a quantitative outcome and you have two categorical
explanatory variables, a two-way ANOVA is appropriate.
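The two variances described above combine into the F ratio of a one-way ANOVA; a sketch with three invented groups:

```python
# One-way ANOVA: F = (between-group variance) / (within-group variance).
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
k = len(groups)                              # number of groups
n = sum(len(g) for g in groups)              # total number of scores
grand_mean = sum(sum(g) for g in groups) / n

# Between-group sum of squares: group means around the grand mean.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: scores around their own group mean.
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

df_between, df_within = k - 1, n - k
F = (ss_between / df_between) / (ss_within / df_within)
print(F)   # compare with the F table at (df_between, df_within)
```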
Standard error
What is SE? Why do we use SE in statistics?
The standard error is a statistical term that measures the accuracy with
which a sample distribution represents a population, by using standard
deviation. The SE is considered part of descriptive statistics. It represents
the SD of the mean within a data set (SE = SD/√N). This serves as a measure
of variation for a random variable, providing a measurement of the spread.
The smaller the spread, the more accurate the data set. The SE is most useful
as a means of calculating a confidence interval. For a large sample, a 95%
confidence interval is obtained as the values 1.96 × SE either side of the
mean.
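A sketch of SE and the large-sample 95% confidence interval described above; the scores are invented, and the SD here is the sample form (divisor n − 1):

```python
import math

# SE of the mean and the 95% CI: mean ± 1.96 * SE.
scores = [52, 48, 50, 55, 45, 53, 47, 50]
n = len(scores)
mean = sum(scores) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))  # sample SD
se = sd / math.sqrt(n)
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(mean, se, ci)
```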
ANOVA calculation:
Comparative table for ANOVA: