Presentation 3
Probability and Sampling Distributions
R as a set of statistical tables
• The R suite of programs provides a simple way to obtain values from the statistical tables of just about any probability distribution of interest, and it also allows easy plotting of the form of these distributions.
• There are four basic R commands that apply to the various distributions defined in R.
• Letting DIST denote the particular distribution and parameters the required parameter values, the four commands are dDIST(x, parameters) for the density or probability mass function, pDIST(q, parameters) for the cumulative distribution function, qDIST(p, parameters) for the quantile function, and rDIST(n, parameters) for random number generation.
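For example, with the Normal distribution (DIST = norm), a minimal illustration of the four commands:
> dnorm(0) # density of N(0,1) at 0
> pnorm(1.96) # cumulative probability P(Z <= 1.96)
> qnorm(0.975) # quantile: the z with P(Z <= z) = 0.975
> rnorm(5) # five random draws from N(0,1)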
[Figure: two density plots, with x-axes z and y]
Probability distributions
> z <- seq(-4, 4, by = 0.1)
> y <- dnorm(z)
> plot(z, y, type = "l") # draw the standard Normal density first
> y2 <- dt(z, df = 3)
> lines(z, y2, col = "red") # overlay Student's t density with 3 df
• One can see that Student's t density is very similar to the standard Normal density, except that the t density has an additional parameter called the degrees of freedom (df).
Exercise: Change df=3 to various values (e.g. df=30) and compare the two curves.
Note: the t density (drawn in red) has fatter tails.
Probability mass function
• For discrete distributions, where variables can take on only distinct values, it is preferable to draw a pin diagram (type="h" in plot), as in the sketch after this list.
• par(mfrow=c(2,1)) # split the plotting window into two rows of panels
• The distribution drawn corresponds to, for example, the number of 5s or 6s in 50 throws of a symmetrical die.
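The number of 5s or 6s in 50 throws of a fair die is binomial with size = 50 and prob = 1/3, so a minimal sketch of the corresponding pin diagram is:
> x <- 0:50
> plot(x, dbinom(x, size = 50, prob = 1/3), type = "h")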
Example 2: the Poisson distribution with λ = 0.2
> x <- 0:10
> y <- dpois(x, 0.2)
> data.frame(Prob = y, row.names = x)
> plot(x, y, type = "h", xlab = "Sequence Errors", ylab = "Probability")
Cumulative distribution & Quantiles
• R provides a useful mechanism for determining p-values: instead of searching through statistical tables, they can be obtained directly with the pDIST and qDIST functions. Some examples are shown below.
> pnorm(1.96, 0,1) # the probability that Z is less than or equal to 1.96
[1] 0.9750021
> 2*pnorm(-1.96) # 2-sided p-value for the normal distribution
[1] 0.04999579
> qnorm(0.975)
[1] 1.959964
> 2*pt(-2.43,df=13) # 2-sided p-value for t distribution
[1] 0.0303309
To find the probability of getting t = 1.50 (or greater) when df = 15:
• Method 1
• > pt(1.50, df = 15, lower.tail = FALSE)
• [1] 0.07718333
• Method 2
• > 1 - pt(1.50, df = 15)
• [1] 0.07718333
What is the probability of getting 12.1 or greater from a chi-square distribution with 8 degrees of freedom?
• # Method 1
• > pchisq(12.1, df = 8, lower.tail = FALSE)
• [1] 0.1467976
• # Method 2
• > 1 - pchisq(12.1, df = 8)
• [1] 0.1467976
• qt() calculates the quantile of the t distribution for a given probability value and degrees of freedom.
• The default argument lower.tail = TRUE returns the lower-tail probability P(X <= x); it has to be set to FALSE to obtain the upper-tail probability P(X > x).
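A minimal illustration of lower.tail with the standard Normal:
> pnorm(1.96) # lower tail, P(Z <= 1.96)
> pnorm(1.96, lower.tail = FALSE) # upper tail, P(Z > 1.96), equal to 1 - pnorm(1.96)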
> qt(0.025, df = 13)
[1] -2.160369
> qchisq(0.975,1)
[1] 5.023886
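The t-test examples below use the samples Method_1 and Method_2 from an earlier slide. Judging by the Wilcoxon output further on (W = 89, p = 0.007497), they appear to be the latent-heat-of-fusion data from "An Introduction to R" (Section 8.3); those values are assumed here so the examples can be reproduced:
> Method_1 <- c(79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04, 79.97, 80.05, 80.03, 80.02, 80.00, 80.02) # assumed values
> Method_2 <- c(80.02, 79.94, 79.98, 79.97, 79.97, 80.03, 79.95, 79.97) # assumed values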
One Sample t-test
> t.test(Method_1, mu = 80)
or
> t.test(Method_1, mu = 80, alternative = "two.sided", conf.level = 0.95)
Two Sample t-test
For the previous sample data:
(a) Test for the equality of the means of the two samples.
(b) Test for the equality of the variances of the two samples.
# Box plots provide a simple graphical comparison of the two samples.
> boxplot(Method_1, Method_2)
# The plot indicates that the first group tends to give higher results than the second.
• To test for the equality of the means of the two samples, we can use an unpaired t-test:
> t.test(Method_1, Method_2, alternative = "two.sided")
## which does indicate a significant difference (p < 0.05), assuming normality.
Comparison of variances
• By default, R's t.test function does not assume equality of variances in the two samples (it uses the Welch approximation to the degrees of freedom). We can, however, use the F-test to test for the equality of variances in the two samples, provided that the two samples are from normal populations.
• Checking homogeneity (approximate equality) of variances is, on the one
hand, a necessary precondition for a number of methods (for example
comparison of mean values) and on the other hand the heart of a number
of more sophisticated methods (such as analysis of variance).
• F-test for variance equality
• This test is based on the ratio of the two sample variances; the null hypothesis asserts that the ratio is one.
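As a quick illustration with the (assumed) Method data above, the F statistic is simply the ratio of the two sample variances:
> var(Method_1) / var(Method_2) # compared against an F distribution with (n1-1, n2-1) df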
• Given the above data sets, the R code for this test at the 0.95 confidence level is:
> var.test(Method_1, Method_2)
## which shows no evidence of a significant difference, and so we can use the classical
t-test that assumes equality of the variances:
> t.test(Method_1, Method_2, var.equal = TRUE, alternative = "two.sided")
Paired t-Test
• Paired tests are used when there are two measurements on the same
experimental unit.
• The theory is essentially based on taking differences and thus
reducing the problem to that of a one-sample test.
• First generate the following data:
> x <- sample(Method_1, 7, replace = FALSE)
> y <- sample(Method_2, 7, replace = FALSE)
> t.test(x, y, paired = TRUE)
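Because the paired test reduces to a one-sample test on the differences, the same result can be obtained with:
> t.test(x - y, mu = 0) # identical to the paired test above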
• All the tests seen so far assume normality of the two samples. The
two-sample Wilcoxon (or Mann Whitney) test only assumes a
common continuous distribution under the null hypothesis.
> wilcox.test(A, B) # A and B denote the two samples (here, Method_1 and Method_2)
Wilcoxon rank sum test with continuity correction
data: A and B
W = 89, p-value = 0.007497
alternative hypothesis: true location shift is not equal to 0
• The package exactRankTests is required when there are ties in the data, in order to conduct an exact test.
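A minimal sketch using that package's wilcox.exact() function on the samples above:
> library(exactRankTests)
> wilcox.exact(A, B) # exact Wilcoxon rank sum test; handles ties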
Nonparametric Tests of Group Differences
R provides functions for carrying out the Mann-Whitney U, Wilcoxon Signed Rank, Kruskal-Wallis, and Friedman tests.
• # independent 2-group Mann-Whitney U Test
wilcox.test(y ~ A) # where y is numeric and A is a binary factor
• # independent 2-group Mann-Whitney U Test
wilcox.test(y,x) # where y and x are numeric
• # dependent 2-group Wilcoxon Signed Rank Test
wilcox.test(y1,y2,paired=TRUE) # where y1 and y2 are numeric
• # Kruskal-Wallis Test: One-Way ANOVA by Ranks
kruskal.test(y ~ A) # where y is numeric and A is a factor
• # Randomized Block Design - Friedman Test
friedman.test(y~A|B)
# where y are the data values, A is a grouping factor
# and B is a blocking factor
χ2 test for I × J contingency table
Consider the following two categorical variables:
x <- as.factor(c("Milk","Milk","Milk","Milk","Tea","Tea","Tea","Tea"))
y <- as.factor(c("Milk","Milk","Milk","Tea","Milk","Tea","Tea","Tea"))
These vectors of categorical variables are converted into a contingency table in R by:
table(x,y)
y
x Milk Tea
Milk 3 1
Tea 1 3
To carry out the test, we use the following R code:
chisq.test(x,y)
R output:
Pearson's Chi-squared test with Yates' continuity correction
data: x and y
X-squared = 0.5, df = 1, p-value = 0.4795
• There is no significant association between the two variables.
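To see the effect of Yates' continuity correction (applied in the output above, and worth reading more about), the test can be rerun without it; a minimal sketch:
> chisq.test(x, y, correct = FALSE) # Pearson chi-square without the continuity correction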
Correlation coefficients for continuous variables
> x <- c(1, 2, 3, 5, 7, 9)
> y <- c(3, 2, 5, 6, 8, 11)
> cor.test(x, y, method = "pearson")
• If the linearity of a relationship or the normality of the residuals is
doubtful, a rank correlation test can be carried out. Mostly,
Spearman’s rank correlation coefficient is used:
> cor.test(x, y, method = "spearman")
Exercises
• 5.1 Do the values of the react data set (notice that this is a single vector, not a data frame) look reasonably normally distributed? Does the mean differ significantly from zero according to a t test?
• 5.2 In the data set vitcap, use a t test to compare the vital capacity for the two groups. Calculate a 99% confidence interval for the difference.
• 5.3 Perform the analyses of the react and vitcap data using
nonparametric techniques.