
Probability and Sampling Distributions
R as a set of statistical tables
• R provides a simple way to obtain the values found in statistical tables for just about any probability distribution of interest, and it also allows easy plotting of the form of these distributions.

General syntax for distribution functions

• There are four basic R commands that apply to the various distributions defined in R.
• Letting DIST denote the particular distribution and parameters the parameters that specify it:
• dDIST(x, parameters) --- probability density (or mass) of DIST evaluated at x
• pDIST(x, parameters) --- returns Pr(X <= x) for X ~ DIST(parameters)
• qDIST(p, parameters) --- returns the quantile x satisfying Pr(X <= x) = p
• rDIST(n, parameters) --- generates n random variables from DIST(parameters)


Distribution         Parameters     R Function
Binomial             size, prob     binom
Poisson              lambda         pois
Geometric            prob           geom
Negative binomial    size, prob     nbinom
Normal               mean, sd       norm
Student t            df             t
Chi-square           df             chisq
F                    df1, df2       f
Exponential          rate           exp
Gamma                shape, rate    gamma
Uniform              min, max       unif
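For example, combining the four prefixes with the norm root from the table gives the Normal functions (a quick sketch):
> dnorm(0) # density of N(0,1) at x = 0
> pnorm(1.96) # Pr(Z <= 1.96) for Z ~ N(0,1)
> qnorm(0.975) # the 97.5% quantile, approximately 1.96
> rnorm(5, mean = 0, sd = 1) # five random draws from N(0,1)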
Probability density function

• dnorm(2) # returns the probability density of the standard normal distribution evaluated at z=2
## let z be a sequence of values ranging between -3.5 and 3.5, increasing by 0.1
• z<-seq(from=-3.5,to=3.5,by=0.1) # defines z-values for the pdf
• f1<-dnorm(z,0,1)
• par(mfrow=c(2,2))
• plot(z,f1,type="l") # type="l" plots lines
• title("Normal distribution")
• y<-seq(0,5,length=100)
• f2<-dchisq(y,1) # computes the density of y for χ2 with df=1
• plot(y,f2,type="l")
• title("Chi-square distribution")
[Figure: density curves, "Normal distribution" (f1 against z) and "Chi-square distribution" (f2 against y)]

Probability distributions
>z=seq(-4,4,by=.1)
>y=dnorm(z)
>plot(z,y,type="l", main="Standard Normal and t (df=3) Densities")
>y2=dt(z,df=3)
>lines(z,y2,type="l",col="red")

• One can see that Student's t density is very similar to the standard Normal density, except that the t density has an additional parameter called degrees of freedom (df).
• Each new choice of df produces a new t density.
• For df = 100 or larger, the t density is almost the same as the standard Normal.
Exercise: Change df=3 to various values (e.g. df=30) and compare the two curves; a sketch follows below.
Note: the t density (in red) has fatter tails.
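A minimal sketch for this exercise (the same plotting code as above, with only df changed to 30):
>plot(z,y,type="l", main="Standard Normal and t (df=30) Densities")
>y2=dt(z,df=30)
>lines(z,y2,type="l",col="red") # with df=30 the two curves nearly coincide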
Probability mass function
• For discrete distributions, where variables can take on only distinct values, it is preferable to draw a pin diagram.
• Example 1: the binomial distribution with n = 50 and p = 0.33

> par(mfrow=c(2,1))
> x <- 0:50 # The Binomial distribution takes values 0,...,n
> plot(x, dbinom(x,size=50,prob=.33), type="h", main = "Binomial mass")
> plot(x, pbinom(x,size=50,prob=.33), type = "s", main = "Binomial distribution")

• The distribution drawn corresponds to, for example, the number of 5s or 6s in 50 throws of a symmetrical die.
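A small simulation sketch of that die example (illustrative, not from the source): counting 5s and 6s in 50 throws, repeated many times, should reproduce this mass function, since Pr(5 or 6) = 2/6 ≈ 0.33.
> counts <- replicate(10000, sum(sample(1:6, 50, replace=TRUE) >= 5)) # 10000 experiments of 50 throws
> mean(counts == 17) # empirical Pr(exactly 17 fives or sixes)
> dbinom(17, size=50, prob=1/3) # theoretical counterpart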
Example 2: the Poisson distribution with lambda = 0.2
> x<-0:10
> y<-dpois(0:10,0.2)
>data.frame("Prob"=y,row.names=x)
> plot(0:10, dpois(0:10,0.2), type='h', xlab="Sequence Errors", ylab="Probability" )
Cumulative distribution & Quantiles
• R has a useful mechanism for determining p-values: instead of searching through statistical tables, they can easily be obtained with the pDIST and qDIST functions. Some examples are shown below.
> pnorm(1.96, 0,1) # the probability that Z is less than or equal to 1.96
[1] 0.9750021
>2*pnorm(-1.96) # 2-sided p-value for normal distribution
[1] 0.0249979
> qnorm(0.975)
[1] 1.959964
> 2*pt(-2.43,df=13) # 2-sided p-value for t distribution
[1] 0.0303309
To find the probability of getting t = 1.50 (or greater) when df = 15, use either of the following methods:
• Method 1
• > pt(1.50, df=15, lower.tail = FALSE)
• [1] 0.07718333
• Method 2
• > 1 - pt(1.50, df = 15)
• [1] 0.07718333

What is the probability of getting 12.1 or greater for a chi-square distribution with 8 degrees of freedom?
• # Method 1
• > pchisq(12.1, df=8, lower.tail = FALSE)
• [1] 0.1467976
• # Method 2
• > 1 - pchisq(12.1, df=8)
• [1] 0.1467976
• qt() calculates the quantile for a given probability value and degrees of freedom.
• qt(p, df, lower.tail = TRUE)
• The default argument lower.tail = TRUE returns the quantile x satisfying Pr(X <= x) = p. Set lower.tail = FALSE to work with the upper tail instead, i.e. the x satisfying Pr(X > x) = p.
> qt(0.03391, 13, lower.tail = FALSE)
> qt(0.025, df=13) # the quantile for probability 0.025
[1] -2.160369
> qchisq(0.975, 1)
[1] 5.023886
> qf(0.01, 2, 7, lower.tail = FALSE) ## upper 1% point for an F(2, 7)


Random number generators in R
• E.g. random numbers from the normal distribution can be drawn using the rnorm() function:
• rnorm(10) # draws 10 random numbers from a standard normal distribution
• rnorm(10,5,2) # draws 10 random samples from a normal distribution with mean 5 and standard deviation 2
• rnorm(1000,5,2) # draws 1000 random samples from the same distribution
• runif(10) # draws 10 values from the uniform distribution on (0,1)
• rbinom(20,size=5,prob=.2) # draws 20 values from a Binomial(5, 0.2)
• rpois(20,6) # generates a random sample of size 20 from a Poisson with lambda = 6
Continued…
• x <- rnorm(100, 50, 10) # 100 draws from N(50, 10)
• hist(x, probability = TRUE) # histogram on the density scale
• mean(x)
• sd(x)
• xnew <- seq(min(x), max(x), length=100)
• lines(xnew, dnorm(xnew, mean(x), sd(x)), col="red") # overlay the fitted normal density
Simulating the Sampling Distribution of the Mean
• The CLT is perhaps the most important concept in statistics
>data<-rnorm(25,100,15)
>mean(data)
>sd(data)
• We know that when the population is normal, the sample mean has a normal distribution with mean 100 and a standard deviation of 15/sqrt(25) = 3.
• Let’s verify that with a statistical simulation
>mean(rnorm(25,100,15))
>replicate(10, mean(rnorm(25,100,15))) # replicate 10 times
>data<- replicate(100000, mean(rnorm(25,100,15))) # replicate 100000 times and save in data
>mean(data)
>sd(data)
• These results are very close to our theoretical expectations; a quick check follows below.
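The simulated mean(data) and sd(data) above should land near the theoretical values:
> 15/sqrt(25) # theoretical standard error of the mean
[1] 3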
Simulating the Sampling Distribution of the Mean
• Let’s look at a histogram of our means
>hist(data, breaks=100, main="Histogram of Sample Means", col="blue", xlab = "Sample Means")
>plot(density(data)) # it certainly looks normal
Example 2
>x <- rnorm(1000,5,2)
>hist(x, probability = TRUE)
>mean(x)
>lines(density(x), col="blue")
>xnew <- seq(min(x), max(x), length=100)
>lines(xnew, dnorm(xnew, mean(x), sd(x)), col="red")
Examining the distribution of a set of data
• Given a (univariate) set of data we can examine its distribution in a large number of
ways. The simplest is to examine the numbers.
>attach(faithful)
> summary(eruptions)
Min. 1st Qu. Median  Mean  3rd Qu. Max.
1.600 2.163  4.000   3.488 4.454   5.100
Graphical examination of normality
• R has a function hist to plot histograms
>hist(eruptions) # it has an unusual distribution
>hist(eruptions, seq(1.6, 5.2, 0.2), prob=TRUE)
>lines(density(eruptions, bw=0.1)) # add a density estimate with bandwidth 0.1
>rug(eruptions) # show the actual data points
##Q-Q plots can help us examine this more carefully
>qqnorm(eruptions)
>qqline(eruptions)
Cont’d…..
• We might want a more formal test of agreement with normality (or not).
• R provides the Shapiro-Wilk test
> shapiro.test(eruptions)
Shapiro-Wilk normality test
data: eruptions
W = 0.8459, p-value = 9.036e-16
• Since the p-value is very small, we reject H0 and conclude that the sample is unlikely to be from a normal distribution.
Kolmogorov-Smirnov test

> long <- eruptions[eruptions > 3]
> ks.test(long, "pnorm", mean = mean(long), sd = sqrt(var(long)))
• One-sample Kolmogorov-Smirnov test
data: long
D = 0.0661, p-value = 0.4284
alternative hypothesis: two-sided
• (Note that the distribution theory is not valid here as we have estimated the parameters of the normal
distribution from the same sample.)
Hypothesis Tests in R

Recall the basic structure of hypothesis tests:


• An overall model and related assumptions are made. (The most common
being observations following a normal distribution.)
• The null (H0) and alternative (H1 or HA) hypotheses are specified. Usually
the null specifies a particular value of a parameter.
• With given data, the value of the test statistic is calculated.
• Under the general assumptions, as well as assuming the null hypothesis is
true, the distribution of the test statistic is known.
• Given the distribution and value of the test statistic, as well as the form of
the alternative hypothesis, we can calculate a p-value of the test.
• Based on the p-value and a pre-specified level of significance, we make a decision. One of:
-- Reject the null hypothesis.
-- Fail to reject the null hypothesis.
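A minimal sketch of these steps (illustrative numbers; a one-sample t test of H0: mu = 80, with the p-value computed both by hand and by t.test()):
> x <- c(79.98, 80.04, 80.02, 80.04, 80.03) # small illustrative sample
> tstat <- (mean(x) - 80) / (sd(x)/sqrt(length(x))) # value of the test statistic
> 2 * pt(-abs(tstat), df = length(x) - 1) # two-sided p-value
> t.test(x, mu = 80)$p.value # the same value from t.test()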
One and two sample tests

• Consider the following data on the latent heat of fusion of ice (cal/gm) from Rice (1995, p. 490).
>Method_1<-c(79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04, 79.97, 80.05, 80.03, 80.02, 80.00, 80.02)
>Method_2<-c(80.02, 79.94, 79.98, 79.97, 79.97, 80.03, 79.95, 79.97)
One Sample t-test
Suppose we want to test if the mean of Method_1 is 80:
We specify µ0 with mu=80.
>t.test(Method_1, mu=80) or
>t.test(Method_1, mu=80, alternative="two.sided", conf.level = 0.95)
Two Sample t-test
For the previous sample data:
(a) Test for the equality of the means of the two samples.
(b) Test for the equality of the variances of the two samples.
# Box plots provide a simple graphical comparison of the two samples.
>boxplot(Method_1, Method_2) # indicates that the first group tends to give higher results than the second
• To test for the equality of the means of the two samples, we can use an unpaired t-test:
>t.test(Method_1, Method_2, alternative = "two.sided")
## which does indicate a significant difference (p < 0.05), assuming normality.
Comparison of variances
• By default the R function t.test does not assume equality of variances in the two samples. We can, however, use the F-test to test for the equality of variances in the two samples, provided that the two samples are from normal populations.
• Checking homogeneity (approximate equality) of variances is, on the one
hand, a necessary precondition for a number of methods (for example
comparison of mean values) and on the other hand the heart of a number
of more sophisticated methods (such as analysis of variance).
• F-test for variance equality

• This test depends on the ratio of variances. The null hypothesis asserts
the ratio to be one.
• Given the above data sets, the R code for this test, with a confidence level of 0.95, is:
>var.test(Method_1, Method_2, conf.level = 0.95)
## which shows no evidence of a significant difference, so we can use the classical t-test that assumes equality of the variances:
>t.test(Method_1, Method_2, var.equal=T, alternative = "two.sided")
Paired t-Test
• Paired tests are used when there are two measurements on the same
experimental unit.
• The theory is essentially based on taking differences and thus
reducing the problem to that of a one-sample test.
• First generate the following data
• x<-sample(Method_1,7,replace=FALSE)
• y<-sample(Method_2,7,replace=FALSE)
• t.test(x,y,paired=TRUE)
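Since the paired test reduces to a one-sample test on the differences, the following one-liner gives an identical statistic and p-value:
• t.test(x - y) # same result as t.test(x, y, paired=TRUE)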
• All the tests seen so far assume normality of the two samples. The two-sample Wilcoxon (or Mann-Whitney) test only assumes a common continuous distribution under the null hypothesis.
> wilcox.test(A, B) # A and B are two numeric sample vectors
Wilcoxon rank sum test with continuity correction
data: A and B
W = 89, p-value = 0.007497
alternative hypothesis: true location shift is not equal to 0
• The package exactRankTests is required to conduct a better (exact) test when there are ties in the data; see the sketch below.
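A sketch of that exact alternative (assuming the exactRankTests package is installed; its wilcox.exact() mirrors the wilcox.test() interface):
> library(exactRankTests)
> wilcox.exact(A, B) # exact Wilcoxon rank sum test, valid in the presence of ties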
Nonparametric Tests of Group Differences
R provides functions for carrying out Mann-Whitney U, Wilcoxon Signed Rank, Kruskal-Wallis, and Friedman tests.
• # independent 2-group Mann-Whitney U Test
wilcox.test(y~A) # where y is numeric and A is a binary factor
• # independent 2-group Mann-Whitney U Test
wilcox.test(y,x) # where y and x are numeric
• # dependent 2-group Wilcoxon Signed Rank Test
wilcox.test(y1,y2,paired=TRUE) # where y1 and y2 are numeric
• # Kruskal-Wallis Test: One-Way ANOVA by Ranks
kruskal.test(y~A) # where y is numeric and A is a factor
• # Randomized Block Design - Friedman Test
friedman.test(y~A|B) # where y are the data values, A is a grouping factor
# and B is a blocking factor
χ2 test for I × J contingency table
Consider the following two categorical variables:
x<-as.factor(c("Milk","Milk","Milk","Milk","Tea","Tea","Tea","Tea"))
y<-as.factor(c("Milk","Milk","Milk","Tea","Milk","Tea","Tea","Tea"))
These vectors of categorical variables are converted into a contingency table by R as follows:
table(x,y)
y
x Milk Tea
Milk 3 1
Tea 1 3
For making the test, we use the following R code:
chisq.test(x,y)
R output:
Pearson's Chi-squared test with Yates' continuity correction # read more on Yates' continuity correction
data: x and y
X-squared = 0.5, df = 1, p-value = 0.4795
• Since the p-value (0.4795) exceeds 0.05, there is no significant association between the two variables.
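To see the effect of Yates' continuity correction (a quick illustrative sketch; correct = FALSE disables it for 2 × 2 tables):
chisq.test(x, y, correct = FALSE) # Pearson chi-square without the continuity correction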
Correlation coefficients for continuous variables
• x <- c(1,2,3,5,7,9)
• y <- c(3,2,5,6,8,11)

>cor.test(x, y, method="pearson")
• If the linearity of a relationship or the normality of the residuals is doubtful, a rank correlation test can be carried out. Most often, Spearman’s rank correlation coefficient is used:
>cor.test(x, y, method="spearman")
Exercises
• 5.1 Do the values of the react data set (notice that this is a single vector, not a data frame) look reasonably normally distributed? Does the mean differ significantly from zero according to a t test?
• 5.2 In the data set vitcap, use a t test to compare the vital capacity for the two groups. Calculate a 99% confidence interval for the difference. The result of this comparison may be misleading. Why?
• 5.3 Perform the analyses of the react and vitcap data using nonparametric techniques.
• 5.4 Perform graphical checks of the assumptions for a paired t test.
