R Unit 4
PROBABILITY DISTRIBUTION:
The probability distribution of a random variable (x) shows how the probabilities
of the events are distributed over the different values of the random variable. When
the probabilities are plotted against the values of the random variable, they trace
out a characteristic shape.
Binomial Distribution:
The binomial distribution is a discrete probability distribution. It describes the
outcome of n independent trials in an experiment. Each trial is assumed to have only
two outcomes, either success or failure. If the probability of a successful trial is p,
then the probability of having x successful outcomes in an experiment of n
independent trials is as follows.
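$P(x) = \frac{n!}{(n-x)!\,x!}\, p^x q^{n-x}$, where $q = 1 - p$
For example, with n = 5, x = 2 and p = 0.167:
$P(2) = {}^{5}C_{2} \times (0.167)^2 \times (0.833)^3$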
R code: dbinom(2,size=5,prob=0.167)
Types of problems:
dbinom: It is used to find the probability of a specific number of successes in a
fixed number of independent Bernoulli trials, P(X = x)
pbinom: It gives the cumulative probability that the number of successes is less
than or equal to a specified value, P(X <= q); with lower.tail=FALSE it gives P(X > q)
qbinom: It gives the smallest value for which the cumulative probability is greater
than or equal to a specified probability
rbinom: It is used to simulate random experiments based on the binomial
distribution
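The four functions can be tried side by side. The sketch below reuses n = 5 and p = 0.167 from the dbinom example above; the variable names and the fixed seed are illustrative assumptions, not part of the original notes.

```r
# Binomial distribution with n = 5 trials and success probability p = 0.167
n <- 5
p <- 0.167

# dbinom: P(X = 2), compared with the formula n!/((n-x)! x!) * p^x * q^(n-x)
d_val <- dbinom(2, size = n, prob = p)
formula_val <- choose(n, 2) * p^2 * (1 - p)^(n - 2)

# pbinom: cumulative probability P(X <= 2)
p_val <- pbinom(2, size = n, prob = p)

# qbinom: smallest x whose cumulative probability is at least 0.5
q_val <- qbinom(0.5, size = n, prob = p)

# rbinom: simulate 10 experiments (seed fixed only for reproducibility)
set.seed(1)
r_val <- rbinom(10, size = n, prob = p)

print(c(d_val, formula_val, p_val, q_val))
print(r_val)
```

dbinom and the hand-written formula should print the same value, which is a quick way to confirm the formula was transcribed correctly.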
Problem: In a restaurant, seventy percent of people order Chinese food and thirty
percent order Italian food. A group of three people enters the restaurant. Find
the probability that at least 2 of them order Italian food
Solution: Let x be the number of people ordering Italian food, so n = 3 and p = 0.3.
Hence the probability of at least two persons ordering Italian food is,
P(x>=2) = P(x=2) + P(x=3)
= 0.189 + 0.027
=0.216
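The same answer can be checked in R. This is a minimal sketch; the variable names are illustrative. It computes the sum of the two point probabilities and the equivalent upper-tail probability.

```r
p_italian <- 0.3   # probability that one person orders Italian food

# P(X = 2) + P(X = 3) for n = 3 people
p_manual <- dbinom(2, size = 3, prob = p_italian) +
  dbinom(3, size = 3, prob = p_italian)

# Equivalent upper tail: P(X > 1) = P(X >= 2)
p_tail <- pbinom(1, size = 3, prob = p_italian, lower.tail = FALSE)

print(c(p_manual, p_tail))
```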
Problem: Suppose there are twelve multiple choice questions in an English class quiz.
Each question has five possible answers, and only one of them is correct. Find
the probability of having four or fewer correct answers if a student attempts to
answer every question at random.
Solution: Since only one out of five possible answers is correct, the probability of
answering a question correctly at random is 1/5=0.2
R code:
dbinom(0,size=12,prob=0.2)+dbinom(1,size=12,prob=0.2)+dbinom(2,size=12,
prob=0.2)+dbinom(3,size=12,prob=0.2)+dbinom(4,size=12,prob=0.2)
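The sum of five dbinom terms is a cumulative probability, so pbinom gives the same result in one call. A short sketch with illustrative variable names:

```r
# Sum of the point probabilities P(X = 0), ..., P(X = 4)
p_sum <- sum(dbinom(0:4, size = 12, prob = 0.2))

# Cumulative probability P(X <= 4) in a single call
p_cum <- pbinom(4, size = 12, prob = 0.2)

print(c(p_sum, p_cum))
```

Both values are approximately 0.9274.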
Poisson Distribution:
The Poisson distribution is the probability distribution of independent event
occurrences in an interval. If λ is the mean occurrence per interval, then the
probability of having x occurrences within a given interval is:
$p(x) = \frac{\lambda^x e^{-\lambda}}{x!}$
Examples:
1. The number of defective electric bulbs manufactured by a reputed company
2. The number of telephone calls per minute at a switch board
3. The number of cars passing a certain point in one minute
4. The number of printing mistakes per page in a large text
R has four built-in functions for the Poisson distribution. They are described
below:
dpois(x,lambda,log=FALSE) : This function gives the probability P(X = x) at each
point x
ppois(q,lambda,lower.tail=TRUE,log.p=FALSE) : This function gives the
cumulative probability P(X <= q). It is a single value representing the
probability; with lower.tail=FALSE it gives P(X > q)
qpois(p,lambda,lower.tail=TRUE,log.p=FALSE) : This function takes the
probability value and gives the smallest number whose cumulative probability is
at least that value
rpois(n,lambda) : This function generates n random values from a Poisson
distribution with mean lambda
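A short sketch of the four Poisson functions follows. The rate lambda = 2, the variable names and the fixed seed are assumptions chosen only for illustration.

```r
lambda <- 2   # hypothetical mean number of occurrences per interval

# dpois: P(X = 3), compared with the formula lambda^x * exp(-lambda) / x!
d_val <- dpois(3, lambda)
formula_val <- lambda^3 * exp(-lambda) / factorial(3)

# ppois: cumulative probability P(X <= 3)
p_val <- ppois(3, lambda)

# qpois: smallest x whose cumulative probability is at least 0.5
q_val <- qpois(0.5, lambda)

# rpois: 5 random draws (seed fixed only for reproducibility)
set.seed(1)
r_val <- rpois(5, lambda)

print(c(d_val, formula_val, p_val, q_val))
print(r_val)
```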
Problem: If there are twelve cars crossing a bridge per minute on average, find the
probability of having seventeen or more cars crossing the bridge in a
particular minute
Solution: The probability of having 16 or fewer cars crossing the bridge in a particular
minute is given by the function ppois
Hence the probability of having seventeen or more cars crossing the bridge in
a minute is the upper tail of the distribution, P(X > 16)
ppois(16,lambda=12,lower.tail=FALSE)
Problem: The average number of homes sold by the Acme Realty Company is 2
homes per day. What is the probability that exactly 3 homes will be sold
tomorrow?
Solution: This is a Poisson experiment in which we know the following:
μ = 2, since 2 homes are sold per day on average
x = 3, since we want to find the likelihood that 3 homes will be sold tomorrow
e = 2.71828, since e is a constant approximately equal to 2.71828
Poisson formula
$P(x;\mu) = (e^{-\mu})(\mu^{x})/x!$
$= (2.71828^{-2})(2^{3})/3!$
= (0.13534)(8)/6
= 0.180
R code:
dpois(3,lambda=2)
Problem: Suppose the average number of lions seen on a 1-day safari is 5. What is the
probability that tourists will see fewer than four lions on the next 1-day safari?
Solution: To solve this problem, we need to find the probability that tourists will see
0, 1, 2 or 3 lions. Thus, we need to calculate the sum of four probabilities:
$P(x \le 3; 5) = P(0;5) + P(1;5) + P(2;5) + P(3;5)$
$= [(e^{-5})(5^{0})/0!] + [(e^{-5})(5^{1})/1!] + [(e^{-5})(5^{2})/2!] + [(e^{-5})(5^{3})/3!]$
= [0.0067] + [0.03369] + [0.084224] + [0.140374]
= 0.2650
Thus the probability of seeing no more than 3 lions is 0.2650
R code:
>ppois(3,lambda=5)
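The four-term sum and the single ppois call can be compared directly. A minimal sketch with illustrative variable names:

```r
# P(0;5) + P(1;5) + P(2;5) + P(3;5), term by term
terms <- dpois(0:3, lambda = 5)
p_sum <- sum(terms)

# Cumulative probability P(X <= 3) in one call
p_cum <- ppois(3, lambda = 5)

print(round(terms, 6))
print(c(p_sum, p_cum))
```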
Normal Distribution:
A continuous random variable x follows a normal distribution with mean μ
and variance σ² if its probability density function is
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Standard normal distribution:
It is the distribution that occurs when a normal random variable has a mean
of zero and a standard deviation of one.
The normal random variable of a standard normal distribution is called a standard
score or a Z score. Every normal random variable x can be transformed into a Z score
via the following equation
z = (x − μ)/σ
Where x is a normal random variable, μ is the mean, and σ is the standard deviation,
yielding
$P(x)\,dx = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz$
Mean=median=mode
Symmetry about the center
50% of values less than the mean and 50% greater than the mean
R functions:
dnorm(x,mean=0,sd=1,log=FALSE) : This function gives the value of the probability
density function at each point x.
pnorm(q,mean=0,sd=1,lower.tail=TRUE,log.p=FALSE) : This function gives the
cumulative probability P(X <= q). It is a single value representing the
probability; with lower.tail=FALSE it gives P(X > q)
qnorm(p,mean=0,sd=1,lower.tail=TRUE,log.p=FALSE) : This function takes the
probability value and gives the number whose cumulative probability matches the
probability value
rnorm(n,mean=0,sd=1) : This function generates n random values from a normal
distribution with the given mean and standard deviation
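The sketch below exercises the four normal-distribution functions. The chosen inputs (0, 1.96, 0.975, and mean 12 with sd 2 for the random draws) and the variable names are illustrative assumptions.

```r
# dnorm: density of the standard normal at 0, which equals 1/sqrt(2*pi)
d_val <- dnorm(0)

# pnorm: cumulative probability P(Z <= 1.96), roughly 0.975
p_val <- pnorm(1.96)

# qnorm: inverse of pnorm, the value with cumulative probability 0.975
q_val <- qnorm(0.975)

# rnorm: 5 random draws with mean 12 and sd 2 (seed fixed for reproducibility)
set.seed(1)
r_val <- rnorm(5, mean = 12, sd = 2)

print(c(d_val, p_val, q_val))
print(r_val)
```

pnorm and qnorm are inverses of each other, so pnorm(qnorm(0.975)) returns 0.975.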
Solution: μ = 12
σ=2
(a) Less than 7 months
P(x < 7)
For x = 7,
z = (x − μ)/σ
= (7 − 12)/2
= −2.5 (= z1, say)
Hence P(x < 7) = P(z < −2.5)
= 0.5 − 0.4938
= 0.0062
R code: pnorm(7,mean=12,sd=2)
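Standardising by hand and calling pnorm on the Z score gives the same probability as passing the mean and standard deviation directly. A minimal sketch with illustrative variable names:

```r
mu <- 12
sigma <- 2
x <- 7

# Z score from z = (x - mu) / sigma
z <- (x - mu) / sigma

# Probability from the original scale and from the standard normal
p_direct <- pnorm(x, mean = mu, sd = sigma)
p_standard <- pnorm(z)

print(c(z, p_direct, p_standard))
```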
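Problem: Assume that the test scores of a college entrance exam fit a normal
distribution. Furthermore, the mean test score is 72, and the standard
deviation is 15.2. What is the percentage of students scoring 84 or more in
the exam?
Solution: The percentage of students scoring 84 or more is the upper tail
probability P(x >= 84) of the normal distribution, which is about 21.5%.
R code: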
pnorm(84,mean=72,sd=15.2, lower.tail=FALSE)
CORRELATION:
A correlation is a relationship between two variables. Typically, we take x to
be the independent variable. We take y to be the dependent variable. Data is
represented by a collection of ordered pairs (x,y)
$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
In R, the default method is "pearson", so you may omit the method argument if that is
what you want. If you specify "kendall" or "spearman", then you will get the
corresponding rank correlation and, with cor.test, the appropriate significance test.
Problem: The local ice cream shop keeps track of how much ice cream they sell
versus the temperature on that day, here are their figures for the last 12 days
Temp (℃):             14.2  16.4  11.9  15.2  18.5  22.1  19.4  25.1  23.4  18.1  22.6  17.2
Ice cream sales ($):  215   325   185   332   406   522   412   614   544   421   445   408
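The correlation for the ice cream data can be computed in R. This is a sketch only: the vector names temp and sales are illustrative, and the manual line simply restates the correlation formula to compare against the built-in cor().

```r
# Temperature (deg C) and ice cream sales ($) for the 12 days
temp  <- c(14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2)
sales <- c(215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408)

# Built-in Pearson correlation (the default method)
r_builtin <- cor(temp, sales)

# Same quantity from the definition of r_xy
dx <- temp - mean(temp)
dy <- sales - mean(sales)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(c(r_builtin, r_manual))
```

The coefficient is close to 1, indicating a strong positive linear relationship between temperature and sales.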
Syntax:
> cov(x, y, method)
Where,
x and y represent the data vectors
method defines the type of method to be used to compute covariance
Default is "pearson"
Ex:
> x <- c(1,3,5,10)
> y <- c(2,4,6,20)
> print(cov(x,y))
> print(cov(x,y,method="pearson"))
> print(cov(x,y,method="kendall"))
> print(cov(x,y,method="spearman"))
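To see what cov() computes, the default (Pearson) covariance can be reproduced from its definition with the n − 1 denominator. A minimal sketch on the same two vectors; the helper variable names are illustrative.

```r
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)

# Built-in sample covariance
cov_builtin <- cov(x, y)

# Definition: sum of products of deviations divided by (n - 1)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

# Correlation is the covariance scaled by both standard deviations
r_from_cov <- cov_builtin / (sd(x) * sd(y))

print(c(cov_builtin, cov_manual, r_from_cov))
```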
$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
To evaluate whether the difference is statistically significant, you first have to look
up in the t table the critical value of Student's t distribution corresponding to the
significance level alpha of your choice (for example 5%). The degrees of freedom (df)
used in this test are df = n − 1
Problem: A professor wants to know if her introductory statistics class has a good
grasp of basic math. Six students are chosen at random from the class and
given a math proficiency test. The professor wants the class to be able to
score above 70 on the test. The six students get scores of 62, 92, 75, 68, 83 and
95. Can the professor have 90 percent confidence that the mean score for the
class on the test would be above 70?
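Solution:
Null Hypothesis H0: μ = 70
Alternative Hypothesis Ha: μ > 70
Level of significance: α = 0.10 (90 percent confidence)
First, compute the sample mean and standard deviation
$\bar{x} = \frac{62+92+75+68+83+95}{6} = \frac{475}{6} = 79.17$ and standard deviation s = 13.17
The test statistic is $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
$t = \frac{79.17 - 70}{13.17/\sqrt{6}} = \frac{9.17}{5.38} = 1.71$
With df = 6 − 1 = 5, the one-tailed table value of t at α = 0.10 is 1.476. Since the
calculated t is more than the table value, the null hypothesis is rejected: the
professor can have 90 percent confidence that the class mean is above 70.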
R code:
> x <- c(62,92,75,68,83,95)
> t.test(x,alternative="greater",mu=70,conf.level=0.90)
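The test statistic can also be computed by hand and compared with t.test. In this sketch the vector name scores is illustrative, and the one-sided alternative "greater" is chosen because the alternative hypothesis is μ > 70.

```r
scores <- c(62, 92, 75, 68, 83, 95)
mu0 <- 70

# t = (sample mean - mu0) / (s / sqrt(n))
t_manual <- (mean(scores) - mu0) / (sd(scores) / sqrt(length(scores)))

# One-sided, one-sample t-test for Ha: mu > 70
res <- t.test(scores, alternative = "greater", mu = mu0)
t_builtin <- unname(res$statistic)
p_value <- res$p.value

print(c(t_manual, t_builtin, p_value))
```

A p-value below 0.10 corresponds to rejecting the null hypothesis at the 90 percent confidence level.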
Problem: A sample of 26 bulbs gives a mean life of 990 hours with a S.D. of 20 hours.
The manufacturer claims that the mean life of the bulbs is 1000 hours. Does the
sample meet the standard?
Solution:
Here n = 26
Sample mean $\bar{x}$ = 990 hours
S.D. s = 20 hours
Population mean μ = 1000 hours
df = n − 1
= 26 − 1
= 25
Null Hypothesis H0: The sample meets the standard, i.e., μ = 1000 hours
Alternative Hypothesis Ha: μ not equal to 1000
Level of significance α = 0.05
The test statistic is
$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{990 - 1000}{20/\sqrt{26}} = -2.55$
so |t| = 2.55 (calculated value of t)
The two-tailed table value of t with 25 df at the 5% level is 2.060 (the one-tailed
value is 1.708)
The calculated value of |t| is more than the table value of t, so the null hypothesis is
rejected at the 5% level
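Because only summary statistics are given, t.test cannot be called on raw data, but the statistic, the critical value and the p-value can be computed directly. A sketch under the assumption of a two-tailed test at the 5% level; the variable names are illustrative.

```r
n <- 26        # sample size
xbar <- 990    # sample mean (hours)
s <- 20        # sample standard deviation (hours)
mu0 <- 1000    # claimed population mean (hours)

# Test statistic t = (xbar - mu0) / (s / sqrt(n))
t_stat <- (xbar - mu0) / (s / sqrt(n))
dof <- n - 1

# Two-tailed critical value at the 5% level and two-tailed p-value
t_crit <- qt(0.975, dof)
p_value <- 2 * pt(-abs(t_stat), dof)

print(c(t_stat, t_crit, p_value))
```

The null hypothesis is rejected when abs(t_stat) exceeds t_crit, or equivalently when p_value is below 0.05.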
$t = \frac{\bar{d} - \mu}{s/\sqrt{n}}$
Problem: The blood pressure of 5 women before and after intake of a certain drug
are given below: Test whether there is significant change in blood pressure at
1% level of significance
Before 110 120 125 132 125
After 120 118 125 136 121
Solution: Let μ be the mean of the population of differences
Null Hypothesis H0: μ1 = μ2, i.e., no change in BP
Alternative Hypothesis Ha: μ1 ≠ μ2, i.e., a change in BP
Level of significance α = 0.01
Computation: The differences d_i (before minus after drug) are −10, 2, 0, −4, 4
$\bar{d} = \frac{-10 + 2 + 0 + (-4) + 4}{5} = \frac{-8}{5} = -1.6$
$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2 = \frac{1}{4} \sum_{i=1}^{5} (d_i - \bar{d})^2$
$= \frac{1}{4} [(-10 + 1.6)^2 + (2 + 1.6)^2 + (0 + 1.6)^2 + (-4 + 1.6)^2 + (4 + 1.6)^2]$
$= \frac{123.20}{4} = 30.8$
$S = \sqrt{30.8} = 5.55$
Under H0 (μ = 0), $t = \frac{\bar{d}}{S/\sqrt{n}} = \frac{-1.6}{5.55/\sqrt{5}} = -0.645$
Calculated |t| value is 0.645
The tabulated value t0.01 with 5 − 1 = 4 degrees of freedom is 3.747 (the two-tailed
1% value is 4.604)
Since the calculated |t| is less than the tabulated value, we accept the null
hypothesis and conclude that there is no significant change in blood pressure
R code:
>x<-c(110,120,125,132,125)
>y<-c(120,118,125,136,121)
>t.test(x,y,paired=TRUE)
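The paired test works on the differences, so the hand computation can be reproduced step by step and compared with t.test. The names before, after and d are illustrative.

```r
before <- c(110, 120, 125, 132, 125)
after  <- c(120, 118, 125, 136, 121)

# Differences (before minus after), their mean and variance
d <- before - after
d_bar <- mean(d)
s2 <- var(d)

# Paired t statistic under H0: mean difference = 0
t_manual <- d_bar / (sqrt(s2) / sqrt(length(d)))

# Same test with the built-in function
res <- t.test(before, after, paired = TRUE)

print(c(d_bar, s2, t_manual, unname(res$statistic), res$p.value))
```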
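$t = \frac{\bar{x} - \bar{y}}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1+n_2-2}$
Where
$S^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}$
Or
$S^2 = \frac{\sum (x_i - \bar{x})^2 + \sum (y_i - \bar{y})^2}{n_1 + n_2 - 2}$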
Problem: Two horses A and B were tested according to the time (in seconds) to
run a particular track with the following results
Horse A 28 30 32 33 33 29 34
Horse B 29 30 30 24 27 29
Test whether the two horses have the same running capacity
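Solution:
Given n1 = 7 and n2 = 6
We first compute the sample means and the pooled standard deviation.
$\bar{x}$ = Mean of the first sample
$= \frac{1}{7}(28 + 30 + 32 + 33 + 33 + 29 + 34) = \frac{1}{7}(219) = 31.286$
$\bar{y}$ = Mean of the second sample
$= \frac{1}{6}(29 + 30 + 30 + 24 + 27 + 29) = \frac{1}{6}(169) = 28.167$
$S^2 = \frac{\sum (x_i - \bar{x})^2 + \sum (y_i - \bar{y})^2}{n_1 + n_2 - 2} = \frac{31.4286 + 26.8333}{7 + 6 - 2} = 5.30$
Therefore: $S = \sqrt{5.30} = 2.3$
$t = \frac{31.286 - 28.167}{(2.3)\sqrt{\frac{1}{7} + \frac{1}{6}}} = 2.44$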
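The pooled two-sample statistic can be checked in R. This is a sketch: the vector names are illustrative, var.equal = TRUE is used because the formula pools the two variances, and judging the result at the 5% level is an assumption, since the problem does not state a significance level.

```r
horse_a <- c(28, 30, 32, 33, 33, 29, 34)
horse_b <- c(29, 30, 30, 24, 27, 29)
n1 <- length(horse_a)
n2 <- length(horse_b)

# Pooled variance from the sums of squared deviations
s2_pooled <- (sum((horse_a - mean(horse_a))^2) +
              sum((horse_b - mean(horse_b))^2)) / (n1 + n2 - 2)

# t = (mean A - mean B) / (S * sqrt(1/n1 + 1/n2))
t_manual <- (mean(horse_a) - mean(horse_b)) /
  (sqrt(s2_pooled) * sqrt(1 / n1 + 1 / n2))

# Built-in two-sample t-test with pooled variance
res <- t.test(horse_a, horse_b, var.equal = TRUE)

print(c(s2_pooled, t_manual, unname(res$statistic), res$p.value))
```

If the p-value is below the chosen level, the hypothesis that the two horses have the same running capacity is rejected.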
ANOVA:(Analysis of variance)
When we have only two samples we can use the t-test to compare the means
of the samples, but it might become unreliable in case of more than two samples. If
we only compare two means, then the t-test (independent samples) will give the
same results as the ANOVA. ANOVA is performed with an F-test.
Null Hypothesis H0: There are no differences among the mean values of the groups
being compared (i.e., the group means are all equal)
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
Alternative Hypothesis H1 (the conclusion if H0 is rejected):
Not all group means are equal (i.e., at least one group mean is different from the rest)
Error row of the ANOVA table: degrees of freedom n − K, sum of squares
$SSE = SST - SSTR$, mean square $s_E^2 = \frac{SSE}{n - K}$
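A one-way ANOVA can be run in R with aov(). The data below are hypothetical, invented only to keep the arithmetic easy to follow (group means 2, 3 and 7); they do not come from the original notes.

```r
# Hypothetical data: three groups (K = 3) of three observations each (n = 9)
values <- c(1, 2, 3,  2, 3, 4,  6, 7, 8)
group  <- factor(rep(c("A", "B", "C"), each = 3))

# Fit the one-way ANOVA model and extract the ANOVA table
fit <- aov(values ~ group)
tab <- summary(fit)[[1]]

f_value  <- tab[["F value"]][1]   # F statistic for the group effect
sse      <- tab[["Sum Sq"]][2]    # error (residual) sum of squares
df_error <- tab[["Df"]][2]        # error degrees of freedom, n - K

print(tab)
print(c(f_value, sse, df_error))
```

A large F value with a small p-value in the table leads to rejecting H0 that all group means are equal.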
BASIC STATISTICS:
Average: a number expressing the central or typical value in a set of data, in
particular the mean, median or mode; most commonly the mean, which is calculated
by dividing the sum of the values in the set by their number. The basic formula for
the average of n numbers x1, x2, ..., xn is
A = (x1 + x2 + ... + xn)/n
Ex:
Suppose there are 8 data points 2, 4, 4, 4, 5, 5, 7, 9. The average of these 8 data points is,
A = (2+4+4+4+5+5+7+9)/8
= 40/8
= 5
R code:
list=c(2,4,4,4,5,5,7,9)
print(mean(list))
Example: Let’s consider the same dataset that we have taken in average. First,
calculate the deviations of each data point from the mean, and square the result of
each.
$(2-5)^2 = (-3)^2 = 9$
$(4-5)^2 = (-1)^2 = 1$
$(4-5)^2 = (-1)^2 = 1$
$(4-5)^2 = (-1)^2 = 1$
$(5-5)^2 = 0^2 = 0$
$(5-5)^2 = 0^2 = 0$
$(7-5)^2 = 2^2 = 4$
$(9-5)^2 = 4^2 = 16$
Variance = (9+1+1+1+0+0+4+16)/8
= 32/8
= 4
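Computing variance in R programming
Syntax: var(x)
Note: var() divides by n − 1 (the sample variance), so for this dataset it returns
32/7 ≈ 4.571 rather than the population variance 4 computed above.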
R code:
>list=c(2,4,4,4,5,5,7,9)
>print(var(list))
R code:
>list=c(2,4,4,4,5,5,7,9)
>sd(list)
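var() and sd() use the sample formulas with denominator n − 1, while the worked example divides by n. The sketch below computes both versions for the same dataset; the variable names are illustrative.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Population variance and standard deviation (divide by n)
pop_var <- sum((x - mean(x))^2) / n
pop_sd <- sqrt(pop_var)

# Sample variance and standard deviation (divide by n - 1), as used by var() and sd()
sample_var <- var(x)
sample_sd <- sd(x)

print(c(pop_var, pop_sd, sample_var, sample_sd))
```

The population values are 4 and 2; the sample values are 32/7 and its square root.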
Mean:
It is the sum of observations divided by the total number of observations. It is
also defined as average which is the sum divided by count
$\text{Mean}(\bar{x}) = \frac{\sum x}{n}$
R code:
>x<-c(2,4,4,4,5,5,7,9)
>mean(x)
Median:
It is the middle value of the data set. It splits the data into two halves. If the
number of elements in the data set is odd then the center element is median and if it
is even then the median value would be the average of two central elements.
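Odd: the median is the $\frac{n+1}{2}$-th value
Even: the median is the average of the $\frac{n}{2}$-th and $\left(\frac{n}{2} + 1\right)$-th values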
R code:
>x<-c(2,4,4,4,5,5,7,9)
>median(x)
Mode:
It is the value that has the highest frequency in the given data set. The data
set may have no mode if the frequency of all data points is the same. Also, we can
have more than one mode if we encounter two or more data points having the same
frequency. There is no built-in function for finding the mode in R, so we can create
our own function for finding the mode or we can use the package called modeest.
R code:
>mode<-function(v){
+  uniqv<-unique(v)
+  uniqv[which.max(table(match(v,uniqv)))]
+}
>v<-c(2,4,4,4,5,5,7,9)
>result<-mode(v)
>print(result)