
R Module 5

Uploaded by

kingofera6890


CHAPTER 5

Statistical Analysis Using R

OBJECTIVES

On completion of this chapter you will be able to:

• obtain basic statistical measures such as the mean, median, mode, standard deviation and variance using R
• obtain the summary statistics of a given dataset
• understand and plot the normal distribution of data using R functions
• understand and plot the binomial distribution of data using R functions
• perform correlation analysis on the given data using R
• perform regression analysis on the given data using R
• perform ANOVA and ANCOVA on the given data using R
• perform chi-square and hypothesis testing on the given data using R

Statistical analysis in R is performed using many built-in functions. Many of
these functions are part of the R base package. They take an R vector as
input, along with other arguments, and return the result. The other important R package
for statistical analysis is the stats package, which is also loaded by default.

5.1. Basic Statistical Measures

Any dataset that is available in R, or that has been imported into R for further analysis,
typically contains both categorical and numeric data. The statistical functions
available in R can be applied to the numeric fields to understand their statistical
measures. The basic statistical measures are the minimum, maximum,
mean and median, computed by the functions min(), max(), mean() and median()
142 R Programming — An Approach for Data Analytics

respectively. Let us use the dataset named mtcars that is available in R by default to
understand these statistical measures.
> data(mtcars)
> colnames(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
> min(mtcars$cyl)
[1] 4
> max(mtcars$cyl)
[1] 8
> mean(mtcars$cyl)
[1] 6.1875
> median(mtcars$cyl)
[1] 6

All of the above results can also be obtained with the single function summary(), which
additionally reports the first and third quartiles and can be applied to all the fields
of a dataset at once. The range() function returns the minimum and maximum values
of a numeric field in one call.
> summary(mtcars$cyl)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   4.000   6.000   6.188   8.000   8.000
> range(mtcars$cyl)
[1] 4 8

5.1.1. Mean

The mean is calculated by taking the sum of the values and dividing it by the number of
values in a data series. The function mean() is used to calculate this in R. The basic
syntax for calculating the mean in R is given below along with its parameters.
mean(x, trim = 0, na.rm = FALSE, ...)
x - numeric vector
trim - fraction (0 to 0.5) of observations to drop from each end of the sorted vector
na.rm - logical value; if TRUE, missing values are removed from the input vector

> x <- c(45, 56, 78, 12, 3, -91, -45, 15, 1, 24)
> mean(x)
[1] 9.8

When the trim parameter is supplied, the values in the vector are sorted and the
given fraction of observations is dropped from each end before the mean is calculated.
With trim = 0.2 and 10 values, 10 × 0.2 = 2 values are dropped from each end.
In this case the sorted vector is (-91, -45, 1, 3, 12, 15, 24, 45, 56, 78), and
the values removed are (-91, -45) from the left and (56, 78) from the right.
The mean of the remaining six values is 100 / 6 = 16.66667.
> mean(x, trim = 0.2)
[1] 16.66667
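
As a cross-check, the trimmed mean can be reproduced by hand. The following is a minimal sketch that assumes the same vector x as above; the names n, k, kept, manual and builtin are illustrative and are not part of the original example.

```r
# Reproduce mean(x, trim = 0.2) by hand (illustrative sketch)
x <- c(45, 56, 78, 12, 3, -91, -45, 15, 1, 24)
n <- length(x)
k <- floor(n * 0.2)                # number of values dropped from each end
kept <- sort(x)[(k + 1):(n - k)]   # values that remain after trimming
manual <- mean(kept)               # mean of the remaining values
builtin <- mean(x, trim = 0.2)     # the same calculation done by mean()
print(kept)
print(c(manual = manual, builtin = builtin))
```

Both values should agree, which confirms that trim is a fraction of the observations and not a count.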

If there are missing values, then the mean() function returns NA. To drop the
missing values from the calculation use na.rm = TRUE, which means remove the
NA values.
> x <- c(45, 56, 78, 12, 3, -91, NA, -45, 15, 1, 24, NA)
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 9.8

5.1.2. Median

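The middle value of a sorted data series is called the median. When the series has an
even number of values, the median is the average of the two middle values. The median()
function is used in R to calculate this value. The basic syntax for calculating the median
in R is given below along with its parameters.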
median(x, na.rm = FALSE)
x - numeric vector
na.rm - to remove the missing values from the input vector

> x <- c(45, 56, 78, 12, 3, -91, -45, 15, 1, 24)
> median(x)
[1] 13.5

5.1.3. Mode

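The mode is the value that has the highest number of occurrences in a set of data.
Unlike the mean and median, the mode can be found for both numeric and character data.
R does not have a standard built-in function to calculate the statistical mode (the
built-in mode() function returns the storage mode of an object instead). So we create a
user-defined function to calculate the mode of a data set in R. This function takes a
vector as input and gives the mode value as output. If several values are tied for the
highest count, it returns the one that appears first.
Mode <- function(x)
{
  # distinct values of x, in order of first appearance
  y <- unique(x)
  # count how often each distinct value occurs and return the most frequent one
  y[which.max(tabulate(match(x, y)))]
}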

> x <- c(1,2,3,4,5,5,5)
> Mode(x)
[1] 5
> ch <- c("a", "e", "i", "o", "u", "u", "a", "a")
> Mode(ch)
[1] "a"

The function unique() returns a vector, data frame or array like x but with
duplicate elements/rows removed. The function match() returns a vector of
the positions of (first) matches of its first argument in its second. The function
tabulate() takes the integer-valued vector bin and counts the number of times each
integer occurs in it. The function which.max() determines the location, i.e., index
of the (first) maximum of a numeric (or logical) vector.
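
Because which.max() reports only the first maximum, a data set with tied modes yields a single value from the function above. One possible alternative, sketched here under the assumption that all tied values are wanted, uses table(). The helper name Modes is hypothetical, and its result is always a character vector because it is taken from the names of the table.

```r
# Return every value that shares the highest count (hypothetical helper)
Modes <- function(x) {
  counts <- table(x)                                 # frequency of each distinct value
  names(counts)[as.vector(counts) == max(counts)]    # all values with the top count
}

two_modes <- Modes(c(1, 2, 2, 3, 3))                 # 2 and 3 both occur twice
one_mode <- Modes(c("a", "e", "i", "o", "u", "u", "a", "a"))
print(two_modes)
print(one_mode)
```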

5.1.4. Standard Deviation and Variance

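The functions to calculate the standard deviation, the variance and the median absolute
deviation are sd(), var() and mad() respectively. Both sd() and var() use the sample
formulas (with n - 1 in the denominator), and mad() multiplies the median absolute
deviation by the constant 1.4826 by default.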
> sd(mtcars$cyl)
[1] 1.785922
> var(mtcars$cyl)
[1] 3.189516
> mad(mtcars$cyl)
[1] 2.9652
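
The relationships between these three measures can be verified directly. This is a small sketch on the same mtcars$cyl column; the variable names are illustrative.

```r
# Check how sd(), var() and mad() relate to each other (illustrative sketch)
cyl <- mtcars$cyl
variance <- var(cyl)
std_dev <- sd(cyl)

# mad() is the median of the absolute deviations from the median,
# scaled by the default constant 1.4826
manual_mad <- 1.4826 * median(abs(cyl - median(cyl)))

print(c(sqrt_of_var = sqrt(variance), sd = std_dev))
print(c(manual_mad = manual_mad, mad = mad(cyl)))
```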

5.1.5. Quartile Ranges

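The quantile() function returns the quartiles of a numeric vector by default. An alternative
function is fivenum(), which returns Tukey's five-number summary (minimum, lower hinge,
median, upper hinge, maximum); the hinges can differ slightly from the quartiles for some
data. The IQR() function returns the interquartile range, that is, the difference between
the third and first quartiles.
> quantile(mtcars$cyl)
  0%  25%  50%  75% 100%
   4    4    6    8    8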
> fivenum(mtcars$cyl)
[1] 4 4 6 8 8
> IQR(mtcars$cyl)
[1] 4
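
The interquartile range can also be derived from quantile(). The sketch below uses the mpg column instead of cyl, purely as an assumption to obtain non-integer quartiles; the variable names are illustrative.

```r
# IQR() equals the 75% quantile minus the 25% quantile (illustrative sketch)
q <- quantile(mtcars$mpg, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])   # drop the "75%" name from the result
iqr_builtin <- IQR(mtcars$mpg)
print(q)
print(c(manual = iqr_manual, builtin = iqr_builtin))
```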

5.1.6. Other Statistical Functions

The functions cor() and cov() are used to find the correlation and the covariance between
two numeric fields respectively. In the example below, the correlation of about -0.85 shows
that there is a strong negative correlation between the two numeric fields: cars with more
cylinders tend to have lower miles per gallon.
> cor(mtcars$mpg, mtcars$cyl)
[1] -0.852162
> cov(mtcars$mpg, mtcars$cyl)
[1] -9.172379

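There are other statistical functions such as pmin() and pmax() [parallel, that is
element-wise, equivalents of min() and max() respectively], cummin() [cumulative minimum],
cummax() [cumulative maximum], cumsum() [cumulative sum] and cumprod()
[cumulative product]. Note that pmin() and pmax() compare two or more vectors element by
element; when called with a single vector, as below, they return it unchanged.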
> nrow(mtcars)
[1] 32
> mtcars$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> pmin(mtcars$cyl)
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> pmax(mtcars$cyl)
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> cummin(mtcars$cyl)
[1] 6 6 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
> cummax(mtcars$cyl)
[1] 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
> cumsum(mtcars$cyl)
[1] 6 12 16 22 30 36 44 48 52 58 64 72 80 88 96 104 112 116 120 124
[21] 128 136 144 152 160 164 168 172 180 186 194 198
> cumprod(mtcars$cyl)
 [1] 6.000000e+00 3.600000e+01 1.440000e+02 8.640000e+02 6.912000e+03 4.147200e+04
 [7] 3.317760e+05 1.327104e+06 5.308416e+06 3.185050e+07 1.911030e+08 1.528824e+09
[13] 1.223059e+10 9.784472e+10 7.827578e+11 6.262062e+12 5.009650e+13 2.003860e+14
[19] 8.015440e+14 3.206176e+15 1.282470e+16 1.025976e+17 8.207810e+17 6.566248e+18
[25] 5.252999e+19 2.101199e+20 8.404798e+20 3.361919e+21 2.689535e+22 1.613721e+23
[31] 1.290977e+24 5.163908e+24
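
The element-wise behaviour of pmin() and pmax() is easier to see with two vectors. The vectors a and b below are made-up values used only for illustration.

```r
# pmin() and pmax() compare vectors position by position (illustrative sketch)
a <- c(1, 7, 3, 9)
b <- c(4, 2, 8, 5)
lower <- pmin(a, b)   # smaller value at each position
upper <- pmax(a, b)   # larger value at each position
print(lower)
print(upper)
print(c(min = min(a, b), max = max(a, b)))   # min()/max() give one overall value
```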

5.2. Summary Statistics

The summary() function can also be applied to an entire dataset to get the minimum,
quartiles, median, mean and maximum of every numeric field at once.
> summary(mtcars)
      mpg             cyl             disp             hp             drat
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930
       wt             qsec             vs               am              gear
 Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000   Min.   :3.000
 1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000
 Median :3.325   Median :17.71   Median :0.0000   Median :0.0000   Median :4.000
 Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062   Mean   :3.688
 3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000
 Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000   Max.   :5.000
      carb
 Min.   :1.000
 1st Qu.:2.000
 Median :2.000
 Mean   :2.812
 3rd Qu.:4.000
 Max.   :8.000

5.3. Normal Distribution

Data collected at random from many independent sources is often approximately
normally distributed. This means that, on plotting a graph with the
value of the variable on the horizontal axis and the count of the values on the vertical
axis, we get a bell-shaped curve. The centre of the curve represents the mean of the
data set. In the graph, half of the values lie to the left of the mean and the other half lie
to the right of it. This is referred to as the normal distribution in statistics. R has
four built-in functions for working with the normal distribution. They are described below.
dnorm(x, mean, sd)
pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)

x - vector of numbers
p - vector of probabilities
n - sample size
mean - mean (default value is 0)
sd - standard deviation (default value is 1)

5.3.1. dnorm()

For a given mean and standard deviation, this function gives the height of the
probability density curve at each point in x. Below is an example in which the result
of the dnorm() function is plotted as a graph in Fig. 5.1.
> x <- seq(-5,5, by = 0.05)
> y <- dnorm(x, mean = 1.5, sd = 0.5)
> plot(x, y)

Figure 5.1 Plot of dnorm()
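
The height returned by dnorm() can be checked against the normal density formula, f(x) = exp(-(x - mean)^2 / (2 sd^2)) / (sd sqrt(2 pi)). The sketch below is illustrative only; the test point x = 2 and the variable names are assumptions, not part of the original example.

```r
# Compare dnorm() with the normal density formula at one point (illustrative sketch)
x <- 2
mu <- 1.5
sigma <- 0.5
manual <- exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
builtin <- dnorm(x, mean = mu, sd = sigma)
print(c(manual = manual, builtin = builtin))
```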

5.3.2. pnorm()

The pnorm() function returns the probability that a normally distributed random
number is less than or equal to a given value. It is also called the
cumulative distribution function (CDF). Below is an example in which the result of the
pnorm() function is plotted as a graph in Fig. 5.2.
> x <- seq(-5,5, by = 0.05)
> y <- pnorm(x, mean = 1.5, sd = 1)
> plot(x, y)

Figure 5.2 Plot of pnorm()


5.3.3. qnorm()

The qnorm() function takes a probability value as input and returns the value (quantile)
whose cumulative probability matches that probability. It is the inverse of pnorm().
Below is an example in which the result of the qnorm() function is plotted as a graph
in Fig. 5.3.
> x <- seq(0, 1, by = 0.02)
> y <- qnorm(x, mean = 2, sd = 1)
> plot(x, y)

Figure 5.3 Plot of qnorm()
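
Since qnorm() is the inverse of pnorm(), applying one after the other should return the starting value. The following sketch demonstrates this with the same mean and standard deviation as the qnorm() example; the point 2.5 and the variable names are chosen only for illustration.

```r
# pnorm() and qnorm() undo each other (illustrative sketch)
p <- pnorm(2.5, mean = 2, sd = 1)   # probability of a value <= 2.5
q <- qnorm(p, mean = 2, sd = 1)     # quantile for that probability
print(c(p = p, q = q))

# Probability of falling within one standard deviation of the mean
within_one_sd <- pnorm(1) - pnorm(-1)
print(within_one_sd)
```

The last value is the familiar figure of roughly 68% for the standard normal distribution.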

5.3.4. rnorm()

This function is used to generate random numbers whose distribution is normal. It
takes the sample size as input and generates that many random numbers. Because the
numbers are random, the result differs from run to run. We draw
a histogram to show the distribution of the generated numbers as in Fig. 5.4.
> x <- rnorm(80)
> hist(x, main = "Normal Distribution")

Figure 5.4 Histogram Using rnorm()
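
To make a random sample repeatable, the random number generator can be seeded with set.seed() before calling rnorm(). This is a minimal sketch; the seed value 42 is arbitrary and is an assumption made for illustration.

```r
# Seeding the generator makes rnorm() reproducible (illustrative sketch)
set.seed(42)                        # arbitrary seed
first <- rnorm(80, mean = 0, sd = 1)
set.seed(42)                        # same seed again
second <- rnorm(80, mean = 0, sd = 1)

print(identical(first, second))     # the two samples are the same
print(c(mean = mean(first), sd = sd(first)))
```

The sample mean and standard deviation are close to, but not exactly, the requested 0 and 1, because the sample is finite.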

5.4. Binomial Distribution

The binomial distribution models the number of successes in a fixed number of
independent trials, where each trial has only two possible outcomes and the same
probability of success. For example, tossing a coin always gives a head or a tail.
The binomial distribution can be used, for example, to find the probability
of getting exactly 3 heads when a coin is tossed 10 times. R has four
built-in functions for working with the binomial distribution. They are described below.
dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)

x - vector of numbers of successes
p - vector of probabilities
n - number of random values to generate
size - number of trials
prob - probability of success in each trial
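
The coin example mentioned above, exactly 3 heads in 10 tosses, can be worked out both with dbinom() and from the binomial formula choose(n, k) p^k (1 - p)^(n - k). The variable names in this sketch are illustrative.

```r
# Probability of exactly 3 heads in 10 tosses of a fair coin (illustrative sketch)
p_builtin <- dbinom(3, size = 10, prob = 0.5)
p_manual <- choose(10, 3) * 0.5^3 * (1 - 0.5)^7   # 120 / 1024
print(c(builtin = p_builtin, manual = p_manual))
```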

5.4.1. dbinom()

This function gives the probability of each possible number of successes, that is, the
probability mass at each point. Below is an example in which the result of the dbinom()
function is plotted as a graph in Fig. 5.5.
> x <- seq(0, 25, by = 1)
> y <- dbinom(x, 25, 0.5)
> plot(x, y)

Figure 5.5 Plot Using dbinom()

5.4.2. pbinom()

This function gives the cumulative probability of an event. It is a single value
representing the probability. The probability of getting 25 or fewer heads in 50
tosses of a coin is given by the code below.
> x <- pbinom(25, 50, 0.5)
> x
[1] 0.5561376

5.4.3. qbinom()

The function qbinom() takes a probability value as input and returns the smallest number
of successes whose cumulative probability is at least that value. The example below finds
the smallest number of heads for which the cumulative probability reaches 0.5 when a coin
is tossed 50 times.
> x <- qbinom(0.5, 50, 1/2)
> x
[1] 25
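
The link between dbinom(), pbinom() and qbinom() can be checked numerically. The sketch below reuses the 50-toss coin example; the variable names are illustrative.

```r
# pbinom() is the running total of dbinom(); qbinom() inverts pbinom()
# (illustrative sketch)
cum_manual <- sum(dbinom(0:25, size = 50, prob = 0.5))
cum_builtin <- pbinom(25, size = 50, prob = 0.5)
print(c(manual = cum_manual, builtin = cum_builtin))

k <- qbinom(0.5, size = 50, prob = 0.5)
below <- pbinom(k - 1, size = 50, prob = 0.5)   # still under 0.5
at_k <- pbinom(k, size = 50, prob = 0.5)        # first value reaching 0.5
print(c(k = k, below = below, at_k = at_k))
```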

5.4.4. rbinom()

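The function rbinom() generates the required number of random values, each being the
number of successes in the given number of trials with the given probability of success.
The code below generates 5 random values, each the number of successes in 50 trials with a
success probability of 0.5. Because the values are random, the output differs from run to run.
> x <- rbinom(5, 50, 0.5)
> x
[1] 24 21 22 29 32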

5.5. Correlation Analysis

To evaluate the relationship between two or more variables, correlation analysis is used.
The correlation coefficient can be computed in R using the function cor(), and the function
cor.test() additionally tests whether the correlation differs significantly from zero.
The basic syntax of the correlation functions in R is given below.
cor(x, y, method)
cor.test(x, y, method)
x, y - numeric vectors with the same length
method - correlation method ("pearson", "kendall" or "spearman")

Consider the data set "mtcars" available in the R environment. Let us first find
the correlation between the horsepower ("hp") and the miles per gallon ("mpg")
of the cars, and then between the horsepower ("hp") and the engine displacement
("disp") of the cars. From the test we find that the horsepower ("hp") and the
miles per gallon ("mpg") of the cars have a negative correlation (-0.7761684), while
the horsepower ("hp") and the engine displacement ("disp") of the cars have a
positive correlation (0.7909486). In both cases the p-value is far below 0.05, so the
correlations are statistically significant.
> cor(mtcars$hp, mtcars$mpg, method = "pearson")
[1] -0.7761684
> cor.test(mtcars$hp, mtcars$mpg, method = "pearson")

        Pearson's product-moment correlation

data:  mtcars$hp and mtcars$mpg
t = -6.7424, df = 30, p-value = 1.788e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8852686 -0.5860994
sample estimates:
       cor
-0.7761684

> cor(mtcars$hp, mtcars$disp, method = "pearson")
[1] 0.7909486
> cor.test(mtcars$hp, mtcars$disp, method = "pearson")

        Pearson's product-moment correlation

data:  mtcars$hp and mtcars$disp
t = 7.0801, df = 30, p-value = 7.143e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6106794 0.8932775
sample estimates:
      cor
0.7909486
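
The Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations, and the individual parts of a cor.test() result can be extracted by name. The sketch below illustrates both; the variable names are assumptions made for this example.

```r
# Pearson correlation by hand, and pulling values out of cor.test()
# (illustrative sketch)
hp <- mtcars$hp
mpg <- mtcars$mpg
r_manual <- cov(hp, mpg) / (sd(hp) * sd(mpg))
r_builtin <- cor(hp, mpg, method = "pearson")
print(c(manual = r_manual, builtin = r_builtin))

res <- cor.test(hp, mpg, method = "pearson")
estimate <- unname(res$estimate)   # the correlation coefficient
p_value <- res$p.value             # significance of the test
conf_int <- res$conf.int           # 95 percent confidence interval
print(c(estimate = estimate, p_value = p_value))
print(conf_int)
```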

The correlation results can also be viewed graphically as in Fig. 5.6. The corrplot()
function, from the add-on package corrplot (which must be installed first, for example
with install.packages("corrplot")), can be used to analyze the correlation between the
various columns of a dataset, say mtcars. After this, the correlation between individual
columns can be examined by plotting them in separate graphs as in Fig. 5.7 and Fig. 5.8.
> library(corrplot)
> M <- cor(mtcars)
> corrplot(M, method = "number")

Figure 5.6 Cor Plot of mtcars Dataset

> plot(mtcars$hp, mtcars$mpg, xlab = "Horse Power of the Cars",
+      ylab = "Mileage per Gallon of the Cars", pch = 21)

Figure 5.7 Scatter Plot of Negative Correlation

> plot(mtcars$hp, mtcars$disp, xlab = "Horse Power of the Cars",
+      ylab = "Cylinder Displacement of the Cars", pch = 21)

Figure 5.8 Scatter Plot of Positive Correlation

It can be noted that in the graph with negative correlation (Fig. 5.7) the points run
from the top left corner to the bottom right corner, whereas in the graph with positive
correlation (Fig. 5.8) the points run from the bottom left corner to the top right corner.

5.6. Regression Analysis

5.6.1. Linear Regression

Regression analysis is a widely used statistical tool. It establishes a relationship model
between two variables. One of these variables
is called the predictor variable, whose value is gathered through experiments. The
other variable is called the response variable, whose value is derived from the predictor
variable. In linear regression, the two variables are related through an equation in
which the exponent of both variables is 1. A linear relationship is
represented by a straight line when plotted as a graph. A non-linear relationship,
where the exponent of any variable is not equal to 1, creates a curve.

The general mathematical equation for a linear regression is y = ax + b, where
y is the response variable, x is the predictor variable, and a and b are constants
called the coefficients (a is the slope and b is the intercept).

A simple example of regression is predicting the income of a person when his or her
expenditure is known. To do this we need the relationship between the income
and the expenditure of a person. First, carry out the experiment of gathering a sample
of observed values of income and the corresponding expenditures. Then create a
relationship model using the lm() function in R. Find the coefficients from the
model created and form the mathematical equation using these values. Next, get a
summary of the relationship model to see the errors in prediction, which are called
residuals, and their typical size, the residual standard error. Finally, to predict the
income of new persons, use the predict() function in R.

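The function lm() creates the relationship model between the predictor and the
response variable. The basic syntax for the lm() function in linear regression is given
below.
lm(formula, data)
formula - symbolic description of the relation between x and y, for example y ~ x
data - data frame containing the variables named in the formula; it can be omitted when the variables are vectors in the workspace, as in the example below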

> x <- c(1510, 1740, 1380, 1860, 1280, 1360, 1790, 1630, 1520, 1310)
> y <- c(6300, 8100, 5600, 9100, 4700, 5700, 7600, 7200, 6200, 4800)
> model <- lm(y~x)
> model

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
  -3845.509        6.746

> summary(model)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-630.02 -166.29    4.12  189.44  397.75

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3845.5087   804.9013  -4.778  0.00139 **
x               6.7461     0.5191  12.997 1.16e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 325.3 on 8 degrees of freedom
Multiple R-squared:  0.9548,    Adjusted R-squared:  0.9491
F-statistic: 168.9 on 1 and 8 DF,  p-value: 1.164e-06
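
The coefficients reported by lm() can be reproduced from the least-squares formulas: the slope is the sum of the products of the deviations of x and y divided by the sum of the squared deviations of x, and the intercept is mean(y) - slope * mean(x). The following sketch assumes the same x and y vectors as above; the names slope and intercept are illustrative.

```r
# Least-squares coefficients by hand, compared with lm() (illustrative sketch)
x <- c(1510, 1740, 1380, 1860, 1280, 1360, 1790, 1630, 1520, 1310)
y <- c(6300, 8100, 5600, 9100, 4700, 5700, 7600, 7200, 6200, 4800)
model <- lm(y ~ x)

slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

print(coef(model))                                # coefficients from lm()
print(c(intercept = intercept, slope = slope))    # coefficients by hand
```

With these values the fitted equation is approximately y = 6.746x - 3845.509.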

The basic syntax for the function predict() in linear regression is given below.
predict(object, newdata)
object - model created using the lm() function
newdata - data frame containing the new values of the predictor variable, with the same column name as in the model
> expense <- data.frame(x = 4700)
> income <- predict(model, expense)
> income
       1
27861.18
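
The predicted value is simply the fitted equation evaluated at the new x. The sketch below, which assumes the same data as above, also compares the new value with the range of x used to fit the model: 4700 lies far outside that range, so this prediction is an extrapolation and should be treated with caution. The names by_hand and fitted_range are illustrative.

```r
# predict() evaluates intercept + slope * x for the new data (illustrative sketch)
x <- c(1510, 1740, 1380, 1860, 1280, 1360, 1790, 1630, 1520, 1310)
y <- c(6300, 8100, 5600, 9100, 4700, 5700, 7600, 7200, 6200, 4800)
model <- lm(y ~ x)

expense <- data.frame(x = 4700)
predicted <- unname(predict(model, expense))
by_hand <- coef(model)[["(Intercept)"]] + coef(model)[["x"]] * 4700
print(c(predicted = predicted, by_hand = by_hand))

fitted_range <- range(x)   # smallest and largest x seen by the model
print(fitted_range)
print(4700 > fitted_range[2])   # TRUE means the new value is an extrapolation
```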

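The graphical representation of this linear regression is drawn by the code below
and shown in Fig. 5.9. The plot places income (y) on the horizontal axis and expenditure (x)
on the vertical axis, so the line added with abline() is the regression of x on y,
lm(x ~ y), rather than the model y ~ x fitted above.
> plot(y, x, col = "blue", main = "Income & Expenditure Regression",
+      cex = 1.3, pch = 16,
+      xlab = "Income in Rs.", ylab = "Expenditure in Rs.")
> abline(lm(x ~ y))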

Figure 5.9 Plot of Linear Regression

5.6.2. Multiple Regressions

Multiple regression is an extension of linear regression to the relationship between
more than two variables. In simple linear regression we have one predictor and one
response variable, but in multiple regression we have more than one predictor
variable and one response variable.

The equation for multiple regression is given below.

y = a + b_1 x_1 + b_2 x_2 + ... + b_n x_n

In this equation y is the response variable, a, b_1, b_2, ..., b_n are the coefficients and
x_1, x_2, ..., x_n are the predictor variables.

In R, the lm() function is used to create the regression model. The model
determines the values of the coefficients from the input data. We can then predict
the value of the response variable for a given set of predictor variables using these
coefficients. The relationship model is built between the predictor variables and
the response variable. The basic syntax for the lm() function in multiple regression is
given below.
lm(y ~ x1+x2+x3..., data)
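
As an illustration of this syntax, the sketch below fits a multiple regression on the built-in mtcars dataset. The choice of mpg as the response and disp, hp and wt as predictors is an assumption made for this example, and the new car described by new_car is hypothetical.

```r
# Multiple regression with three predictors on mtcars (illustrative sketch)
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
coefs <- coef(fit)   # intercept a followed by one coefficient per predictor
print(coefs)

# Predict the response for a hypothetical car
new_car <- data.frame(disp = 200, hp = 120, wt = 3)
predicted_mpg <- unname(predict(fit, new_car))

# The same prediction written out as y = a + b1*x1 + b2*x2 + b3*x3
by_hand <- coefs[["(Intercept)"]] + coefs[["disp"]] * 200 +
  coefs[["hp"]] * 120 + coefs[["wt"]] * 3
print(c(predicted = predicted_mpg, by_hand = by_hand))
```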

We can conclude that the value of b1 is closer to 1 (1.253135) while the
value of b2 is closer to 2 (2.496484) and not 3.

5.7. Analysis of Variance (ANOVA)

Analysis of Variance (ANOVA) is a statistical method used for investigating
data by comparing the means of subsets of the data. The base case is the one-way
ANOVA, in which the data is sub-divided into groups based on a single
classification factor. The one-way ANOVA is used to verify whether the means of several
groups are equal. This analysis may not be sufficient for more complex problems.
For example, it may be necessary to take two factors of variability into account, to
determine whether the averages between the groups depend on the group classification or
on the second factor under consideration. In this case the two-way analysis of variance
(two-way ANOVA) should be used.

Consider the dataset PlantGrowth available in R for performing one-way
ANOVA. This dataset has two columns: the weight of the plant, indicating its growth,
and the group (control or one of two treatments). We want to test the hypothesis
that the group has an effect on the plant weight. The null hypothesis is that the mean
weight is the same in all groups. The code below performs the test.
> plant = lm(PlantGrowth$weight ~ PlantGrowth$group)
> anova(plant)
Analysis of Variance Table

Response: PlantGrowth$weight
                  Df  Sum Sq Mean Sq F value  Pr(>F)
PlantGrowth$group  2  3.7663  1.8832  4.8461 0.01591 *
Residuals         27 10.4921  0.3886
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The result shows that the F-value is 4.8461 and the p-value is 0.01591, which is less
than 0.05 (5% level of significance). The null hypothesis of equal group means is
therefore rejected; that is, the control group / treatment has an effect on the plant
growth / plant weight.
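
The same test can be written with the data argument and the aov() function, which keeps the column names short in the output. This is an alternative sketch, not the code used above; the names fit, tab, p_value and group_means are illustrative.

```r
# One-way ANOVA on PlantGrowth using aov() and the data argument
# (illustrative sketch)
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]          # ANOVA table as a data frame
p_value <- tab[["Pr(>F)"]][1]     # p-value for the group factor
print(tab)

# Mean weight in each group, to see which groups differ
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
print(group_means)
print(p_value < 0.05)             # TRUE: reject the null hypothesis at 5%
```

The ANOVA itself only indicates that at least one group mean differs; the group means give a first impression of where the difference lies.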

For two-way ANOVA, consider the example below of revenues collected in each month
over 5 years. We want to see whether the revenue depends on the year and / or the
month, or whether it is independent of these two factors.
> revenue = c(15,18,22,23,24, 22,25,15,15,14, 18,22,15,19,21,
+ 23,15,14,17,18, 23,15,26,18,14, 12,15,11,10,8, 26,12,23,15,18,
+ 19,17,15,20,10, 15,14,18,19,20, 14,18,10,12,23, 14,22,19,17,11,
+ 21,23,11,18,14)
> months = gl(12,5)
> years = gl(5, 1, length(revenue))
> fit = aov(revenue ~ months + years)
> anova(fit)
Analysis of Variance Table

Response: revenue
          Df Sum Sq Mean Sq F value Pr(>F)
months    11 308.45  28.041  1.4998 0.1660
years      4  44.17  11.042  0.5906 0.6712
Residuals 44 822.63  18.696

The test statistic for the difference between months is F = 1.4998. This value is
lower than the tabulated critical value, and indeed the p-value (0.1660) is greater than
0.05. So we cannot reject the null hypothesis: the mean revenues of the different months
are not shown to be unequal, and we find no evidence that the variable "months"
has an effect on revenue.

The test statistic for the difference between years is F = 0.5906. This value is also
lower than the tabulated critical value, and the p-value (0.6712) is greater than 0.05.
So we fail to reject the null hypothesis: the mean revenues of the different years are
not shown to be unequal, and we find no evidence that the variable "years" has an
effect on revenue.
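
The tabulated critical values and the p-values referred to above can be obtained in R with qf() and pf(). The sketch below takes the F values and degrees of freedom from the ANOVA table and assumes a 5% level of significance; the variable names are illustrative.

```r
# Critical F values and p-values for the two-way ANOVA table (illustrative sketch)
f_months <- 1.4998   # F value for months (11 and 44 degrees of freedom)
f_years <- 0.5906    # F value for years (4 and 44 degrees of freedom)

crit_months <- qf(0.95, df1 = 11, df2 = 44)   # critical value at the 5% level
crit_years <- qf(0.95, df1 = 4, df2 = 44)

p_months <- pf(f_months, df1 = 11, df2 = 44, lower.tail = FALSE)
p_years <- pf(f_years, df1 = 4, df2 = 44, lower.tail = FALSE)

print(c(F = f_months, critical = crit_months, p = p_months))
print(c(F = f_years, critical = crit_years, p = p_years))
```

In both cases the observed F value is below the critical value, which is equivalent to the p-value being above 0.05.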
