
R Lab: Assumptions of Normality

Part 1. Assessing parametric assumptions


from: Labs using R: 8. The normal distribution and sample means
https://whitlockschluter3e.zoology.ubc.ca/RLabs/R_tutorial_Normal_and_sample_means.html

Assumption of Normality
Many statistical tests assume that the variable being analyzed has a normal distribution. Fortunately,
many of these tests are fairly robust to this assumption—that is, they work reasonably well even when
this assumption is not quite true, especially when sample size is large. However, we still need to be able
to determine whether our data come from a normal distribution, and to know how to proceed if they are
not normal.

Descriptive statistics

First, we could attempt to assess this using descriptive statistics. Let’s use the weight of birds from the
Bumpus data set for an example.

bumpusData <- read.csv("DataForLabs/bumpus.csv")

We already know how to ask R for the basic descriptive statistics that might help us determine
normality: mean, median, standard deviation, IQR.

mean(bumpusData$weight_g) #calculates the mean

median(bumpusData$weight_g) #calculates the median

sd(bumpusData$weight_g) #calculates the standard deviation

IQR(bumpusData$weight_g) #calculates the distance between the third and first quartiles of a variable

But we now know that there are other measures of a normal distribution that might be helpful here too:
specifically, measures of skewness and kurtosis. To obtain these we need to either calculate the values by
hand or install a new package that can calculate them for us. I’m going to use the package fBasics to do this
for me.

install.packages("fBasics")

You’ll potentially get a bunch of text in your console, but the key words you hope to see are “package
‘fBasics’ successfully unpacked and MD5 sums checked”. Once you see this, you’re good to go.

library(fBasics)
skewness(bumpusData$weight_g) #calculates the skewness

# [1] 0.6533563
# attr(,"method")
# [1] "moment"

kurtosis(bumpusData$weight_g) #calculates the kurtosis

# [1] 1.027414
# attr(,"method")
# [1] "excess"

The values we obtain tell us about the degree of skew and the degree of kurtosis. The extra lines here
tell us the method that was used to perform the calculation (we can ignore this). Because the mean is
greater than the median, and skewness is > 0, we know our data are positively skewed (a long right-
hand tail). Because our excess kurtosis is > 0, we know that our distribution is leptokurtic (more sharply
peaked, with heavier tails, than a normal distribution).
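
If you would rather not install a package, these moment-based values can also be computed by hand. Here is a minimal sketch of my own (the object names are mine; it should match the “moment” skewness and “excess” kurtosis reported above, up to rounding):

x <- bumpusData$weight_g
m <- mean(x)
s <- sqrt(mean((x - m)^2))    # standard deviation using a divisor of n rather than n - 1

mean((x - m)^3) / s^3         # skewness: third standardized moment
mean((x - m)^4) / s^4 - 3     # excess kurtosis: fourth standardized moment minus 3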

Is this close enough to call “normal”? Honestly, it’s probably hard to know just from this information (we
expect the mean to be roughly equal to the median, and skewness and excess kurtosis to be close to zero)
because we’re not experienced in seeing this level of information. Maybe we should actually visualize
these data to help.

Visual inspection

The second way we can attempt to assess normality is by visualizing the data. A good way to start is to
simply visualize the frequency distribution of the variable in the data set by drawing a histogram.

Remember we can use ggplot() to draw histograms. (If the ggplot2 package is not already loaded, run
library(ggplot2) first.)

ggplot(bumpusData, aes(x = weight_g)) + geom_histogram(bins = 10)

We specified the data set bumpusData for the function to use. Note that here the numerical variable is
entered as the variable on the x axis (x = weight_g). No y variable is specified because that is simply the
count of individuals at each weight. We’ve also specified the number of bins, but this is not required.

We can also use ggplot() to draw density plots, here I’ll overlay the density plot on the histogram. This
code has a new argument we need to provide to the histogram geometry “aes(y = ..density..)”. This
argument tells ggplot to adjust the y-axis so that the density plot can be shown on top of the histogram.
Note that the y-axis now shows the data as a proportion (not a count/frequency).

ggplot(bumpusData, aes(x = weight_g)) + geom_histogram(aes(y = ..density..), bins = 10) +
  geom_density(color = "red")

This is great and helps us better visualize our data, but we have a small problem here. If we are using
histograms to visualize data we plan to use in a 2-sample t-test (or ANOVA, coming soon!), we don’t
know if our samples came from one population or two populations. Specifically, that is what we’ll be
testing with a t-test or an ANOVA: do these samples come from the same population? If we are doing
one of these tests, we’ll need to examine data from each category separately. To do this, we can use the
“facet” function in ggplot2 to produce multiple histograms at the same time.

Multiple histograms

A multiple histogram plot is used to visualize the frequency distribution of a numerical variable
separately for each of two or more groups. It allows easy comparison of the location and spread of the
variable in the different groups, and it helps to assess whether the assumptions of relevant statistical
methods are met.

Plotting weight again, this time separately for each sex, we can write:

ggplot(bumpusData, aes(x = weight_g)) + geom_histogram(bins = 10) +
  facet_wrap(~ sex, ncol = 1)

The categorical variable is specified in the facet_wrap() function with a formula (~ sex). The “facets”
here are the separate plots for each category of “sex”. The argument “ncol = 1” tells ggplot how many
columns to create to display the histograms; this is not necessary but allows the graphs to be more
spread out horizontally.

If we look at these 2 graphs, we now notice that the weights of males are fairly evenly distributed (i.e.,
not too normal), and both males and females have a value that appears extreme compared to the rest
of the values.

One problem with using histograms is that it is very hard to tell if a sample came from a normal
distribution when sample sizes are small. For example, look at these graphs and try to determine which
are from a normal distribution:
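
You can generate graphs like these yourself with rnorm(); here is a minimal sketch of my own (the choice of six panels with 20 observations each, and the mean and sd used, are mine):

set.seed(1)   # optional: makes the random draws reproducible
demoData <- data.frame(sample = rep(1:6, each = 20),
                       value  = rnorm(6 * 20, mean = 25, sd = 2))
ggplot(demoData, aes(x = value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ sample)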

It was a trick! All of these graphs represent data that was randomly sampled from a normal distribution.
It is not always easy to visually determine normality from these types of graphs. Let’s look at one more
way to visualize our data.

QQ Plots

Another graphical technique that can help us visualize whether a variable is approximately normal is
called a quantile plot (or a QQ plot). The QQ plot shows the data on the vertical axis ranked in order
from smallest to largest (“sample quantiles” in the figure below). On the horizontal axis, it shows the
expected value of an individual with the same quantile if the distribution were normal (“theoretical
quantiles” in the same figure). The QQ plot should follow more or less along a straight line if the data
come from a normal distribution (with some tolerance for sampling variation). Essentially, we’re trying
to visualize the relationship between a given sample and the normal distribution.
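
To make the idea concrete, here is a minimal sketch of my own (not from the lab) that builds this kind of plot by hand: sort the data, then pair each value with the quantile expected for its rank under a standard normal distribution (ppoints() supplies the corresponding probability points). This is essentially what the built-in function described next does for you.

sorted_weights <- sort(bumpusData$weight_g)          # sample quantiles, smallest to largest
expected <- qnorm(ppoints(length(sorted_weights)))   # theoretical quantiles of a standard normal
plot(expected, sorted_weights,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")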

QQ plots can be made in R using a function called qqnorm(). Simply give the vector of data as input and
it will draw a QQ plot for you. (qqline() will draw a line through that Q-Q plot to make the linear
relationship easier to see.)

qqnorm(bumpusData$weight_g) # create a QQ plot of data to visualize normality

qqline(bumpusData$weight_g) # adds a “normal” line onto the QQ plot created with the qqnorm() function

This is what the resulting graph looks like for the weight data. The dots do not land along a perfectly
straight line. In particular, the graph curves at the upper end. This looks fairly close to normal (I would
actually call it normal), but so far, all of these methods are somewhat unreliable (how do I know you and
I are applying the same standard?).

It is difficult to interpret QQ plots without experience. One of the goals of today’s exercises will be to
develop some visual experience about what these graphs look like when the data are truly normal. When
data are not normally distributed, the dots in the quantile plot will not follow a straight line, even
approximately. For example, here is a histogram and a QQ plot for the population sizes of various
countries, from the data in countries.csv. These data are very skewed to the right, and do not follow a
normal distribution at all.
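
If you want to try that example yourself, something along these lines should work; note that I am assuming the file sits in the same DataForLabs folder and that the population column is called “population”, so check names(countriesData) and swap in whatever column name the file actually uses:

countriesData <- read.csv("DataForLabs/countries.csv")
names(countriesData)             # check the real column names before going further
hist(countriesData$population)   # hypothetical column name; replace with the actual one
qqnorm(countriesData$population)
qqline(countriesData$population)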

Shapiro-Wilk test of Normality

The third way we can assess normality is through hypothesis testing, specifically by conducting a
Shapiro-Wilk test of normality. This is the method I recommend for our class. It is definitive, reliable, the
least dependent on your experience, and a good place to start with your data even if you are adept at
other methods.

This is how I expect you will test assumptions of normality for our class. This is a preliminary test you
will conduct before conducting a parametric test in our class.

The Shapiro-Wilk method tests the null hypothesis that the sample distribution is not different from
the normal distribution; the alternative hypothesis is that the distribution is different from normal. If
this test returns a significant result (P < 0.05), then the data are not from a normal distribution. This is
essentially a goodness-of-fit test. To conduct this test, we can use the function shapiro.test().

This test requires one argument, the variable we wish to test for normality.

shapiro.test(bumpusData$weight_g)
## Shapiro-Wilk normality test
## data: bumpusData$weight_g
## W = 0.96966, p-value = 0.003952

The output here is the test statistic “W” and the P-value. This result has a very small P-value, indicating
that these data do not appear to have come from a normal distribution. Our weight data do not meet
parametric assumptions! We need to consider our options from here: ignore the violation, transform the
data, or use a non-parametric test. In the next part we’ll explore some of these solutions.
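
One more note before moving on: as with the histograms earlier, if you plan to compare two groups with a t-test, it is the values within each group that should be roughly normal. A quick sketch of my own, running the same test on each sex separately (using the same subsetting trick that appears later in this lab):

shapiro.test(bumpusData$weight_g[bumpusData$sex == "m"]) # males only
shapiro.test(bumpusData$weight_g[bumpusData$sex == "f"]) # females only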

Assumption of Homogeneity of Variance


from: Labs using R: 9: Comparing the means of two groups
https://whitlockschluter3e.zoology.ubc.ca/RLabs/R_tutorial_Comparing_means_of_2_groups.html

Another key assumption of many statistical tests is that the variance among groups is approximately the
same. Many tests are sensitive to this assumption; they tend not to return reliable results if it is not met.
There are several ways to determine whether your data meet this assumption. For now, though, I want to
introduce the conceptual basis for this (also see the lecture / textbook) and one of the primary methods
we can use to test it.
The method I recommend (and the method we’ll use) is called Levene’s test. Like the Shapiro-Wilk test of
normality, this is a preliminary hypothesis test that we will conduct prior to a parametric test in our class.
For the time being, this is how I expect you will test the assumption of homogeneity of variance. Levene’s
method tests the null hypothesis that the group variances are equal; the alternative hypothesis is that at
least two of the groups’ variances are not equal.

Unfortunately, the basic version of R does not do Levene’s test, but it is available from the package car
with the function leveneTest(). As with other packages, you would need to install the package car first,
and then use library() to load its functions into R. To use leveneTest(), you need three arguments for the
input:

data: the dataframe that contains the data
formula: an R formula that specifies the response variable and the explanatory variable
center: how to calculate the center of each group. For Levene’s test, use center = mean

Let’s see this with the Bumpus data, with weight as the numerical response variable and sex as
the categorical explanatory variable:
install.packages("car")

library(car)

leveneTest(data = bumpusData, weight_g ~ sex, center = mean)

## Levene's Test for Homogeneity of Variance (center = mean)


##        Df F value Pr(>F)
## group   1  0.0121 0.9127
##       134
## Warning message:
## In leveneTest.default(y = y, group = group, ...) : group coerced to factor.

The output is a test statistic “F value” and the P-value “Pr(>F)”. For these data, the P-value of
Levene’s test is P = 0.9127. We fail to reject the null hypothesis that male and female birds have
equal variances for the variable weight.
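
As a quick sanity check on what Levene’s test is comparing, you can also look at the group variances directly; here is a one-line sketch (for intuition only, not a substitute for the test):

tapply(bumpusData$weight_g, bumpusData$sex, var) # sample variance of weight for each sex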
Our data meet the assumption of homogeneity of variance! But we still need to consider our
options for dealing with the failure to meet the normality assumption: ignore it, transformation, or
non-parametric tests. In the next part we’ll explore some of these solutions.

Part 2. Dealing with Assumption Violations


Transformations
One solution for dealing with data that violate the assumption of normality is to try to use simple
mathematical transformation on each data point to change the scale of the original information in a way
that may be better matched to the assumptions of our statistical tests. With a transformation, we apply
the same mathematical function to each value of a given numerical variable for individual in the data
set.

Log

With a log-transformation, we take the logarithm of each individual’s value for a numerical variable. We
can only use the log-transformation if all values are greater than zero.

Taking the log transformation of a variable in R is very simple. We simply use the function log() (the
natural log by default) and apply it to the vector of the numerical variable in question. For example, to
calculate the log of weight for birds in the Bumpus data, we use the command:

log(bumpusData$weight_g)

This will return a vector of values, each of which is the log of weight for the birds. We should probably
save this vector as a variable in our dataframe so we can assess normality for the transformed variable
(see methods above).

To add a new column / variable to your dataframe you can simply assign a vector to a new column
name. For example, to add the log of weight as a column in the bumpusData dataframe, we can write:
bumpusData$log_weight <- log(bumpusData$weight_g)

You should inspect the dataframe to see that the variable “log_weight“ is now a column in
bumpusData.

At this point, you would now assess the normality of the transformed variable using the Shapiro-Wilk
test, and compare variances between the two groups using Levene’s test, to see if it meets our
assumptions. If it does, you can proceed with your analysis using the new variable. If it still doesn’t meet
assumptions, we need to keep trying alternate solutions.
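
For example, the re-check for the log-transformed weight might look like the sketch below (the same pattern applies to the square-root and arcsine transformations that follow):

shapiro.test(bumpusData$log_weight)                            # is the transformed variable normal?
leveneTest(data = bumpusData, log_weight ~ sex, center = mean) # are the group variances still equal?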

Square root

With a square-root transformation, we take the square root of each individual’s value for a numerical
variable. We can only use this transformation if all values are greater than or equal to zero.

To perform the square root transformation for a variable (e.g., the weight variable in the Bumpus
dataframe), we use the command:

bumpusData$sqrt_weight <- sqrt(bumpusData$weight_g)

At this point, you would now assess the normality of the transformed variable using the Shapiro-Wilk
test, and compare variances between the two groups using Levene’s test, to see if it meets our
assumptions. If it does, you can proceed with your analysis using the new variable. If it still doesn’t meet
assumptions, we need to keep trying alternate solutions.

Arcsine

If we have proportion data that are not normally distributed, we can use the arcsine transformation to
bring them closer to normality. This transformation takes the arcsine of the square root of our data. It’s
arcane, but it works in specific situations.

The Bumpus data set we’re working with right now doesn’t have any proportional data, so I’m going to
create some by making a new, make-believe variable called “proportion_weight” like this:

proportion_weight <- bumpusData$weight_g / max(bumpusData$weight_g)

This is totally manufactured. You should not do this with a real data set; we’re only doing it for the
purposes of learning the code to perform the transformation. Again, you can only use this
transformation for actual proportion / percentage data.

Now we can transform our make-believe variable:

bumpusData$arcsin_weight <- asin(sqrt(proportion_weight))

At this point, you would now assess the normality of the transformed variable using the Shapiro-Wilk
test, and compare variances between the two groups using Levene’s test, to see if it meets our
assumptions. If it does, you can proceed with your analysis using the new variable. If it still doesn’t meet
assumptions, we need to keep trying alternate solutions.

Mann-Whitney U-test
Another solution for dealing with data that violate assumptions of normality (and sometimes
homogeneity of variance) is to use a branch of statistics that does not require these assumptions to be
met: non-parametric statistics. Today we’ll see a non-parametric alternative to the two-sample t-test, the
Mann-Whitney U-test.

The Mann-Whitney U-test is used to compare the distributions of two groups. The test essentially
transforms our data into ranks (ordinal data), and removes the need to rely on the normal / t-
distribution to perform our calculations / inference.

Annoyingly, in R, this test is called a Wilcoxon ranked-sum test (same test, different name). So we’ll
perform this test using the function wilcox.test().

To use wilcox.test(), you need one argument: a formula that specifies the response variable
and the explanatory variable.
wilcox.test(bumpusData$weight_g ~ bumpusData$sex)

OR

wilcox.test(weight_g ~ sex, data = bumpusData)

## Wilcoxon rank sum test with continuity correction
## data: weight_g by sex
## W = 1438.5, p-value = 0.001687
## alternative hypothesis: true location shift is not equal to 0

The output here is the test statistic “W”, which is equivalent to the “U” statistic, and the P-value for the
test. This result suggests that the null hypothesis can be rejected: there is a difference in the distribution
of weights for male and female birds.

We need to find sample sizes to report in our statistical evidence (the number of males and females).

length(bumpusData$weight_g[bumpusData$sex == "m"])

## [1] 87
length(bumpusData$weight_g[bumpusData$sex == "f"])

## [1] 49
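
(As an aside, table() gives both counts at once by tallying every level of the categorical variable:

table(bumpusData$sex) # counts of female ("f") and male ("m") birds in one call
)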

Male and female house sparrows have significantly different weight distributions (U = 1438.5, n1 = 87,
n2 = 49, P = 0.002).

Questions
1. It is often difficult to interpret histograms and QQ plots without experience. One of today’s
goals is to develop some visual experience about what these graphs look like when the data is
truly normal. To do that, we will take advantage of a function built into R to generate random
numbers drawn from a normal distribution. This function is called rnorm().

The function rnorm() will return a vector of numbers, all drawn randomly from a normal
distribution. It takes three arguments:
n: how many random numbers to generate (the length of the output vector)
mean: the mean of the normal distribution to sample from
sd: the standard deviation of the normal distribution

a) Use function rnorm() to create a vector named “normal” that contains 100 random
numbers drawn from a normal distribution with mean 13 and standard deviation 4. Provide
your R code.

normal <- rnorm(n = 100, mean = 13, sd = 4)

b) Use ggplot() to create a histogram with an overlaid density plot of your “normal” vector
(remember, this function requires data to be in a dataframe). Provide your R code and your
graph.

normal_data_frame <- as.data.frame(normal)

ggplot(normal_data_frame, aes(x = normal)) + geom_histogram(bins = 10)

ggplot(normal_data_frame, aes(x = normal)) + geom_histogram(aes(y = ..density..), bins = 10) +
  geom_density(color = "red")

c) Create a QQ plot (using functions qqnorm() and qqline()) of the “normal” vector you
created in the previous step. Provide your R code and your graph.

qqnorm(normal_data_frame$normal)
qqline(normal_data_frame$normal)
d) Rerun your code in parts A through C at least 10 times. Each time you run the code, look
at the histograms and QQ plots. Think about the ways in which these look different from the
expectation of a normal distribution (but remember that each of these samples comes from a
truly normal population). In a sentence or two, describe the variation you see in these graphs.

Each time, the graphs and density lines look slightly different from one another because a new
random sample is drawn each time; the mean and sd are fixed, but the individual values change.
That said, each graph still follows a roughly normal distribution, with only modest variation,
because every sample comes from a truly normal population.
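
One way to make those reruns less tedious is a small loop; here is a sketch of my own (the 3 x 4 grid of QQ plots is an arbitrary layout choice):

par(mfrow = c(3, 4))              # arrange the next 12 plots in a 3 x 4 grid
for (i in 1:12) {
  normal <- rnorm(n = 100, mean = 13, sd = 4)
  qqnorm(normal)
  qqline(normal)
}
par(mfrow = c(1, 1))              # reset the plotting layout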

2. In 1898, Hermon Bumpus collected house sparrows that had been caught in a severe winter
storm in Chicago. He made several measurements on these sparrows, and his data are in the file
“bumpus.csv” in the “DataForLabs” ABD folder. Bumpus used these data to observe differences
between the birds that survived and those that died from the storm. Let’s use these data to
practice testing for the fit of a normal distribution.
a) Use ggplot() to plot the distribution of total length (“total_length_mm”; this is the length
of the bird from beak to tail). Does the data look as though it comes from a distribution that is
approximately normal? Provide your R code and your graph.

SparrowsData <- read.csv("DataForLabs/bumpus.csv") # read the Bumpus data into the dataframe used below

ggplot(SparrowsData, aes(x = total_length_mm)) + geom_histogram(bins = 10)

ggplot(SparrowsData, aes(x = total_length_mm)) + geom_histogram(aes(y = ..density..), bins = 10) +
  geom_density(color = "red")

b) Use ggplot() to plot the distribution of total length separately for survivors versus non-
survivors. Do these distributions both appear normal? Provide your R code and your graph.

ggplot(SparrowsData, aes(x = total_length_mm)) + geom_histogram(bins = 10) +
  facet_wrap(~ survival, ncol = 1)

Both of these distributions appear to be evenly distributed, but neither one of them appears
to be normally distributed.

c) Use qqnorm() and qqline() to plot a QQ plot for total length. What is your interpretation
of the fit of these data to a normal distribution? Provide your R code and your graph.

qqnorm(SparrowsData$total_length_mm)
qqline(SparrowsData$total_length_mm)

d) Let’s make it more official. Use a hypothesis test to determine if the total length data
come from a normal distribution. Provide your R code and state whether the data meet the
assumption of normality or not.

shapiro.test(SparrowsData$total_length_mm)

The data do not meet the assumption of normality because the p-value is less than 0.05.

e) Test for homogeneity of variances in total length across survivors versus non-survivors.
Provide your R code and state whether the data meet this assumption or not.

leveneTest(data = SparrowsData, total_length_mm ~ survival, center = mean)


The data meet the assumption of homogeneity of variance since the p-value is greater than
0.05.

3. Using the Bumpus data from question #2, let’s see if we can make the data fit a normal
distribution through transformation. Perform a square root transformation on the total length
data, use qqnorm() and qqline() to visualize the transformed variable, then perform a Shapiro-
Wilk test. Provide your R code for all steps, your QQ plot, and a statement indicating if the
transformed data meet the assumption of normality.

SparrowsData$sqrt_total_length_mm <- sqrt(SparrowsData$total_length_mm)
SparrowsData$sqrt_total_length_mm
qqnorm(SparrowsData$sqrt_total_length_mm)
qqline(SparrowsData$sqrt_total_length_mm)
shapiro.test(SparrowsData$sqrt_total_length_mm)
wilcox.test(SparrowsData$sqrt_total_length_mm ~ SparrowsData$survival)
The transformed data do not meet the assumption of normality.

4. Again, let’s use the Bumpus data from the previous question. This time, let’s test if there is a
difference in size (in total length) between birds that survived the storm versus those that did not
survive the storm. Based on your results from question #2 of the untransformed data, make a
determination of which statistical test you should use to answer this question. Perform an
appropriate test on the untransformed data to answer the question.

a) What statistical test will you use?


Mann-Whitney U-test

b) State an appropriate null hypothesis and alternative hypothesis.


Null Hypothesis: There is no difference in size (in total length) between birds that survived
the storm versus those that did not survive the storm.
Alternative Hypothesis: There is a difference in size (in total length) between birds that
survived the storm versus those that did not survive the storm.
c) Calculate and state your degrees of freedom or sample size (show your work).
length(SparrowsData$total_length_mm[SparrowsData$survival == "survived"])
length(SparrowsData$total_length_mm[SparrowsData$survival == "died"])

d) Use R to conduct your test (calculate a test statistic and P-value) and provide your R code
for the test. (2 points)
wilcox.test(SparrowsData$total_length_mm ~ SparrowsData$survival)

e) State a biological conclusion regarding your hypotheses and properly cite your statistical
evidence (test statistic, df or sample size, and P-value).
There is a difference in size (in total length) between birds that survived the storm versus
those that did not survive the storm (U = 3011.5, n1 = 72, n2 = 64, P = 0.001973).

Mikhayla’s R Script:
https://docs.google.com/document/d/1ybMh32TP6paEC3eRTWVtRn1_2CzzoeVCSS-0DKbT1Co/edit?usp=sharing

Hannah’s R Script:
https://docs.google.com/document/d/138Zn6R_GnvaggxjivlULR_NQR9qDaT1_la6iWxwoh8w/edit?usp=sharing

Viktoria’s R Script:
https://docs.google.com/document/d/1jd21sM584tYlbAuI8Eecq4g8RXnNXSnSu_i1XHzTqjM/edit?usp=sharing

Kaylene’s R Script:
https://drive.google.com/file/d/1FRhECz-ZzzBGQ20ZgD6uS8UsbWVvY4Hy/view?usp=sharing
Carly’s R Script:
