Lab3Instructions Knitr
2024-09-16
In this lab we are going to learn how to read data into R and perform some descriptive statistics.
First, set a working directory. You can do this with the setwd function, or from the menu in RStudio:
Session > Set Working Directory > Choose Directory... Note that your file will not be visible in the dialog; navigate to the
folder that contains it and hit ‘Open’.
RStudio will then post something similar to the line below in your console.
setwd("~/Library/CloudStorage/OneDrive-DePaulUniversity/DePaul/Teaching/2025WQ/BIO206/Labs/Day_4")
list.files()
The data is in CSV format (comma-separated values). You can see the commas if you open this file in a text
editor.
If you save a spreadsheet from Excel in this format, it will not save any plots. It will only save the text data.
Read in our data and name it something meaningful. Here I have named it BatData - BatData is now an
object in R.
BatData<-read.csv("batbrains.csv")
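Once a file is read in, it is worth checking that R interpreted it the way you expected. The sketch below is only an illustration: it writes a tiny made-up CSV to a temporary file and reads it back, so it runs without the lab data. The object names toy_file and toy and all of the values are hypothetical; only the column names family and brain_size come from this lab.

```r
# Made-up rows for illustration only; these are NOT the values in batbrains.csv.
toy_file <- tempfile(fileext = ".csv")
writeLines(c("family,brain_size",
             "Hipposideridae,210.5",
             "Hipposideridae,180.2",
             "Molossidae,402.8"),
           toy_file)

toy <- read.csv(toy_file)  # same function as in the lab, pointed at the toy file

head(toy)  # first few rows
str(toy)   # column names and types
dim(toy)   # number of rows, then number of columns
```

The same three checks (head, str, dim) can be run on BatData after reading in the real file.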
Let’s examine some data and plot a grouped frequency distribution. We’ll begin with a histogram.
In the histogram call, the dollar sign allows us to access a variable directly (for example, BatData$brain_size). You can also use indexing
(i.e., X[,3]) to access a variable. After you type the dollar sign, RStudio will offer the variable names as options that you can click as a shortcut. We
add two other arguments separated by commas, xlab and main.
xlab lets us change the x-axis label.
main is the title. I set it to NULL, which removes the title.
What does the distribution look like?
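The histogram chunk itself is not shown in this copy of the handout. A minimal sketch of what such a call could look like is given here, using a made-up data frame (toy) in place of BatData so that it runs on its own. The values, the species column, and the axis label text are assumptions modelled on the later chunks, not the original code.

```r
# Made-up data standing in for BatData; values are illustrative only.
toy <- data.frame(
  species    = c("a", "b", "c", "d", "e", "f"),
  family     = c("Hipposideridae", "Hipposideridae", "Hipposideridae",
                 "Molossidae", "Molossidae", "Molossidae"),
  brain_size = c(150, 210, 260, 380, 450, 620)
)

# The dollar sign and column indexing return the same values:
# brain_size is the third column of this toy data frame.
identical(toy$brain_size, toy[, 3])

# xlab sets the x-axis label; main = NULL removes the title.
hist(toy$brain_size, xlab = "Brain Size (mm3)", main = NULL)
```

With the real data, the same pattern would use BatData$brain_size as the first argument.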
[Figure: histogram; y-axis: Frequency, 0 to 12]
What if I want to see separate histograms for each bat family? I can achieve this using the subset function.
Here it creates two new data frames, which I have named Hip and Mol after the bat families.
The subset function takes a logical statement (using the == symbol) that is either true or false for each row
of the data, and returns the rows for which it is true.
The first line keeps all the data for which the family column entry reads ‘Hipposideridae’; the second line
does the same for ‘Molossidae’.
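To see what the logical statement does on its own, here is a small sketch with a made-up data frame. The names toy, toy_hip, and toy_mol and all of the values are hypothetical; only the family names and column names come from this lab.

```r
# Made-up data for illustration only.
toy <- data.frame(
  family     = c("Hipposideridae", "Molossidae", "Hipposideridae", "Molossidae"),
  brain_size = c(150, 380, 210, 450)
)

# The == test is evaluated row by row and gives TRUE or FALSE for each row.
toy$family == "Hipposideridae"

# subset() keeps only the rows where the test is TRUE.
toy_hip <- subset(toy, family == "Hipposideridae")
toy_mol <- subset(toy, family == "Molossidae")

nrow(toy_hip)  # number of rows kept for Hipposideridae
nrow(toy_mol)  # number of rows kept for Molossidae
```

Note that a single = would not work here: inside subset, the comparison has to be written with ==.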
I also add some formatting to the plots using the par function. The pty argument allows me to make the
plot square, and the mfrow argument allows me to define how I want the two plots arranged. In this case I
ask for 1 row and 2 columns. Note that mfrow needs two values, which is why I use the c function to join
the two values together in a vector.
Hip<-subset(BatData, family=="Hipposideridae")
Mol<-subset(BatData, family=="Molossidae")
par(pty='s', mfrow=c(1,2))
hist(Hip$brain_size, xlab="Hipposideridae Brain Size (mm3)", main = NULL)
hist(Mol$brain_size, xlab="Molossidae Brain Size (mm3)", main = NULL)
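One thing to be aware of is that settings made with par stay in effect for later plots on the same graphics device. The sketch below shows one common way to handle this, assuming you want to return to the previous layout afterwards. The name old_par and the plotted values are made up for illustration.

```r
# par() returns the previous values of the settings it changes,
# so they can be stored and put back later.
old_par <- par(pty = "s", mfrow = c(1, 2))

hist(c(150, 210, 260, 380), xlab = "Made-up values A", main = NULL)
hist(c(380, 450, 620, 700), xlab = "Made-up values B", main = NULL)

# Restore the earlier settings so later plots are not squeezed
# into the 1-row, 2-column layout.
par(old_par)
```

This is optional in a knitted document, but it can save confusion when you run several plots one after another in the console.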
[Figure: two side-by-side histograms; x-axes: Hipposideridae Brain Size (mm3) and Molossidae Brain Size (mm3), 100 to 700; y-axes: Frequency]
mean(BatData$brain_size)
## [1] 387.3432
median(BatData$brain_size)
## [1] 369.6
sd(BatData$brain_size)
## [1] 183.1476
But what if I wanted the mean for each bat family? I can use the aggregate function.
First, you join together all the continuous variables that you’re interested in using the function cbind. Then
you tell it which categorical variable you want to find the mean/median/sd for, in this case family.
FUN means the ‘function’ we wish to apply.
aggregate(x = cbind(brain_size,amygdala,hippocampus)~family, FUN="mean", data = BatData)
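To get a different summary, you swap the function given to FUN. The sketch below is a small illustration on made-up data. The names toy, toy_means, and toy_sds and all of the values are hypothetical, and the formula is passed as the first argument without naming it.

```r
# Made-up data for illustration only.
toy <- data.frame(
  family      = c("Hipposideridae", "Hipposideridae", "Molossidae", "Molossidae"),
  brain_size  = c(100, 200, 400, 600),
  hippocampus = c(10, 20, 30, 50)
)

# Mean of each continuous variable within each family.
toy_means <- aggregate(cbind(brain_size, hippocampus) ~ family,
                       data = toy, FUN = mean)
toy_means

# Swapping FUN gives a different summary, here the standard deviation.
toy_sds <- aggregate(cbind(brain_size, hippocampus) ~ family,
                     data = toy, FUN = sd)
toy_sds
```

The result is itself a data frame, with one row per family and one column per variable joined by cbind.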
Boxplots are a great way to illustrate a continuous variable grouped by a discrete variable. They show you
the range (excluding outliers), the interquartile range, and the median of your data. You can see how your
continuous data is distributed.
The general formula for producing a boxplot is as follows: boxplot(continuous~categorical). Or, in other
words, boxplot(Dependent~Independent). You give the boxplot function your whole data frame (data =
BatData), so you do not need to use the $ here.
par(pty='s')
boxplot(hippocampus~family,data = BatData, xlab = "Family", ylab = "Hippocampus (mm3)")
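If you want the numbers behind the boxes rather than the picture, boxplot can return them. This is a minimal sketch on made-up data. The names toy and bp and the values are hypothetical.

```r
# Made-up data for illustration only.
toy <- data.frame(
  family      = rep(c("Hipposideridae", "Molossidae"), each = 5),
  hippocampus = c(10, 12, 14, 16, 18, 20, 25, 30, 35, 40)
)

# plot = FALSE returns the numbers instead of drawing the plot.
bp <- boxplot(hippocampus ~ family, data = toy, plot = FALSE)

# One column per family; rows are lower whisker, lower hinge,
# median, upper hinge, upper whisker.
bp$stats
bp$names  # group labels, in the same order as the columns

# The median row should agree with median() computed per family.
tapply(toy$hippocampus, toy$family, median)
```

The hinges are approximately the first and third quartiles, so the distance between them is close to the interquartile range drawn as the box.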
[Figure: boxplot of hippocampus size (mm3) by Family: Hipposideridae, Molossidae]
Let’s produce a scatter plot with two continuous variables. I can produce my plot in two different ways. The
first uses the ~ symbol: you place the y variable (dependent) first, then x.
par(pty='s')
plot(amygdala~brain_size, data = BatData,
xlab = "Brain Size (mm3)", ylab = "Amygdala (mm3)")
[Figure: scatter plot; x-axis: Brain Size (mm3); y-axis: Amygdala (mm3)]
You can also explicitly define x and y. However, you must then use the $ sign to tell R which data frame each variable comes from.
Note that you can also get superscript characters in the axis labels using the expression function. It is a fairly complex function, so I wanted to introduce it
at the very end.
par(pty='s')
plot(x = BatData$brain_size, y = BatData$amygdala,
xlab = expression('Brain Size (mm'^3*')'), ylab = expression('Amygdala (mm'^3*')'))
[Figure: scatter plot with superscript units in the axis labels; y-axis: Amygdala (mm3)]