
Lab3Instructions Knitr

This lab focuses on reading data into R and performing descriptive statistics, specifically using a dataset on bat brain sizes. It covers setting a working directory, reading CSV files, creating histograms, calculating mean, median, and standard deviation, and producing boxplots and scatter plots. The lab emphasizes the importance of visualizing data and understanding the distribution of variables within different bat families.

Uploaded by

Jai Calatrava

Lab #3

2024-09-16

In this lab we are going to learn how to read data into R and perform some descriptive statistics.
First, set a working directory. You can do this with the setwd function or through the menu in R-Studio:
Session > Set Working Directory > Choose Directory... Note that your files will not be visible in the chooser. Navigate to the
folder that contains them and hit 'Open'.
R-Studio will then post something similar to the line below in your console.

setwd("~/Library/CloudStorage/OneDrive-DePaulUniversity/DePaul/Teaching/2025WQ/BIO206/Labs/Day_4")

Is my file in the directory I just selected?

list.files()

## [1] "batbrains.csv"                "batbrainsFull.csv"
## [3] "Day_4_Script.R"               "Lab3Instructions_Knitr.pdf"
## [5] "Lab3Instructions_Knitr.Rmd"   "LabWorksheet_3_Complete.docx"
## [7] "LabWorksheet_3_KEY.pdf"       "LabWorksheet_3.docx"
## [9] "Worksheet_Hist.pdf"           "Worksheet_scatter.pdf"
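
Rather than reading the listing by eye, you can also ask R directly whether a file is there. The lines below are a minimal sketch of that check. They work in a temporary folder with an empty stand-in file (the folder name lab3_demo is made up for illustration) so that they run anywhere; in the lab you would simply call file.exists("batbrains.csv") after setting your working directory.

```r
# Work in a temporary folder so the example does not touch your lab files.
demo_dir <- file.path(tempdir(), "lab3_demo")      # hypothetical folder name
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "batbrains.csv"))  # empty stand-in file

# file.exists() returns TRUE or FALSE for each path you give it.
found <- file.exists(file.path(demo_dir, c("batbrains.csv", "missing.csv")))
print(found)

# list.files() can also filter by a pattern, here names ending in .csv.
csv_files <- list.files(demo_dir, pattern = "\\.csv$")
print(csv_files)
```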

The data is in CSV (comma-separated values) format. You can see the commas if you open this file in a text
editor.
Excel cannot save any plots in this format. It will only save the cell contents as text.
Read in our data and name it something meaningful. Here I have named it BatData - BatData is now an
object in R.

BatData<-read.csv("batbrains.csv")
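
It is worth checking that the data were read in the way you expect before plotting anything. The sketch below shows one way to do that. It writes a tiny stand-in CSV with made-up values (DemoData and its contents are hypothetical, not the real bat data) so it can run on its own; in the lab you would call the same three functions on BatData.

```r
# Write a tiny stand-in CSV (values are made up, not the real bat data).
demo_file <- tempfile(fileext = ".csv")
writeLines(c("species,family,brain_size",
             "Bat A,Molossidae,250.5",
             "Bat B,Hipposideridae,410.0"),
           demo_file)

DemoData <- read.csv(demo_file)

str(DemoData)   # column names and the type of each column
head(DemoData)  # the first few rows
dim(DemoData)   # number of rows, then number of columns
```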

Let’s examine some data and plot a grouped frequency distribution. We’ll begin with a histogram.
In the code chunk below, the dollar sign allows us to access a variable directly. You can also use indexing
(e.g., X[,3]) to access a variable. After you type the dollar sign, R-Studio will offer the variable names as options that you can click as a shortcut. We
add two other arguments separated by commas, xlab and main.
xlab lets us change the x-axis label.
main is the title. I set it to NULL to remove it.
What does the distribution look like?

hist(BatData$brain_size, xlab="Brain Size (mm3)", main = NULL)

[Figure: histogram of bat brain size. x-axis: Brain Size (mm3); y-axis: Frequency.]
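
To make the two ways of accessing a variable concrete, here is a small sketch on a made-up data frame (Demo and its values are hypothetical). It pulls out the same column with the dollar sign, by position, and by name, and checks that all three give the same result.

```r
# A made-up data frame with the variable of interest in column 3.
Demo <- data.frame(
  species    = c("Bat A", "Bat B", "Bat C"),
  family     = c("Molossidae", "Molossidae", "Hipposideridae"),
  brain_size = c(250.5, 410.0, 130.2)
)

by_name     <- Demo$brain_size       # dollar sign: access by variable name
by_position <- Demo[, 3]             # indexing: all rows, third column
by_label    <- Demo[, "brain_size"]  # indexing: all rows, column chosen by name

# identical() reports whether two objects are exactly the same.
identical(by_name, by_position)
identical(by_name, by_label)
```

Indexing by position breaks silently if the column order ever changes, so using the name is usually the safer habit.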

What if I want to see separate histograms for each bat family? I can achieve this using the subset function.
The subset function will create two new data frames that I have named Hip and Mol based on the names of
the bat families.

Hip<-subset(BatData, subset = BatData$family == "Hipposideridae")


Mol<-subset(BatData, subset = BatData$family == "Molossidae")

The subset function uses a logical test (the == operator) that evaluates to TRUE or FALSE for every row,
and it keeps only the rows where the test is TRUE.
The first line returns all the rows for which the family column reads 'Hipposideridae'; the second line does the same
for 'Molossidae'.
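
If you want to see the logical test at work, the sketch below runs it on a made-up data frame (Demo, is_hip, and the values are hypothetical). It prints the TRUE/FALSE vector that == produces, then uses it to pick rows, both with subset and with square-bracket indexing.

```r
# Made-up data: two rows from each family.
Demo <- data.frame(
  family     = c("Hipposideridae", "Molossidae", "Hipposideridae", "Molossidae"),
  brain_size = c(130.2, 250.5, 410.0, 395.1)
)

# The test gives one TRUE or FALSE per row.
is_hip <- Demo$family == "Hipposideridae"
print(is_hip)

# subset() keeps the rows where the test is TRUE ...
HipDemo <- subset(Demo, subset = is_hip)

# ... and so does indexing the rows with the same logical vector.
HipDemo2 <- Demo[is_hip, ]

print(HipDemo)
print(HipDemo2)
```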
I also add some formatting to the plots using the par function. The pty argument allows me to make the
plot square, and the mfrow argument allows me to define how I want the two plots arranged. In this case I
ask for 1 row and 2 columns. Note that mfrow needs two values, which is why I use the c function to join
them together in a vector.

par(pty='s', mfrow=c(1,2))
hist(Hip$brain_size, xlab="Hipposideridae Brain Size (mm3)", main = NULL)
hist(Mol$brain_size, xlab="Molossidae Brain Size (mm3)", main = NULL)

[Figure: two histograms side by side. Left panel x-axis: Hipposideridae Brain Size (mm3); right panel x-axis: Molossidae Brain Size (mm3); y-axis of both panels: Frequency.]
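
One thing to be aware of when working in the console: settings made with par stay in effect for later plots on the same graphics device. The sketch below shows one common way to put things back afterwards. The name old_par and the two samples are made up for illustration.

```r
# par() returns the previous settings when you change them, so keep a copy.
old_par <- par(pty = "s", mfrow = c(1, 2))

# Two made-up samples, only here to have something to draw.
hist(c(1, 2, 2, 3, 3, 3, 4), xlab = "Sample 1", main = NULL)
hist(c(5, 6, 6, 7, 8, 8, 8), xlab = "Sample 2", main = NULL)

# Restore the earlier settings so the next plot is no longer split in two.
par(old_par)
```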

Let’s calculate some other descriptive statistics.


Let’s begin with the mean, median, and standard deviation. Note that the mean is noticeably larger than
the median, which suggests the data are likely skewed to the right.

mean(BatData$brain_size)

## [1] 387.3432

median(BatData$brain_size)

## [1] 369.6

sd(BatData$brain_size)

## [1] 183.1476
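
To see why a gap between the mean and the median hints at skew, here is a small sketch with made-up numbers (x is hypothetical, not the bat data). One unusually large value pulls the mean upward but barely moves the median.

```r
# Made-up values with one large observation in the right tail.
x <- c(2, 3, 3, 4, 4, 5, 20)

mean(x)    # pulled toward the large value
median(x)  # the middle value, unaffected by how large the maximum is

# Drop the large value and the two measures agree again.
x_trimmed <- x[-7]
mean(x_trimmed)
median(x_trimmed)
```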

But what if I wanted a mean for each bat family? I can use the aggregate function.
First, you join together all the continuous variables you're interested in using the function cbind. Then, after the ~ symbol,
you tell it which categorical variable to group by when finding the mean/median/sd, in this case, family.
FUN is the function we wish to apply to each group.

aggregate(x = cbind(brain_size,amygdala,hippocampus)~family, FUN="mean", data = BatData)

##           family brain_size amygdala hippocampus
## 1 Hipposideridae   380.8600 15.22500    27.04500
## 2     Molossidae   394.9706 19.32941    20.00588

aggregate(x = cbind(brain_size,amygdala,hippocampus)~family, FUN="median", data = BatData)

##           family brain_size amygdala hippocampus
## 1 Hipposideridae     279.05    11.75       19.65
## 2     Molossidae     391.80    19.50       20.80

aggregate(x = cbind(brain_size,amygdala,hippocampus)~family, FUN="sd", data = BatData)

##           family brain_size amygdala hippocampus
## 1 Hipposideridae   213.3609 8.729857   14.930170
## 2     Molossidae   145.9420 6.037256    6.020742
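
If the aggregate call feels like a black box, it can help to run it on data small enough to check by hand. The sketch below uses a made-up data frame (Demo, with families "A" and "B", is hypothetical) and then cross-checks one column with tapply, another base R function that applies a function within groups.

```r
# Made-up data: two rows per family, chosen so the means are easy to verify.
Demo <- data.frame(
  family     = c("A", "A", "B", "B"),
  brain_size = c(100, 200, 300, 500),
  amygdala   = c(10, 20, 30, 50)
)

# Same pattern as above: continuous variables on the left, grouping on the right.
means <- aggregate(cbind(brain_size, amygdala) ~ family, data = Demo, FUN = mean)
print(means)

# Cross-check a single variable: mean brain_size within each family.
tapply(Demo$brain_size, Demo$family, mean)
```

By hand, family A should come out at (100 + 200) / 2 = 150 for brain_size and family B at (300 + 500) / 2 = 400.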

Boxplots are a great way to illustrate a continuous variable grouped by a discrete variable. They show you
the range (excluding outliers), the interquartile range, and the median of your data. You can see how your
continuous data is distributed.
The general formula for producing a boxplot is as follows: boxplot(continuous~categorical). Or, in other
words, boxplot(Dependent~Independent). You give the boxplot function your whole data frame (data =
BatData), so you do not need to use the $ here.

par(pty='s')
boxplot(hippocampus~family,data = BatData, xlab = "Family", ylab = "Hippocampus (mm3)")

[Figure: boxplots of hippocampus size for each family. x-axis: Family (Hipposideridae, Molossidae); y-axis: Hippocampus (mm3).]
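
The numbers a boxplot draws can also be printed, which is a handy way to check your reading of the plot. The sketch below uses made-up values (y is hypothetical) and the boxplot.stats function, which returns the five values used to draw the box and whiskers along with any points flagged as outliers.

```r
# Made-up values with one point far above the rest.
y <- c(10, 12, 15, 18, 20, 22, 25, 27, 50)

median(y)                   # the thick line inside the box
quantile(y, c(0.25, 0.75))  # quartiles, close to the edges of the box
IQR(y)                      # interquartile range

box <- boxplot.stats(y)
box$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
box$out    # points drawn individually as outliers
```

Note that the upper whisker stops at the largest value that is not flagged as an outlier, which is why the range shown by the whiskers excludes outliers.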

Let’s produce a scatter plot with two continuous variables. I can produce my plot in two different ways. The
first involves the use of the ~ symbol. Here, you place the y variable (dependent) first, then x.

par(pty='s')
plot(amygdala~brain_size, data = BatData,
xlab = "Brain Size (mm3)", ylab = "Amygdala (mm3)")

[Figure: scatter plot of amygdala against brain size. x-axis: Brain Size (mm3); y-axis: Amygdala (mm3).]

You can also explicitly define x and y. However, you must use the $ sign to tell R which data frame each variable comes from.
Note that you can get superscript characters. It's just a really complex function, so I wanted to introduce it
at the very end.

par(pty='s')
plot(x = BatData$brain_size, y = BatData$amygdala,
xlab = expression('Brain Size (mm'^3*')'), ylab = expression('Amygdala (mm'^3*')'))
[Figure: the same scatter plot of amygdala against brain size, with the 3 in the units set as a superscript. x-axis: Brain Size (mm3); y-axis: Amygdala (mm3).]
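
To unpack the expression syntax a little: inside expression, quoted pieces are plain text, ^ raises what follows to a superscript, and * joins pieces together with no space in between. The sketch below builds the two labels separately and uses them on three made-up points (the names x_label and y_label and the point values are hypothetical). The second label shows an alternative spelling with paste that should give the same kind of result.

```r
# 'Brain Size (mm'  then superscript 3  then the closing bracket ')'
x_label <- expression('Brain Size (mm'^3*')')

# Alternative: paste() inside expression() joins text and math pieces.
y_label <- expression(paste("Amygdala (", mm^3, ")"))

# Three made-up points, only here so the labels have a plot to sit on.
plot(x = c(100, 300, 500), y = c(5, 15, 25),
     xlab = x_label, ylab = y_label)
```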
