3.DataFrames.GGPlot2

The document provides an introduction to programming and data science using R, focusing on key concepts such as error messages, data frames, and visualizations with ggplot2. It covers important R data objects, commands for data manipulation, and the creation of visual representations of data, including categorical and continuous variables. Additionally, it discusses factors in R and their significance in handling categorical data.

Uploaded by

Nia Mamporia

Introduction to Programming and Data Science with R

3. R DataFrames
Visualizing Results

Knirsch

Error Messages are Your Friends!

• If R returns an error you know that you need to correct your code.

• If your code runs without error, it is syntactically correct but may nevertheless be meaningless, nonsensical, misleading, or logically wrong.

If your code runs through it does NOT necessarily mean that the code is meaningful or correct.

Re-create your tibble sales_complete with 5 columns and 91 rows.


Make sure that you have the 5 vectors available.


What do the following commands return?

table(cost_cover)

table(sales_categories)

table(sales_rev_after_tax)

sales_rev_after_tax[2,4]

sales_complete[2,4]


Important R Data Objects

Dimension   One Data Type   Multiple Data Types
1           Vector
2                           DataFrame, tibble (dplyr)


DataFrame: Addressing rows, columns and cells


• Reading out a cell:

option 1: sales_complete[row, column]


option 2: sales_complete$sales_categories[row]

• Reading out a column (vector):

sales_complete$sales_categories

• Reading out a row (observation):

In order to read out a row, we need the index (= position). A dataframe is a two-dimensional object, so there are two index positions, the row index and the column index:

sales_complete[row, ] #reads out a whole row

sales_complete[3, ] #reads out row 3
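The addressing options above can be tried out on a small stand-in. This is a minimal sketch: `sales_demo` and its values are hypothetical (the real `sales_complete` has 5 columns and 91 rows), and it uses a base data frame rather than a tibble.

```r
# Hypothetical stand-in for sales_complete (values are made up)
sales_demo <- data.frame(
  day              = 1:3,
  daily_sales      = c(120.5, 80.0, 45.2),
  sales_categories = c("Good", "Fair", "Very Poor"),
  stringsAsFactors = FALSE
)

cell_a <- sales_demo[2, 3]                # option 1: cell by [row, column]
cell_b <- sales_demo$sales_categories[2]  # option 2: column by name, then position
column <- sales_demo$sales_categories     # whole column as a vector
row_3  <- sales_demo[3, ]                 # whole row as a one-row data frame

print(cell_a)
print(row_3)
```

Note that on a tibble, option 1 returns a 1 x 1 tibble instead of a bare value, while option 2 returns the bare value in both cases.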


Commands running against the tibble:

Count frequencies of df sales_complete, column cost_cover:

Count frequencies of df sales_complete, column sales_categories:

Read out row number 5 of df sales_complete


Subsetting / Extracting rows that match a condition – base syntax

We want to read out all (complete) rows with category "Very Poor"

1. Write the command for the condition

2. What is the command to get a row?

3. Write the whole subsetting command
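One possible way to work through the three steps, sketched on a hypothetical four-row stand-in (`sales_demo` and its values are assumptions, not the course data):

```r
# Hypothetical stand-in for sales_complete (values are made up)
sales_demo <- data.frame(
  daily_sales      = c(120.5, 80.0, 45.2, 30.1),
  sales_categories = c("Good", "Fair", "Very Poor", "Very Poor"),
  stringsAsFactors = FALSE
)

# Step 1: the condition yields a logical vector, one TRUE/FALSE per row
is_very_poor <- sales_demo$sales_categories == "Very Poor"

# Steps 2 and 3: a logical vector in the row position of [row, column]
# keeps exactly the rows where the condition is TRUE
very_poor_days <- sales_demo[is_very_poor, ]

print(very_poor_days)
```

The column position is left empty so that all columns of the matching rows are returned.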


Subsetting / Filtering – dplyr function filter()

dplyr::filter() syntax:
output_df <- filter(input_df, condition)

We want to read out all (complete) rows with category "Very Poor"

veryPoorDays <- filter(sales_complete, sales_categories == "Very Poor")


Visualizing with ggplot2


There are many ways and packages to visualize analysis results with R. The package of choice, and the most widely used one, is ggplot2. ggplot2 comes with a coherent syntax: the two g's stand for "grammar of graphics".

One can do almost everything with ggplot2, graphic-wise. Yet one chart type has no dedicated geom: the pie chart (it can only be built indirectly, as a bar chart in polar coordinates with coord_polar()). The reason is that the pie chart is widely regarded as unprofessional, is not recommended, and is very limited in its possible use: a pie can be deciphered only as long as there is a very small number of values or groups. It is basically only meaningful for categorical data with very few categories.

ggplot2 is a package in the tidyverse collection.

Documentation:
ggplot2, CheatSheet (RStudio documentation)
https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-visualization-2.1.pdf
Hadley Wickham, R for Data Science, chapter 3
https://r4ds.had.co.nz/data-visualisation.html


ggplot2

The basic concept of the ggplot2 grammar of graphics is that each diagram consists of 3 mandatory layers:

1. Layer: Data (data = x)
Input is ALWAYS a dataframe / tibble.

2. Layer: Mapping (mapping = aes())
Defines which variables are mapped to the plot (x, y, size, shape, ...)

3. Layer: Geometrical object(s) (geom_...())
Gives the chart its form (bar chart, histogram, boxplot, line, scatterplot, …)

• Additional layers are added with the + symbol.
• There are lots of additional optional layers, like labels, text, title, scales, ...
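The layer structure can be sketched as follows. The data frame `sales_demo`, its values, and the title text are hypothetical placeholders; only the ggplot2 functions are the real ones named above.

```r
library(ggplot2)

# Hypothetical stand-in data (made up for illustration)
sales_demo <- data.frame(
  sales_categories = c("Good", "Fair", "Fair", "Poor", "Good", "Fair")
)

p <- ggplot(data = sales_demo) +            # layer 1: data (a data frame)
  aes(x = sales_categories) +               # layer 2: mapping of variables
  geom_bar() +                              # layer 3: geometrical object
  labs(title = "Demo: days per category")   # optional additional layer

print(p)
```

Each `+` adds one layer to the plot object, so a plot can be built up and inspected step by step.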


Visualizing a Categorical Variable

Chart the absolute frequency of sales categories.

ggplot(data= .................................) +
aes(x= .........................) +
geom_bar()

Add a bit of color:

ggplot(data= sales_complete) + aes(x= sales_categories) + geom_bar(color= "black",fill=terrain.colors(5))

Which of the following two commands is correct and which is not and why?

ggplot(data= sales_complete) + aes(x= sales_categories) + geom_bar(color="black",fill=rainbow(10))


ggplot(data= sales_complete) + aes(x= sales_categories) + geom_bar(color= "black",fill=rainbow(1))


Visualizing a Categorical Variable

Give the chart a title and label the axes with the labs() layer:

+ labs(title = "Revenue Categories", x = "Revenue Categories", y = "Number of Days")

Remove the grid from the chart:

+ theme_classic()


What is counter-intuitive and not user-friendly in the plot?


Factors
The sales_categories are categorical, ordinal data.

→ We want them to be sorted according to their ranking,
→ not alphabetically.

R uses the data type factor for categorical data; ordinal data is represented by an ordered factor.

Technically, factors are vectors of type integer: R stores integers and assigns the categorical values to these integers.
factor(x, levels = ....., ordered = TRUE) #creates an ordered factor with x as input vector and
# the given levels values as categories

We create an additional factor vector of the sales categories:

sales_complete$sales_categories_fac <- factor(sales_complete$sales_categories,
levels = c("Very Good", "Good", "Fair", "Poor", "Very Poor"), ordered = TRUE)
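To see what the levels argument changes, the following sketch compares a default factor with an ordered one. The vector `cats` is a hypothetical sample, not the course data.

```r
# Hypothetical sample of category values (made up)
cats <- c("Poor", "Very Good", "Fair", "Good", "Very Poor", "Fair")
ranking <- c("Very Good", "Good", "Fair", "Poor", "Very Poor")

# Default: levels are sorted alphabetically
default_fac <- factor(cats)
print(levels(default_fac))

# Ordered factor: levels follow the ranking we specify
ordered_fac <- factor(cats, levels = ranking, ordered = TRUE)
print(levels(ordered_fac))

# table() and plots now follow the level order instead of the alphabet
print(table(ordered_fac))
```

With this ranking, comparisons such as `<` follow the level order, so "Very Good" counts as the lowest level and "Very Poor" as the highest.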


Factors
Did this command work?
Do you get NAs in your new vector?

Attention: A factor only accepts values that are predefined in the levels = .... argument. Any other values in the vector are converted to NA.
The intention of a factor is to only work with the predefined values.
→ When defining a factor, make sure there are no typos or other spelling errors in the levels.
→ sales_complete$sales_categories_fac <- factor(sales_complete$sales_categories,
levels = c("Very Good", "Good", "Fair", "Poor", "Very Poor"), ordered = TRUE)
→ Check that each levels value is spelled exactly as in the existing vector.
→ Check that you list all predefined values.
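A minimal sketch of how such NAs arise and how to find them. The vector `cats` is hypothetical and deliberately contains two misspelled values.

```r
# Hypothetical input with two deliberate errors:
# "good" (lower case) and "Poor " (trailing space)
cats <- c("Very Good", "good", "Fair", "Poor ", "Very Poor")
ranking <- c("Very Good", "Good", "Fair", "Poor", "Very Poor")

fac <- factor(cats, levels = ranking, ordered = TRUE)

# Count the NAs produced by values that match no level
n_na <- sum(is.na(fac))
print(n_na)

# List the offending original values
bad_values <- setdiff(unique(cats), ranking)
print(bad_values)
```

Running a check like `setdiff()` before overwriting the original column helps to catch spelling differences while the original values are still available.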


Factors
Once your factor vector is completely correct, you can overwrite the original vector and remove the additional
factor-vector:

#overwrite original vector
sales_complete$sales_categories <- sales_complete$sales_categories_fac
#remove additional factor
sales_complete$sales_categories_fac <- NULL

Check the structure and attributes:


str(sales_complete)
attributes(sales_complete$sales_categories)
is.factor(sales_complete$sales_categories)


Factors
• Command to convert vector to a factor

• Argument to set the order (ranking) of the values

• Command to check the structure


Factors

• Factors always have two attributes:

• levels #sets the ranking of the category values
• class #specifies the vector as a factor

- By default, the levels are assigned to the integer vector in alphabetical or ascending order.
- levels = .........., ordered = TRUE enforces a different order / ranking on factor values.

$levels
[1] "Very Good" "Good" "Fair" "Poor" "Very Poor"


Factors

• Functions needed to work with factors:

• factor() # creates a factor
• attributes() # shows the levels and the class attributes
• typeof() # shows that factors are stored internally as integers
• is.factor() # returns TRUE if the object is a factor, FALSE otherwise
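The four functions can be tried together on a tiny example. The factor `fac` below is a hypothetical three-element vector, used only to show what each call returns.

```r
# Hypothetical ordered factor with two levels
fac <- factor(c("Good", "Fair", "Good"),
              levels = c("Good", "Fair"), ordered = TRUE)

storage <- typeof(fac)      # internal storage type
codes   <- as.integer(fac)  # the stored integers (positions in levels)
attrs   <- attributes(fac)  # levels and class
check   <- is.factor(fac)   # TRUE for factors

print(storage)
print(codes)
print(attrs)
print(check)
```

The integer codes are positions in the levels vector, which is why changing the levels order changes how the categories are sorted.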


Important R Data Objects

Dimension   One Data Type     Multiple Data Types
1           Vector, Factors
2                             DataFrame, tibble (dplyr)

Vectors or factors are the building blocks for data frames: columns of a data frame must all be vectors / factors of equal length.


Plot the bar chart with the sales categories again.


• Plot the boolean variable cost_cover. What is the command?

#plotting boolean cost cover
ggplot

• Plot the daily revenue. What is the command?


#plotting daily revenue
ggplot(data= sales_complete) + aes(x= daily_sales) + geom_bar(color= "green",fill=rainbow(1)) +
labs(title = "Sales Revenue", x= "Sales", y = "Days") + theme_classic()

What is wrong with the second plot?


Continuous (Numerical) Data

What command returns the basic descriptive statistics results?

How to interpret median > mean?
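As a small illustration of both questions, the sketch below uses `summary()` on a made-up vector. `daily_sales_demo` is hypothetical and was chosen so that one very low value pulls the mean below the median.

```r
# Hypothetical daily revenues with one unusually poor day (made up)
daily_sales_demo <- c(20, 150, 160, 170, 175, 180, 185, 190, 200)

# Min, 1st quartile, median, mean, 3rd quartile, max in one call
stats <- summary(daily_sales_demo)
print(stats)

med <- median(daily_sales_demo)
avg <- mean(daily_sales_demo)

# median > mean: a few low values drag the mean down,
# which points to a left-skewed distribution
print(med > avg)
```

The median is robust against the single low value, while the mean is not, so the gap between the two hints at the direction of the skew.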


Chart Types for Continuous (Numerical) Data

What chart types make sense to use for continuous (numerical) data like the daily revenue data?


Histogram

We plot a histogram of the variable daily_sales


#plotting a histogram on the variable daily_sales

ggplot(data =..........................) +
aes(x =....................................) +
geom_histogram(bins = ................, color = "blue", fill = "lightblue", alpha = 0.5) +
labs(title = "Daily Revenue", x = "Revenue", y = "Frequency")

Fewer bins: easy overview

More bins: more detailed view


• Histograms look similar to bar charts: the x-axis shows the values and the y-axis their frequency in bars.
• How do we get to the bars? The continuous values are classified into groups with equal intervals.
• The number of groups or their width should be specified when creating a histogram (argument bins = ....... or binwidth = .......); otherwise geom_histogram() falls back to a default of 30 bins and prints a message.
• The bars are displayed directly next to each other with no space between them. This reminds us that there are actually no boundaries between the values.
• Individual values cannot be read from a histogram.
• However, the shape, i.e. symmetry versus skewness, and the modality (the number of peaks or highest points) can be seen well.
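The classification into equal intervals can be reproduced by hand in base R. This sketch uses the built-in `faithful` dataset as a stand-in, since the sales data is not part of these slides; the interval width of 10 is an arbitrary choice.

```r
data(faithful)  # built-in dataset: eruptions and waiting times

# Equal-width interval boundaries that cover the whole range of the data
lower  <- floor(min(faithful$waiting) / 10) * 10
upper  <- ceiling(max(faithful$waiting) / 10) * 10
breaks <- seq(lower, upper, by = 10)

# Assign every value to its interval, then count values per interval
bins   <- cut(faithful$waiting, breaks = breaks, include.lowest = TRUE)
counts <- table(bins)

print(counts)  # these counts are the bar heights of a histogram
```

Changing the `by` value changes the number of intervals, which has the same effect as changing `bins` or `binwidth` in a histogram.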


Summary


Data Frame Example: Old Faithful

Old Faithful is a geyser in Yellowstone National Park, Wyoming, USA. It is a tourist attraction and geographic phenomenon. It has erupted every 44 to 125 minutes since 2000, spewing 3,700 to 8,400 US gallons of boiling water to a height of 106 to 185 feet. The average height of an eruption is 145 feet.
R ships with an Old Faithful data frame (in the built-in datasets package, so it is available in RStudio without installing anything).
You can call the dataset with the command:

data(faithful)
faithful
The R dataset "faithful" contains 272 observations of geyser eruptions during October 1980.
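The dataset is a convenient stand-in for practicing the histogram template from the earlier slide. In this sketch, `make_hist()` is a hypothetical helper written for illustration, and the bin counts 8 and 40 are arbitrary choices for a coarse and a detailed view.

```r
library(ggplot2)

data(faithful)
str(faithful)  # 272 observations of 2 numeric variables: eruptions, waiting

# Hypothetical helper: same histogram, different number of bins
make_hist <- function(n_bins) {
  ggplot(data = faithful) +
    aes(x = waiting) +
    geom_histogram(bins = n_bins, color = "blue", fill = "lightblue", alpha = 0.5) +
    labs(title = "Old Faithful: waiting time",
         x = "Waiting time (minutes)", y = "Frequency")
}

p_few  <- make_hist(8)   # fewer bins: easy overview
p_many <- make_hist(40)  # more bins: more detailed view

print(p_few)
print(p_many)
```

Comparing the two plots side by side shows how the choice of bins affects what can be seen of the shape and the number of peaks.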
