3.DataFrames.GGPlot2

The document provides an introduction to programming and data science using R, focusing on key concepts such as error messages, data frames, and visualizations with ggplot2. It covers important R data objects, commands for data manipulation, and the creation of visual representations of data, including categorical and continuous variables. Additionally, it discusses factors in R and their significance in handling categorical data.

Uploaded by

Nia Mamporia

Introduction to Programming and Data Science with R

3. R DataFrames
Visualizing Results

Knirsch

Error Messages are Your Friends!

• If R returns an error you know that you need to correct your code.

• If your code runs without an error, it is syntactically correct, but it may still be meaningless,
nonsensical, misleading or logically wrong.

Code that runs does NOT necessarily produce a meaningful or correct result.

Re-create your tibble sales_complete with 5 columns and 91 rows.

Make sure that you have the 5 vectors available.


What do the following commands return?

table(cost_cover)

table(sales_categories)

table(sales_rev_after_tax)

sales_rev_after_tax[2,4]

sales_complete[2,4]
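The behaviour of these commands can be tried on a small stand-in table. The following is only a sketch: `toy_sales`, its three rows, its values and the column `day` are invented for illustration, and the real `sales_complete` may have different columns in a different order.

```r
library(tibble)

# Hypothetical stand-in data: three days instead of the 91 rows used in class
daily_sales         <- c(120, 480, 950)
sales_rev_after_tax <- daily_sales * 0.8   # assumed tax rate, for illustration only
sales_categories    <- c("Poor", "Fair", "Very Good")
cost_cover          <- daily_sales > 400

toy_sales <- tibble(
  day                 = 1:3,
  daily_sales         = daily_sales,
  sales_rev_after_tax = sales_rev_after_tax,
  sales_categories    = sales_categories,
  cost_cover          = cost_cover
)

table(cost_cover)         # counts of FALSE and TRUE
table(sales_categories)   # counts per category, listed alphabetically

# A vector has only one dimension, so two indices raise an error;
# tryCatch() stores the error message instead of stopping the script
vector_error <- tryCatch(sales_rev_after_tax[2, 4],
                         error = function(e) conditionMessage(e))

# A tibble has two dimensions: row 2, column 4
cell <- toy_sales[2, 4]
```

On a tibble, `[row, column]` returns a 1 x 1 tibble rather than a bare value, while the same index on a plain vector is an error.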


Important R Data Objects

Dimension   One Data Type   Multiple Data Types

1           Vector

2                           DataFrame, tibble (dplyr)


DataFrame: Addressing rows, columns and cells

• Reading out a cell:

option 1: sales_complete[row, column]

option 2: sales_complete$sales_categories[row]

• Reading out a column (vector):

sales_complete$sales_categories

• Reading out a row (observation)

In order to read out a row, we need the index (= position). A dataframe is a two-dimensional object
→ there are two index positions: row-index and column-index:

sales_complete[row, ] #reads out a whole row

sales_complete[3, ] #reads out row 3
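The addressing options above can be compared side by side. This is a minimal sketch on a hypothetical three-row tibble `toy_sales` (invented values), not on the course data:

```r
library(tibble)

# Hypothetical stand-in for sales_complete (values invented for illustration)
toy_sales <- tibble(
  day              = 1:3,
  daily_sales      = c(120, 480, 950),
  sales_categories = c("Poor", "Fair", "Very Good")
)

cell_by_index <- toy_sales[2, 3]                 # option 1: 1 x 1 tibble holding "Fair"
cell_by_name  <- toy_sales$sales_categories[2]   # option 2: plain character value "Fair"
one_column    <- toy_sales$sales_categories      # a column as a vector of length 3
one_row       <- toy_sales[3, ]                  # row 3 with all columns (1 x 3 tibble)
```

Option 2 is often more convenient when the single value is needed for a further calculation, because it returns the value itself and not a table.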


Commands running against the tibble:

Count frequencies of df sales_complete, column cost_cover:

Count frequencies of df sales_complete, column sales_categories:

Read out row Nr. 5 of df sales_complete


Subsetting / Extracting rows that match a condition – base syntax

We want to read out all (complete) rows with category "Very Poor"

1. Write the command for the condition

2. What is the command to get a row?

3. Write the whole subsetting command
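One possible way to work through the three steps in base R is sketched below. It runs on a hypothetical five-row data frame `toy_sales` with invented values; the variable names `is_very_poor` and `very_poor_rows` are also just illustrative.

```r
# Hypothetical stand-in for sales_complete (values invented for illustration)
toy_sales <- data.frame(
  day              = 1:5,
  sales_categories = c("Poor", "Very Poor", "Good", "Very Poor", "Fair")
)

# Step 1: the condition returns one TRUE / FALSE per row
is_very_poor <- toy_sales$sales_categories == "Very Poor"

# Step 2: a row is addressed with df[row, ], leaving the column index empty
# Step 3: the logical vector from step 1 is used as the row index
very_poor_rows <- toy_sales[is_very_poor, ]
```

The same result can be written in one line by placing the condition directly inside the square brackets.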


Subsetting / Filtering – dplyr function filter()

dplyr::filter() syntax:
output_df <- filter(input_df, condition)

We want to read out all (complete) rows with category "Very Poor"

veryPoorDays <- filter(sales_complete, sales_categories == "Very Poor")
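A small sketch of how filter() compares with base subsetting, assuming a hypothetical data frame `toy_sales` that contains one missing category (the values are invented). The difference shows up only when the condition evaluates to NA:

```r
library(dplyr)

# Hypothetical stand-in data with one missing category
toy_sales <- data.frame(
  day              = 1:5,
  sales_categories = c("Poor", "Very Poor", NA, "Very Poor", "Fair")
)

# dplyr: keeps only rows where the condition is TRUE (NA rows are dropped)
very_poor_filter <- filter(toy_sales, sales_categories == "Very Poor")

# base R: an NA in the condition produces an extra row filled with NA
very_poor_base <- toy_sales[toy_sales$sales_categories == "Very Poor", ]

nrow(very_poor_filter)   # 2
nrow(very_poor_base)     # 3
```

If the column has no missing values, both approaches return the same rows.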


Visualizing with ggplot2

There are many ways and packages to visualize analysis results with R. The package of choice and most used
package is ggplot2. ggplot2 comes with a coherent syntax: the two g's stand for "grammar of graphics".

Graphic-wise, one can do almost everything with ggplot2. Yet one chart type has no geom of its own: the pie chart
(it can only be built indirectly, as a stacked bar chart drawn in polar coordinates with coord_polar()).
The reason is that the pie chart is widely regarded as unprofessional, is not recommended and is very
limited in its possible use: a pie can be deciphered only as long as there is a very limited number of values or
groups. It is basically only meaningful for categorical data with very few categories.

ggplot2 is a package in the tidyverse collection.

Documentation:
ggplot2, CheatSheet (RStudio documentation)
https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-visualization-2.1.pdf
Hadley Wickham, R for Data Science, chapter 3
https://r4ds.had.co.nz/data-visualisation.html


ggplot2

The basic concept of the ggplot2 grammar of graphics is that each diagram consists of 3 mandatory layers:

1. Layer: Data (data = x)

Input is ALWAYS a dataframe / tibble.

2. Layer: Mapping mapping = aes()

Defines which variables are mapped to the plot (x, y, size, shape, ...)

3. Layer: geometrical object(s) geom_..........()

Gives the chart its form (bar chart, histogram, boxplot, line, scatterplot, …)

• Additional layers are added by the + symbol

• There are lots of additional optional layers, like labels, text, title, scales.....
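The three mandatory layers and one optional layer can be seen together in a short sketch. The data frame `toy_sales` and its values are invented for illustration; only the layer structure is the point here.

```r
library(ggplot2)

# Hypothetical stand-in data (values invented for illustration)
toy_sales <- data.frame(
  sales_categories = c("Poor", "Fair", "Fair", "Good", "Very Good", "Good", "Fair")
)

p <- ggplot(data = toy_sales) +             # layer 1: the data frame
  aes(x = sales_categories) +               # layer 2: the mapping
  geom_bar() +                              # layer 3: the geometrical object
  labs(title = "Toy Revenue Categories")    # optional layer, added with +

print(p)   # draws the bar chart
```

Because every layer is joined with +, a plot can be stored in an object such as `p` and extended later with further layers.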


Visualizing a Categorical Variable

Chart the absolute frequency of sales categories.

ggplot(data= .................................) +
aes(x= .........................) +
geom_bar()

Add a bit of color:

ggplot(data= sales_complete) + aes(x= sales_categories) + geom_bar(color= "black",fill=terrain.colors(5))

Which of the following two commands is correct and which is not and why?

ggplot(data= sales_complete) + aes(x= sales_categories) + geom_bar(color="black",fill=rainbow(10))

ggplot(data= sales_complete) + aes(x= sales_categories) + geom_bar(color= "black",fill=rainbow(1))
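One way to explore this question is to check whether each variant can be built at all. The sketch below assumes a hypothetical data frame `toy_sales` with five distinct categories and a hypothetical helper `builds_ok()`; it relies on the ggplot2 rule that a fixed aesthetic such as fill must have length 1 or the same length as the plotted data (here, five bars).

```r
library(ggplot2)

# Hypothetical stand-in data with five distinct categories
toy_sales <- data.frame(
  sales_categories = c("Very Poor", "Poor", "Fair", "Good", "Very Good", "Fair", "Good")
)

base_plot <- ggplot(data = toy_sales) + aes(x = sales_categories)

# Hypothetical helper: TRUE if the plot can be built without an error
builds_ok <- function(plot) {
  tryCatch({
    ggplot_build(plot)
    TRUE
  }, error = function(e) FALSE)
}

five_colors <- builds_ok(base_plot + geom_bar(color = "black", fill = terrain.colors(5)))
ten_colors  <- builds_ok(base_plot + geom_bar(color = "black", fill = rainbow(10)))
one_color   <- builds_ok(base_plot + geom_bar(color = "black", fill = rainbow(1)))
```

Under these assumptions, five colors for five bars and a single color for all bars should build, while ten colors for five bars should fail.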


Visualizing a Categorical Variable

Give the chart a title and label the axes with the labs() layer:

+ labs(title = "Revenue Categories", x = "Revenue Categories", y = "Number of Days")

Remove the grid from the chart:

+ theme_classic()


What is counter-intuitive and not user-friendly in the plot?


Factors
The sales_categories are categorical, ordinal data.

→ We want them to be sorted according to their ranking,

→ not sorted alphabetically.

R uses the data type factor for categorical data; an ordered factor (ordered = TRUE) represents ordinal data.

Technically, factors are vectors of type integer: R stores integers and assigns the categorical values (levels) to these integers.
factor(x, levels = ....., ordered = TRUE) #creates an ordered factor with x as input vector and
                                          # the given levels values as categories

We create an additional factor vector of the sales categories:

sales_complete$sales_categories_fac <- factor(sales_complete$sales_categories,
    levels = c("Very Good", "Good", "Fair", "Poor", "Very Poor"), ordered = TRUE)


Factors
Did this command work?
Do you get NAs in your new vector?

Attention: A factor only accepts values that are predefined in the levels = .... argument. Any other
value in the vector is converted to NA.
The intention of a factor is to work only with the predefined values.
→ When defining a factor, make sure there are no typos or other spelling differences.
→ sales_complete$sales_categories_fac <- factor(sales_complete$sales_categories,
    levels = c("Very Good", "Good", "Fair", "Poor", "Very Poor"), ordered = TRUE)
→ Check that each levels value is spelled exactly as in the existing vector.
→ Check that you list all predefined values.
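The NA behaviour can be reproduced with a short sketch. The vector `raw_categories` is hypothetical and deliberately contains one misspelled value; `category_levels` and `categories_fac` are illustrative names.

```r
# Hypothetical input vector; "very good" is deliberately misspelled (lower case)
raw_categories <- c("Good", "Very Poor", "Fair", "very good", "Poor")

category_levels <- c("Very Good", "Good", "Fair", "Poor", "Very Poor")
categories_fac  <- factor(raw_categories, levels = category_levels, ordered = TRUE)

# Number of values that did not match any level and became NA
na_count <- sum(is.na(categories_fac))

# Spellings in the input that are not among the levels
unmatched <- setdiff(raw_categories, category_levels)
```

Running a check like `setdiff()` before overwriting the original column helps to find the exact spelling that caused an NA.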


Factors
Once your factor vector is completely correct, you can overwrite the original vector and remove the additional
factor-vector:

#overwrite original vector

sales_complete$sales_categories <- sales_complete$sales_categories_fac
#remove additional factor
sales_complete$sales_categories_fac <- NULL

Check the structure and attributes:

str(sales_complete)
attributes(sales_complete$sales_categories)
is.factor(sales_complete$sales_categories)


Factors
• Command to convert vector to a factor

• Argument to set the order (ranking) of the values

• command to check the structure


Factors

• Factors always have two attributes:

• levels #sets the ranking of the category values

• class #specifies the vector as a factor

- By default, the levels are assigned to the integer vector in alphabetical or ascending order.
- levels = .........., ordered = TRUE enforces a different order / ranking on the factor values.

$levels
[1] "Very Good" "Good" "Fair" "Poor" "Very Poor"


Factors

• Functions needed to work with factors:

• factor() # creates a factor

• attributes() #shows the levels and the class attributes
• typeof() # shows that factors are stored internally as integers
• is.factor() # returns TRUE / FALSE if object is a factor or not
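The functions listed above can be tried on a small ordered factor. This is a sketch with a hypothetical four-element vector, not the course data:

```r
# Hypothetical ordered factor (values invented for illustration)
categories_fac <- factor(
  c("Good", "Fair", "Very Good", "Good"),
  levels  = c("Very Good", "Good", "Fair", "Poor", "Very Poor"),
  ordered = TRUE
)

attributes(categories_fac)   # shows $levels and $class
typeof(categories_fac)       # "integer"
is.factor(categories_fac)    # TRUE

# The stored integers are the positions of the values in the levels vector
stored_codes <- as.integer(categories_fac)   # 2 3 1 2

# Ordered factors can be compared: "Good" comes before "Fair" in the levels
good_before_fair <- categories_fac[1] < categories_fac[2]   # TRUE
```

Note that `<` follows the order given in levels, so with this level order "Very Good" counts as the smallest value.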


Important R Data Objects

Dimension   One Data Type      Multiple Data Types

1           Vector, Factors

2                              DataFrame, tibble (dplyr)

Vectors or factors are the building blocks for data frames: columns of a data frame must all be
vectors / factors of equal length.


Plot the bar chart with the sales categories again.


• Plot the boolean variable cost_cover. What is the command?

• #plotting boolean cost cover
ggplot

• Plot the daily revenue. What is the command?

#plotting daily revenue
ggplot(data= sales_complete) + aes(x= daily_sales) + geom_bar(color= "green",fill=rainbow(1)) +
labs(title = "Sales Revenue", x= "Sales", y = "Days") + theme_classic()

What is wrong with the second plot?


Continuous (Numerical) Data

What command returns the basic descriptive statistics results?

How to interpret median > mean?
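As a hint for both questions, the sketch below uses summary() on a hypothetical revenue vector (invented values) in which a few very weak days pull the mean below the median. A median above the mean usually points to a left-skewed distribution, i.e. a tail of unusually low values.

```r
# Hypothetical daily revenues: two very weak days among otherwise good days
toy_revenue <- c(100, 150, 800, 850, 900, 900, 950)

summary(toy_revenue)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.

revenue_median <- median(toy_revenue)   # 850
revenue_mean   <- mean(toy_revenue)     # about 664.3

# TRUE here: the low values drag the mean down, the median is hardly affected
median_above_mean <- revenue_median > revenue_mean
```

The median is robust against extreme values, whereas the mean reacts to every single observation.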


Chart Types for Continuous (Numerical) Data

What chart types make sense to use for continuous (numerical) data like the daily revenue data?


Histogram

We plot a histogram of the variable daily_sales

#plotting a histogram on the variable daily_sales

ggplot(data =..........................) +
aes(x =....................................) +
geom_histogram(bins = ................, color = "blue", fill = "lightblue", alpha = 0.5) +
labs(title = "Daily Revenue", x = "Revenue", y = "Frequency")

Fewer bins: easy overview

More bins: more detailed view


• Histograms look similar to bar charts: the x-axis shows the values
and the y-axis shows their frequency as bars.
• How do we get to the bars? The continuous values are classified
into groups (bins) with equal intervals.
• The number of groups or their width should always be specified when
creating a histogram (argument bins = ....... or binwidth = .......);
otherwise geom_histogram() falls back to 30 bins and prints a message.
• The bars are displayed directly next to each other with no space
between them, as a reminder that there are actually no
boundaries between the values.
• No concrete values can be read from a histogram.
• However, the shape, i.e. symmetry versus skewness, and the
modality (the number of peaks or highest points) can be
seen well.
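The effect of the bins argument can be explored with simulated data. The sketch below assumes 91 hypothetical revenue values drawn from a normal distribution; the numbers have nothing to do with the course data and only serve to compare a coarse and a fine histogram.

```r
library(ggplot2)

# Hypothetical data: 91 simulated daily revenues (illustration only)
set.seed(1)
toy_sales <- data.frame(daily_sales = round(rnorm(91, mean = 500, sd = 120)))

base_plot <- ggplot(data = toy_sales) + aes(x = daily_sales)

# Few bins: easy overview of the overall shape
hist_coarse <- base_plot +
  geom_histogram(bins = 5, color = "blue", fill = "lightblue", alpha = 0.5)

# Many bins: more detailed view
hist_fine <- base_plot +
  geom_histogram(bins = 30, color = "blue", fill = "lightblue", alpha = 0.5)

# Alternative: fix the width of each group instead of the number of groups
hist_width <- base_plot +
  geom_histogram(binwidth = 50, color = "blue", fill = "lightblue", alpha = 0.5)

print(hist_coarse)
print(hist_fine)
```

Whatever the number of bins, the bar heights always add up to the number of observations, here 91.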


Summary


Data Frame Example: Old Faithful

Old Faithful is a geyser in Yellowstone National Park, Wyoming, USA. It is a tourist
attraction and geographic phenomenon. It has erupted every 44 to 125 minutes since
2000, spewing 3,700 to 8,400 US gallons of boiling water to a height of 106 to 185 feet.
The average height of an eruption is 145 feet.
R contains an Old Faithful data frame (it is part of the built-in datasets package,
so it is also available in RStudio).
You can call the dataset with the command:

data(faithful)
faithful
The R dataset “faithful” contains a list of 272 observations of geyser eruptions during
October 1980.
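A first look at the data frame could be sketched as follows; it uses only base R functions and the built-in dataset, and the object names `n_obs` and `column_names` are illustrative.

```r
data(faithful)        # loads the built-in data frame into the workspace

str(faithful)         # structure: 272 observations of 2 numeric variables
head(faithful, 3)     # first three observations

n_obs        <- nrow(faithful)    # number of observations (rows)
column_names <- names(faithful)   # "eruptions" "waiting"

summary(faithful$waiting)         # descriptive statistics of the waiting times
```

The two columns are both numeric, so the histogram techniques from the previous slides apply to either of them.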
