Data Wrangling Cheatsheet PDF

Uploaded by

sreedhar
Data Wrangling with dplyr and tidyr
Cheat Sheet

Tidy Data - A foundation for wrangling in R

In a tidy data set:
Each variable is saved in its own column & Each observation is saved in its own row

Tidy data complements R's vectorized operations. R will automatically preserve observations as you manipulate variables. No other format works as intuitively with R.

[Figure: a table with columns F, M, and A, with M * A computed row by row]
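The point about vectorized operations can be tried on a tiny table. This is a minimal sketch in base R; the column names M, A, and F are borrowed from the sheet's diagram, while the values and the reading of the diagram as F = M * A are assumptions made for illustration.

```r
# Hypothetical tidy table: one row per observation, one column per variable.
obs <- data.frame(M = c(2, 3, 5), A = c(10, 4, 1))

# Vectorized arithmetic works row by row, so every product stays
# attached to the observation it was computed from.
obs$F <- obs$M * obs$A

print(obs)
```

Because each observation is a row, no explicit loop or index bookkeeping is needed to keep the new values aligned with the old ones.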

Syntax - Helpful conventions for wrangling

dplyr::tbl_df(iris)
Converts data to tbl class. tbls are easier to examine than data frames. R displays only the data that fits onscreen:

Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length
1           5.1         3.5          1.4
2           4.9         3.0          1.4
3           4.7         3.2          1.3
4           4.6         3.1          1.5
5           5.0         3.6          1.4
..          ...         ...          ...
Variables not shown: Petal.Width (dbl), Species (fctr)

Reshaping Data - Change the layout of a data set

tidyr::gather(cases, "year", "n", 2:4)
Gather columns into rows.

tidyr::spread(pollution, size, amount)
Spread rows into columns.

tidyr::separate(storms, date, c("y", "m", "d"))
Separate one column into several.

tidyr::unite(data, col, ..., sep)
Unite several columns into one.

dplyr::data_frame(a = 1:3, b = 4:6)
Combine vectors into data frame (optimized).

dplyr::arrange(mtcars, mpg)
Order rows by values of a column (low to high).

dplyr::arrange(mtcars, desc(mpg))
Order rows by values of a column (high to low).

dplyr::rename(tb, y = year)
Rename the columns of a data frame.
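The gather and spread calls listed under Reshaping Data are inverses of each other, which a small round trip makes easier to see. This is a sketch only: the `cases` table below is a made-up stand-in with two year columns, not the EDAWR data set, and it assumes an installed tidyr that still provides `gather()` and `spread()` (newer tidyr releases recommend `pivot_longer()` and `pivot_wider()` instead).

```r
library(tidyr)

# Hypothetical wide table (not the EDAWR data): one column per year.
cases <- data.frame(
  country = c("FR", "DE"),
  `2011` = c(7000, 5800),
  `2012` = c(6900, 6000),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Gather columns 2 and 3 into key/value pairs:
# the result has one row per country-year combination.
long <- gather(cases, "year", "n", 2:3)

# Spread reverses the step: one column per distinct value of year.
wide <- spread(long, year, n)

print(long)
print(wide)
```

The long form is the tidy one here, since year is a variable and belongs in its own column.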

Syntax - Helpful conventions for wrangling (continued)

dplyr::glimpse(iris)
Information dense summary of tbl data.

utils::View(iris)
View data set in spreadsheet-like display (note capital V).

dplyr::%>%
Passes object on left hand side as first argument (or . argument) of function on right hand side.

x %>% f(y) is the same as f(x, y)
y %>% f(x, ., z) is the same as f(x, y, z)

"Piping" with %>% makes code more readable, e.g.

iris %>%
  group_by(Species) %>%
  summarise(avg = mean(Sepal.Width)) %>%
  arrange(avg)

Subset Observations (Rows)

dplyr::filter(iris, Sepal.Length > 7)
Extract rows that meet logical criteria.

dplyr::distinct(iris)
Remove duplicate rows.

dplyr::sample_frac(iris, 0.5, replace = TRUE)
Randomly select fraction of rows.

dplyr::sample_n(iris, 10, replace = TRUE)
Randomly select n rows.

dplyr::slice(iris, 10:15)
Select rows by position.

dplyr::top_n(storms, 2, date)
Select and order top n entries (by group if grouped data).

Logic in R - ?Comparison, ?base::Logic

<    Less than                   !=                   Not equal to
>    Greater than                %in%                 Group membership
==   Equal to                    is.na                Is NA
<=   Less than or equal to       !is.na               Is not NA
>=   Greater than or equal to    &,|,!,xor,any,all    Boolean operators

Subset Variables (Columns)

dplyr::select(iris, Sepal.Width, Petal.Length, Species)
Select columns by name or helper function.

Helper functions for select - ?select

select(iris, contains("."))
Select columns whose name contains a character string.

select(iris, ends_with("Length"))
Select columns whose name ends with a character string.

select(iris, everything())
Select every column.

select(iris, matches(".t."))
Select columns whose name matches a regular expression.

select(iris, num_range("x", 1:5))
Select columns named x1, x2, x3, x4, x5.

select(iris, one_of(c("Species", "Genus")))
Select columns whose names are in a group of names.

select(iris, starts_with("Sepal"))
Select columns whose name starts with a character string.

select(iris, Sepal.Length:Petal.Width)
Select all columns between Sepal.Length and Petal.Width (inclusive).

select(iris, -Species)
Select all columns except Species.
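Row and column subsetting are usually chained, so it may help to see filter, a comparison operator, %in%, and a select helper used together. The following is a sketch under the assumption that dplyr is installed; the species filter and the variable names `big` and `nested` are illustrative choices, not part of the sheet.

```r
library(dplyr)

# Piped form: keep long-sepal rows of two species, then keep only
# the Sepal columns plus Species.
big <- iris %>%
  filter(Sepal.Length > 7, Species %in% c("virginica", "versicolor")) %>%
  select(starts_with("Sepal"), Species)

# The same steps written as nested calls, to show what the pipe expands to.
nested <- select(
  filter(iris, Sepal.Length > 7, Species %in% c("virginica", "versicolor")),
  starts_with("Sepal"), Species
)

print(head(big))
```

Both forms give the same result; the piped version reads in the order the steps are applied.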
RStudio is a trademark of RStudio, Inc. CC BY RStudio info@rstudio.com 844-448-1212 rstudio.com devtools::install_github("rstudio/EDAWR") for data sets Learn more with browseVignettes(package = c("dplyr", "tidyr")) dplyr 0.4.0 tidyr 0.2.0 Updated: 1/15
Summarise Data

dplyr::summarise(iris, avg = mean(Sepal.Length))
Summarise data into single row of values.

dplyr::summarise_each(iris, funs(mean))
Apply summary function to each column.

dplyr::count(iris, Species, wt = Sepal.Length)
Count number of rows with each unique value of variable (with or without weights).

Summarise uses summary functions: functions that take a vector of values and return a single value.

Make New Variables

dplyr::mutate(iris, sepal = Sepal.Length + Sepal.Width)
Compute and append one or more new columns.

dplyr::mutate_each(iris, funs(min_rank))
Apply window function to each column.

dplyr::transmute(iris, sepal = Sepal.Length + Sepal.Width)
Compute one or more new columns. Drop original columns.

Mutate uses window functions: functions that take a vector of values and return another vector of values.

Combine Data Sets

a            b
x1  x2       x1  x3
A   1        A   T
B   2        B   F
C   3        D   T

Mutating Joins

dplyr::left_join(a, b, by = "x1")
Join matching rows from b to a.

x1  x2  x3
A   1   T
B   2   F
C   3   NA

dplyr::right_join(a, b, by = "x1")
Join matching rows from a to b.

x1  x3  x2
A   T   1
B   F   2
D   T   NA

dplyr::inner_join(a, b, by = "x1")
Join data. Retain only rows in both sets.

x1  x2  x3
A   1   T
B   2   F

dplyr::full_join(a, b, by = "x1")
Join data. Retain all values, all rows.

x1  x2  x3
A   1   T
B   2   F
C   3   NA
D   NA  T
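The mutating joins can be reproduced with the two small tables a and b from the sheet. This is a sketch assuming dplyr is installed; the T and F entries in x3 are read as logical TRUE and FALSE, which is an assumption about the original diagram.

```r
library(dplyr)

# Tables a and b as drawn on the sheet (T/F taken to mean TRUE/FALSE).
a <- data.frame(x1 = c("A", "B", "C"), x2 = 1:3, stringsAsFactors = FALSE)
b <- data.frame(x1 = c("A", "B", "D"), x3 = c(TRUE, FALSE, TRUE),
                stringsAsFactors = FALSE)

left  <- left_join(a, b, by = "x1")   # every row of a; x3 is NA where b has no match
inner <- inner_join(a, b, by = "x1")  # only keys present in both tables
full  <- full_join(a, b, by = "x1")   # every key from either table

print(left)
print(inner)
print(full)
```

Comparing the row counts of the three results is a quick way to check which join a pipeline actually needs.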

Filtering Joins

dplyr::semi_join(a, b, by = "x1")
All rows in a that have a match in b.

x1  x2
A   1
B   2

dplyr::anti_join(a, b, by = "x1")
All rows in a that do not have a match in b.

x1  x2
C   3

y            z
x1  x2       x1  x2
A   1        B   2
B   2        C   3
C   3        D   4

Set Operations

dplyr::intersect(y, z)
Rows that appear in both y and z.

x1  x2
B   2
C   3

dplyr::union(y, z)
Rows that appear in either or both y and z.

x1  x2
A   1
B   2
C   3
D   4

dplyr::setdiff(y, z)
Rows that appear in y but not z.

x1  x2
A   1

Binding

dplyr::bind_rows(y, z)
Append z to y as new rows.

x1  x2
A   1
B   2
C   3
B   2
C   3
D   4

dplyr::bind_cols(y, z)
Append z to y as new columns. Caution: matches rows by position.

x1  x2  x1  x2
A   1   B   2
B   2   C   3
C   3   D   4

Summary functions

dplyr::first         First value of a vector.
dplyr::last          Last value of a vector.
dplyr::nth           Nth value of a vector.
dplyr::n             # of values in a vector.
dplyr::n_distinct    # of distinct values in a vector.
IQR                  IQR of a vector.
min                  Minimum value in a vector.
max                  Maximum value in a vector.
mean                 Mean value of a vector.
median               Median value of a vector.
var                  Variance of a vector.
sd                   Standard deviation of a vector.

Window functions

dplyr::lead            Copy with values shifted by 1.
dplyr::lag             Copy with values lagged by 1.
dplyr::dense_rank      Ranks with no gaps.
dplyr::min_rank        Ranks. Ties get min rank.
dplyr::percent_rank    Ranks rescaled to [0, 1].
dplyr::row_number      Ranks. Ties go to first value.
dplyr::ntile           Bin vector into n buckets.
dplyr::between         Are values between a and b?
dplyr::cume_dist       Cumulative distribution.
dplyr::cumall          Cumulative all
dplyr::cumany          Cumulative any
dplyr::cummean         Cumulative mean
cumsum                 Cumulative sum
cummax                 Cumulative max
cummin                 Cumulative min
cumprod                Cumulative prod
pmax                   Element-wise max
pmin                   Element-wise min

Group Data

dplyr::group_by(iris, Species)
Group data into rows with the same value of Species.

dplyr::ungroup(iris)
Remove grouping information from data frame.

iris %>% group_by(Species) %>% summarise()
Compute separate summary row for each group.

iris %>% group_by(Species) %>% mutate()
Compute new variables by group.
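The difference between summary and window functions shows up most clearly on grouped data. Here is a minimal sketch, assuming dplyr is installed; the choice of `mean()` and `min_rank()` on `Sepal.Width`, and the names `by_species` and `ranked`, are illustrative.

```r
library(dplyr)

# Summary function: mean() collapses each group to a single value,
# so the result has one row per species.
by_species <- iris %>%
  group_by(Species) %>%
  summarise(avg = mean(Sepal.Width))

# Window function: min_rank() returns one value per input row,
# ranked within each species, so the row count is unchanged.
ranked <- iris %>%
  group_by(Species) %>%
  mutate(rank = min_rank(Sepal.Width)) %>%
  ungroup()

print(by_species)
print(head(ranked))
```

Calling `ungroup()` at the end keeps later steps from silently operating per group.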