Reshaping Data With TidyR in R
The fourth dataset is a synthetic dataset containing attributes of people. sex is a character vector, and hair_color is a factor.

# Expand nested data frame columns with unnest_longer()
# Every top-level element of the nested data gets its own column in the result
# Vectors inside the nested data are given their own row
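As a rough sketch of what "given their own row" means, the example below runs unnest_longer() on a tiny list column. The music_small tibble and its values are hypothetical and are not the cheat sheet's music dataset.

```r
library(tidyr)

# Hypothetical nested data: one row per artist, with a list column of track titles.
music_small <- tibble(
  artist = c("Artist A", "Artist B"),
  tracks = list(c("Song 1", "Song 2"), "Song 3")
)

# unnest_longer() gives each element of the list column its own row;
# the number of columns stays the same.
music_long <- music_small %>%
  unnest_longer(tracks)

music_long
```

Here the two-element vector for "Artist A" becomes two rows, so music_long has three rows and still two columns.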
# Summarize parts of a data frame as a list of data frames with nest()
music_unnested %>%
  nest(singles = c(title, tracks))

artist singles
Drake [[{"title":"Scary Hours 2","tracks":[{"title":"What's Next"},{"title":"Wants and Needs","collaborator":"Lil Baby"},{"...
[[{"title":"La Jumpa","collaborator":"Arcángel"}]]

# Combine several columns into a single vector column with unite()
movies %>%
  unite(release_date, c(release_year, release_month, release_day), sep = "-")

> Content
Definitions
The majority of data analysis in R is performed in data frames. These are rectangular datasets consisting of rows and columns.
An observation contains all the values related to a single instance of the object being analyzed. For example, in a dataset of movies, each movie would be an observation.
A variable is an attribute for the object, across all the observations. For example, the release dates for all the movies make up one variable.

# Split a single vector column into several columns with separate()
movies %>%
  separate(directors, into = c("director1", "director2"), sep = ",", fill = "right")
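To make the separate() and unite() calls concrete, here is a minimal sketch on a two-row tibble. The films tibble, its titles, and its director names are hypothetical placeholders, not rows from the movies dataset.

```r
library(tidyr)

# Hypothetical two-row stand-in for the movies data; values are illustrative only.
films <- tibble(
  title = c("Film A", "Film B"),
  directors = c("Director One,Director Two", "Director Three"),
  release_year = c(2001, 2010),
  release_month = c(5, 12),
  release_day = c(4, 25)
)

# separate() splits directors on ","; fill = "right" pads the missing
# second director with NA instead of warning.
films_split <- films %>%
  separate(directors, into = c("director1", "director2"), sep = ",", fill = "right")

# unite() pastes the three date parts into one character column
# and drops the original three columns.
films_dated <- films_split %>%
  unite(release_date, c(release_year, release_month, release_day), sep = "-")

films_dated
```

Note that unite() produces a character column such as "2001-5-4"; converting it to a Date would be a separate step.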
Tidy data provides a standard way to organize data. Having a consistent shape for datasets enables you to worry less about data structures and more about getting useful results. The principles of tidy data are:
Every column is a variable.
Every row is an observation.
Every cell is a single value.

# Split a single column into several rows with separate_rows()
movies %>%
  separate_rows(directors, sep = ",")

> Dealing with missing data
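A minimal sketch of the two missing-data helpers, drop_na() and replace_na(), on a small tibble. The people_small tibble and its names and weights are hypothetical; only the weight_kg column name mirrors the cheat sheet.

```r
library(tidyr)

# Hypothetical stand-in for the people data; values are illustrative only.
people_small <- tibble(
  name = c("Ana", "Ben", "Cy"),
  weight_kg = c(62, NA, 80)
)

# drop_na() removes rows where weight_kg is missing.
people_complete <- people_small %>%
  drop_na(weight_kg)

# replace_na() keeps every row and fills missing weight_kg values with 100.
people_filled <- people_small %>%
  replace_na(list(weight_kg = 100))
```

Dropping shrinks the data to two rows, while replacing keeps all three; which one is appropriate depends on the analysis.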
# Drop rows containing any missing values in the specified columns with drop_na()
people %>%
  drop_na(weight_kg)

# Replace missing values in the specified columns with replace_na()
people %>%
  replace_na(list(weight_kg = 100))

> Packing and unpacking columns
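As a sketch of how a packed column comes about, pack() gathers several columns into a single data frame column and unpack() reverses it. The films tibble below is a hypothetical stand-in; it is not the movies or movies_packed data used in this cheat sheet.

```r
library(tidyr)

# Hypothetical two-row data; values are illustrative only.
films <- tibble(
  title = c("Film A", "Film B"),
  release_year = c(2001, 2010),
  release_month = c(5, 12),
  release_day = c(4, 25)
)

# pack() collects the three date columns into one data frame column.
films_packed <- films %>%
  pack(release_date = c(release_year, release_month, release_day))

# unpack() spreads that data frame column back into ordinary columns.
films_unpacked <- films_packed %>%
  unpack(release_date)
```

After pack(), films_packed has two columns (title and release_date), and release_date is itself a data frame with one row per film.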
# The release_date column is a data frame with 5 rows, 3 columns

# Split a single data frame column into several columns with unpack()
movies_packed %>%
  unpack(release_date)
# release_date column replaced with release_year, release_month, release_day columns

# Install tidyr as part of the tidyverse
install.packages("tidyverse")
# Install it directly
install.packages("tidyr")
# Load the package
library(tidyr)

> The %>% Operator
%>% is a special operator in R found in the magrittr and tidyr packages. %>% lets you pass objects to functions elegantly, and helps you make your code more readable. The following two lines of code are equivalent.
second_function(some_function(dataset, arg1, arg2), arg3)
dataset %>% some_function(arg1, arg2) %>% second_function(arg3)

> Pivoting
# Move side-by-side columns to consecutive rows with pivot_longer()
popcorn %>%
  pivot_longer(trial_1:trial_6, names_to = "trial", values_to = "n_unpopped")
# "brand" column contains "Orville" and "Seaway"

# Move values from consecutive rows to side-by-side columns with pivot_wider()
popcorn_long %>%
  pivot_wider(names_from = trial, values_from = n_unpopped)
# Same contents and shape as popcorn dataset

> Creating grids
# Get all combinations of input values with expand_grid()
expand_grid(
  sex = c("male", "female", "female"),
  hair_color = c("red", "brown", "blonde", "black", "red")
)

# Get all combinations of input values, deduplicating and sorting with crossing()
crossing(
  sex = c("male", "female", "female"),
  hair_color = c("red", "brown", "blonde", "black", "red")
)
# Same as expand_grid() but "red" rows only appear once and order is alphabetical
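The difference between expand_grid() and crossing() is easiest to see by counting rows. The sketch below reuses the sex and hair_color input vectors from this cheat sheet; the variable names grid_all and grid_unique are only illustrative.

```r
library(tidyr)

sex <- c("male", "female", "female")
hair_color <- c("red", "brown", "blonde", "black", "red")

# expand_grid() keeps duplicates and input order: 3 * 5 combinations.
grid_all <- expand_grid(sex = sex, hair_color = hair_color)

# crossing() deduplicates and sorts each input first: 2 * 4 combinations.
grid_unique <- crossing(sex = sex, hair_color = hair_color)

nrow(grid_all)     # 15
nrow(grid_unique)  # 8
```

Because crossing() sorts its inputs, the first row of grid_unique pairs "female" with "black", whereas grid_all starts with "male" and "red".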
# Get all combinations of values in data frame columns with expand()
people %>%
  expand(sex, hair_color)
# All factor levels included, even if they don't appear in data
# Equivalent to expand_grid(unique(people$sex), levels(people$hair_color))

# Get all combinations of values that exist in data frame columns with expand() + nesting()
people %>%
  expand(nesting(sex, hair_color))

# Fill in sequence of numeric or datetime columns with expand() + full_seq()
people %>%
  expand(height_cm_expanded = full_seq(height_cm, 1))
# from min height_cm to max height_cm in steps of 1

> Datasets used throughout this cheat sheet
Throughout this cheat sheet we will use a dataset of the top grossing movies of all time, stored as movies.

title                         release_year  release_month  release_day  directors   box_office_busd
Star Wars Ep. VII: The Force  2015          12             14           J.J Abrams  2.068

Another dataset, popcorn, comes from the Popcorn dataset in the Stat2Data package.

brand    trial_1  trial_2  trial_3  trial_4  trial_5  trial_6
Orville  26       35       18       14       8        6

> Nesting and unnesting
# Expand nested data frame columns with unnest_longer()
# Vectors inside the nested data are given their own row
# The number of columns remains unchanged
Bad Bunny Gato de Noche 2 Variables

# Expand nested data frame columns with unnest_wider()
# Vectors inside the nested data are given their own column
# The number of rows remains unchanged
music %>%
  unnest_wider(singles)

artist singles
Bad Bunny Title Tracks

# Expand selected nested data frame columns with hoist()
# Replacement for unnest_wider() %>% select()

Learn R Online at www.DataCamp.com