Reshaping Data with tidyr in R

Learn R online at www.DataCamp.com

> Content

Definitions

The majority of data analysis in R is performed in data frames. These are rectangular datasets consisting of rows and columns.

An observation contains all the values or variables related to a single instance of the objects being analyzed. For example, in a dataset of movies, each movie would be an observation.

A variable is an attribute for the object, across all the observations. For example, the release dates for all the movies.

Tidy data provides a standard way to organize data. Having a consistent shape for datasets enables you to worry less about data structures and focus more on getting useful results. The principles of tidy data are:

Every column is a variable.
Every row is an observation.
Every cell is a single value.

> Helpful syntax before getting started

Installing and loading tidyr

# Install tidyr through tidyverse
install.packages("tidyverse")

# Install it directly
install.packages("tidyr")

# Load tidyr into R
library(tidyr)

The %>% Operator

%>% is a special operator in R found in the magrittr and tidyr packages. %>% lets you pass objects to functions elegantly, and helps you make your code more readable. The following two lines of code are equivalent.

# Without the %>% operator
second_function(first_function(dataset, arg1, arg2), arg3)

# With the %>% operator
dataset %>% first_function(arg1, arg2) %>% second_function(arg3)

> Datasets used throughout this cheat sheet

Throughout this cheat sheet we will use a dataset of the top grossing movies of all time, stored as movies.

title                                 release_year  release_month  release_day  directors                 box_office_busd
Avatar                                2009          12             18           James Cameron             2.922
Avengers: Endgame                     2019          4              22           Anthony Russo, Joe Russo  2.798
Titanic                               1997          11             01           James Cameron             2.202
Star Wars Ep. VII: The Force Awakens  2015          12             14           J.J. Abrams               2.068
Avengers: Infinity War                2018          4              23           Anthony Russo, Joe Russo  2.048

The second dataset involves an experiment with the number of unpopped kernels in bags of popcorn, adapted from the Popcorn dataset in the Stat2Data package.

brand    trial_1  trial_2  trial_3  trial_4  trial_5  trial_6
Orville  26       35       18       14       8        6
Seaway   47       47       14       34       21       37

The third dataset is JSON data about music containing nested elements. The JSON is parsed into nested lists using parse_json() from the jsonlite package.

artist     singles
Bad Bunny  Title          Tracks
           Gato de Noche  Gato de Noche, Ñengo Flow
           La Jumpa       La Jumpa, Arcángel
Drake      Title          Tracks
           Scary Hours 2  What's Next, Wants and Needs, Lemon Pepper Freestyle, NA, Lil Baby, Rick Ross

The fourth dataset is a synthetic dataset containing attributes of people. sex is a character vector, and hair_color is a factor.

sex     hair_color  height_cm  weight_kg
female  brown       166        72
male    blonde      184        NA
female  black       153        NA
male    black       192        93

> Pivoting

# Move side-by-side columns to consecutive rows with pivot_longer()
popcorn_long <- popcorn %>%
  pivot_longer(trial_1:trial_6, names_to = "trial", values_to = "n_unpopped")
# "brand" column contains "Orville" and "Seaway"
# "trial" column contains "trial_1" to "trial_6"
# "n_unpopped" column contains the numbers

# Move values in different rows to columns with pivot_wider()
popcorn_long %>%
  pivot_wider(id_cols = brand, names_from = "trial", values_from = "n_unpopped")
# Same contents and shape as popcorn dataset

> Nesting and unnesting

# Expand nested data frame columns with unnest_longer()
# Vectors inside the nested data are given their own row
# The number of columns remains unchanged
music %>%
  unnest_longer(singles)

artist     singles$title  singles$tracks
Bad Bunny  Gato de Noche  2 Variables
Bad Bunny  La Jumpa       2 Variables
Drake      Scary Hours 2  1 Variable

# Expand nested data frame columns with unnest_wider()
# Top-level elements inside the nested data are given their own column
# The number of rows remains unchanged
music %>%
  unnest_wider(singles)

artist     title                           tracks
Bad Bunny  [["Gato de Noche","La Jumpa"]]  [[[{"title":"Gato de Noche","collaborator":"Ñengo Flow"}],[{"title":"...
Drake      ["Scary Hours 2"]               [[[{"title":"What's Next"},{"title":"Wants and Needs","collaborator"...

# Expand selected nested data frame columns with hoist()
# Replacement for unnest_wider() %>% select()
music %>%
  hoist(singles, single_titles = "title")

artist     single_titles                   singles
Bad Bunny  [["Gato de Noche","La Jumpa"]]  [[{"tracks":[{"title":"Gato de Noche","collaborator":"Ñengo Flow"}...
Drake      ["Scary Hours 2"]               [[{"tracks":[{"title":"What's Next"},{"title":"Wants and Needs","coll...

# Expand nested data frame columns with unnest()
# Every top-level element of the nested data gets its own column in the result
# Vectors inside the nested data are given their own row
music_unnested <- music %>%
  unnest(singles)
# Roughly equivalent to music %>% unnest_longer(singles) %>% unnest_wider(singles)

artist     title          tracks
Bad Bunny  Gato de Noche  [[{"title":"Gato de Noche","collaborator":"Ñengo Flow"}]]
Bad Bunny  La Jumpa       [[{"title":"La Jumpa","collaborator":"Arcángel"}]]
Drake      Scary Hours 2  [[{"title":"What's Next"},{"title":"Wants and Needs","collaborator":"Lil Baby"},{"tit...

# Summarize parts of a data frame as a list of data frames with nest()
music_unnested %>%
  nest(singles = c(title, tracks))

artist     singles
Bad Bunny  [[{"title":"Gato de Noche","tracks":[{"title":"Gato de Noche", "collaborator":"Ñengo Flow"}]},{"title":"La Jumpa","...
Drake      [[{"title":"Scary Hours 2","tracks":[{"title":"What's Next"},{"title":"Wants and Needs","collaborator":"Lil Baby"},{"...

> Uniting and separating columns

# Combine several columns into a single vector column with unite()
movies %>%
  unite(release_date, c(release_year, release_month, release_day), sep = "-")

# Split a single vector column into several columns with separate()
movies %>%
  separate(directors, into = c("director1", "director2"), sep = ",", fill = "right")

# Split a single column into several rows with separate_rows()
movies %>%
  separate_rows(directors, sep = ",")

> Packing and unpacking columns

# Combine several columns into a data frame column with pack()
movies_packed <- movies %>%
  pack(release_date = c(release_year, release_month, release_day))
# The release_date column is a data frame with 5 rows, 3 columns

# Split a single data frame column into several columns with unpack()
movies_packed %>%
  unpack(release_date)
# release_date column replaced with release_year / release_month / release_day columns

> Dealing with missing data

# Drop rows containing any missing values in the specified columns with drop_na()
people %>%
  drop_na(weight_kg)

# Replace missing values with a default value with replace_na()
people %>%
  replace_na(list(weight_kg = 100))

> Creating grids

# Get all combinations of input values with expand_grid()
expand_grid(
  sex = c("male", "female", "female"),
  hair_color = c("red", "brown", "blonde", "black", "red")
)
# 2 column data frame with rows like "male", "red".

# Get all combinations of input values, deduplicating and sorting with crossing()
crossing(
  sex = c("male", "female", "female"),
  hair_color = c("red", "brown", "blonde", "black", "red")
)
# Same as expand_grid() but "red" rows only appear once and order is alphabetical

# Get all combinations of values in data frame columns with expand()
# All factor levels included, even if they don't appear in data
people %>%
  expand(sex, hair_color)
# Equivalent to expand_grid(unique(people$sex), levels(people$hair_color))

# Get all combinations of values that exist in data frame columns with expand() + nesting()
people %>%
  expand(nesting(sex, hair_color))
# As previous, but filtered to rows that exist in people dataset

# Expand the data frame, then full join to itself with complete()
people %>%
  complete(sex, hair_color)
# Same output as expand, with additional height_cm and weight_kg columns

# Fill in sequence of numeric or datetime columns with expand() + full_seq()
people %>%
  expand(height_cm_expanded = full_seq(height_cm, 1))
# 1 column data frame with height_cm_expanded values
# from min height_cm to max height_cm in steps of 1
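
The pivot_longer() and pivot_wider() calls in this cheat sheet can be tried end to end with the short sketch below. It rebuilds the popcorn table by hand from the values printed in the cheat sheet (the real data comes from the Stat2Data package, so the tibble here is only a stand-in), pivots it to long form, and pivots it back. The name popcorn_wide is introduced purely for illustration.

```r
library(tidyr)

# Stand-in for the popcorn dataset, typed in from the table in this cheat sheet
popcorn <- tibble::tibble(
  brand   = c("Orville", "Seaway"),
  trial_1 = c(26, 47),
  trial_2 = c(35, 47),
  trial_3 = c(18, 14),
  trial_4 = c(14, 34),
  trial_5 = c(8, 21),
  trial_6 = c(6, 37)
)

# Wide to long: one row per brand/trial pair
popcorn_long <- popcorn %>%
  pivot_longer(trial_1:trial_6, names_to = "trial", values_to = "n_unpopped")

dim(popcorn_long)  # 12 rows (2 brands x 6 trials) and 3 columns

# Long back to wide: one column per trial again
# (popcorn_wide is a hypothetical name used only in this sketch)
popcorn_wide <- popcorn_long %>%
  pivot_wider(id_cols = brand, names_from = "trial", values_from = "n_unpopped")

# Compare the round-tripped table with the starting table
all.equal(as.data.frame(popcorn_wide), as.data.frame(popcorn))
```

Naming id_cols explicitly, rather than passing brand positionally, is a defensive choice here: it should behave the same way across tidyr versions whose pivot_wider() argument order differs.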

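
The missing-data and grid functions are easier to compare when run on the same small table. The sketch below rebuilds the people dataset from the values printed in this cheat sheet, under the assumption that the two blank weight cells are NA and that hair_color has exactly the three levels that appear in the table. The result names (people_dropped and so on) are hypothetical and exist only so the row counts can be inspected.

```r
library(tidyr)

# Stand-in for the people dataset; blank weights are assumed to be NA
people <- tibble::tibble(
  sex        = c("female", "male", "female", "male"),
  hair_color = factor(c("brown", "blonde", "black", "black")),
  height_cm  = c(166, 184, 153, 192),
  weight_kg  = c(72, NA, NA, 93)
)

# drop_na(): keep only rows where weight_kg is known
people_dropped <- people %>% drop_na(weight_kg)
nrow(people_dropped)  # 2

# replace_na(): fill the missing weights with a default of 100
people_filled <- people %>% replace_na(list(weight_kg = 100))
people_filled$weight_kg  # 72 100 100 93

# expand(): every sex combined with every hair_color level (2 x 3)
all_combos <- people %>% expand(sex, hair_color)
nrow(all_combos)  # 6

# expand() + nesting(): only combinations that occur in the data
seen_combos <- people %>% expand(nesting(sex, hair_color))
nrow(seen_combos)  # 4

# complete(): the 6 combinations, with height_cm and weight_kg kept
# (NA for the 2 combinations that do not occur in people)
people_completed <- people %>% complete(sex, hair_color)
dim(people_completed)  # 6 rows, 4 columns
```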