Required packages to install:
At the end of this session you will be able:
%>%
operator.mutate()
.filter()
and select()
.arrange()
.group_by()
and summarise()
.gather()
(and optionally spread()
).For learning:
For help:
?tidyr
, ?dplyr
We’ll be using a very simple dataset that is built-in to the tidyr package. This dataset contains data on tuberculosis cases in Afghanistan, Brazil, and China from 1999 and 2000.
library(dplyr)
library(tidyr)
names(table1)
#> [1] "country" "year" "cases" "population"
str(table1)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 4 variables:
#> $ country : chr "Afghanistan" "Afghanistan" "Brazil" "Brazil" ...
#> $ year : int 1999 2000 1999 2000 1999 2000
#> $ cases : int 745 2666 37737 80488 212258 213766
#> $ population: int 19987071 20595360 172006362 174504898 1272915272 1280428583
summary(table1)
#> country year cases population
#> Length:6 Min. :1999 Min. : 745 Min. :1.999e+07
#> Class :character 1st Qu.:1999 1st Qu.: 11434 1st Qu.:5.845e+07
#> Mode :character Median :2000 Median : 59112 Median :1.733e+08
#> Mean :2000 Mean : 91277 Mean :4.901e+08
#> 3rd Qu.:2000 3rd Qu.:179316 3rd Qu.:9.983e+08
#> Max. :2000 Max. :213766 Max. :1.280e+09
# a tibble is a modified data.frame, making it slightly easier to use
as_tibble(table1)
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
Create an exercise.R
file (we will have done that in class) and put these code into the file. Then choose one of the below datasets that you will use for only the exercises:
airquality
(recommended for complete beginners)swiss
(recommended for complete beginners)population
storms
starwars
Copy the code below, replacing the ___
with the relevant dataset. This format is used throughout the session.
The dataset is tidy when:
Look at the structure of the toy tidyr datasets table1
, table3
, and table4a
. Which is tidy?
table1
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
table3
#> # A tibble: 6 x 3
#> country year rate
#> * <chr> <int> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
table4a
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
More or less, table1
is tidy while the others are not. That’s because in table1
, each column has specific and unique information and each row represents a single observation from a single location at a single point in time.
Benefits of tidy data and tidy code:
%>%
pipe operator can help clarify complex data processing workflows.%>%
%>%
pipe operator can help clarify complex data processing workflows.# normal R way of nesting functions
head(glimpse(table1))
#> Observations: 6
#> Variables: 4
#> $ country <chr> "Afghanistan", "Afghanistan", "Brazil", "Brazil", "...
#> $ year <int> 1999, 2000, 1999, 2000, 1999, 2000
#> $ cases <int> 745, 2666, 37737, 80488, 212258, 213766
#> $ population <int> 19987071, 20595360, 172006362, 174504898, 127291527...
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# the pipe way of linking functions
table1 %>% glimpse() %>% head()
#> Observations: 6
#> Variables: 4
#> $ country <chr> "Afghanistan", "Afghanistan", "Brazil", "Brazil", "...
#> $ year <int> 1999, 2000, 1999, 2000, 1999, 2000
#> $ cases <int> 745, 2666, 37737, 80488, 212258, 213766
#> $ population <int> 19987071, 20595360, 172006362, 174504898, 127291527...
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
mutate()
The mutate()
function adds or replaces a variable/column in a dataset. To add
# Replace an existing variable
table1 %>%
mutate(population = population / 1000000)
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <dbl>
#> 1 Afghanistan 1999 745 20.0
#> 2 Afghanistan 2000 2666 20.6
#> 3 Brazil 1999 37737 172.
#> 4 Brazil 2000 80488 175.
#> 5 China 1999 212258 1273.
#> 6 China 2000 213766 1280.
# Or create a new variable based on a condition
table1 %>%
mutate(after_2000 = if_else(year >= 2000, "yes", "no"))
#> # A tibble: 6 x 5
#> country year cases population after_2000
#> <chr> <int> <int> <int> <chr>
#> 1 Afghanistan 1999 745 19987071 no
#> 2 Afghanistan 2000 2666 20595360 yes
#> 3 Brazil 1999 37737 172006362 no
#> 4 Brazil 2000 80488 174504898 yes
#> 5 China 1999 212258 1272915272 no
#> 6 China 2000 213766 1280428583 yes
# Create or replace multiple variables by using the ","
table1 %>%
mutate(new_column = "only one value",
population = population / 1000000)
#> # A tibble: 6 x 5
#> country year cases population new_column
#> <chr> <int> <int> <dbl> <chr>
#> 1 Afghanistan 1999 745 20.0 only one value
#> 2 Afghanistan 2000 2666 20.6 only one value
#> 3 Brazil 1999 37737 172. only one value
#> 4 Brazil 2000 80488 175. only one value
#> 5 China 1999 212258 1273. only one value
#> 6 China 2000 213766 1280. only one value
# Create a new variable using existing variables and save a new dataframe
table1_rate <- table1 %>%
mutate(rate = (cases / population) * 100000)
table1_rate
#> # A tibble: 6 x 5
#> country year cases population rate
#> <chr> <int> <int> <int> <dbl>
#> 1 Afghanistan 1999 745 19987071 3.73
#> 2 Afghanistan 2000 2666 20595360 12.9
#> 3 Brazil 1999 37737 172006362 21.9
#> 4 Brazil 2000 80488 174504898 46.1
#> 5 China 1999 212258 1272915272 16.7
#> 6 China 2000 213766 1280428583 16.7
Time: 10 min
# explore the structure of the data
names(___)
# pipe the data into mutate function and:
new_dataset <- ___ %>% # dataset
mutate(
# 1. create a new variable
___ = ___,
# 2. transform/replace an existing variable
___ = ___,
# 3. create a new variable using a conditional
___ = if_else(___, TRUE, FALSE)
)
new_dataset
select()
# select columns/variables by name, without quotes
table1_rate %>%
select(country, year, rate)
#> # A tibble: 6 x 3
#> country year rate
#> <chr> <int> <dbl>
#> 1 Afghanistan 1999 3.73
#> 2 Afghanistan 2000 12.9
#> 3 Brazil 1999 21.9
#> 4 Brazil 2000 46.1
#> 5 China 1999 16.7
#> 6 China 2000 16.7
# to *not* select a variable, us minus (-)
table1_rate %>%
select(-country)
#> # A tibble: 6 x 4
#> year cases population rate
#> <int> <int> <int> <dbl>
#> 1 1999 745 19987071 3.73
#> 2 2000 2666 20595360 12.9
#> 3 1999 37737 172006362 21.9
#> 4 2000 80488 174504898 46.1
#> 5 1999 212258 1272915272 16.7
#> 6 2000 213766 1280428583 16.7
# when you have many variables with similar names, use "matching" functions
table1_rate %>%
select(starts_with("c"), ends_with("e"), contains("pop"))
#> # A tibble: 6 x 4
#> country cases rate population
#> <chr> <int> <dbl> <int>
#> 1 Afghanistan 745 3.73 19987071
#> 2 Afghanistan 2666 12.9 20595360
#> 3 Brazil 37737 21.9 172006362
#> 4 Brazil 80488 46.1 174504898
#> 5 China 212258 16.7 1272915272
#> 6 China 213766 16.7 1280428583
filter()
Warning: Filter using logic… and humans are very bad at logic… Make sure to be very certain that you think your logic is what the code reads as logic… Lots of mistakes can be made at this stage! Especially with complex logic situations.
# when country is equal to
table1_rate %>%
filter(country == "Brazil")
#> # A tibble: 2 x 5
#> country year cases population rate
#> <chr> <int> <int> <int> <dbl>
#> 1 Brazil 1999 37737 172006362 21.9
#> 2 Brazil 2000 80488 174504898 46.1
# when country is *not* equal to
table1_rate %>%
filter(country != "Brazil")
#> # A tibble: 4 x 5
#> country year cases population rate
#> <chr> <int> <int> <int> <dbl>
#> 1 Afghanistan 1999 745 19987071 3.73
#> 2 Afghanistan 2000 2666 20595360 12.9
#> 3 China 1999 212258 1272915272 16.7
#> 4 China 2000 213766 1280428583 16.7
# when year is equal to
table1_rate %>%
filter(year == 1999)
#> # A tibble: 3 x 5
#> country year cases population rate
#> <chr> <int> <int> <int> <dbl>
#> 1 Afghanistan 1999 745 19987071 3.73
#> 2 Brazil 1999 37737 172006362 21.9
#> 3 China 1999 212258 1272915272 16.7
# Or when year is equal to or more than
table1_rate %>%
filter(year >= 1999)
#> # A tibble: 6 x 5
#> country year cases population rate
#> <chr> <int> <int> <int> <dbl>
#> 1 Afghanistan 1999 745 19987071 3.73
#> 2 Afghanistan 2000 2666 20595360 12.9
#> 3 Brazil 1999 37737 172006362 21.9
#> 4 Brazil 2000 80488 174504898 46.1
#> 5 China 1999 212258 1272915272 16.7
#> 6 China 2000 213766 1280428583 16.7
# when year is 1999 *and* country is Brazil
table1_rate %>%
filter(year == 1999 & country == "Brazil")
#> # A tibble: 1 x 5
#> country year cases population rate
#> <chr> <int> <int> <int> <dbl>
#> 1 Brazil 1999 37737 172006362 21.9
# when year is 1999 *or* country is Brazil
table1_rate %>%
filter(year == 1999 | country == "Brazil")
#> # A tibble: 4 x 5
#> country year cases population rate
#> <chr> <int> <int> <int> <dbl>
#> 1 Afghanistan 1999 745 19987071 3.73
#> 2 Brazil 1999 37737 172006362 21.9
#> 3 Brazil 2000 80488 174504898 46.1
#> 4 China 1999 212258 1272915272 16.7
arrange()
# ascending order by rate
table1_rate %>%
arrange(rate)
#> # A tibble: 6 x 5
#> country year cases population rate
#> <chr> <int> <int> <int> <dbl>
#> 1 Afghanistan 1999 745 19987071 3.73
#> 2 Afghanistan 2000 2666 20595360 12.9
#> 3 China 1999 212258 1272915272 16.7
#> 4 China 2000 213766 1280428583 16.7
#> 5 Brazil 1999 37737 172006362 21.9
#> 6 Brazil 2000 80488 174504898 46.1
# descending order by rate
table1_rate %>%
arrange(desc(rate))
#> # A tibble: 6 x 5
#> country year cases population rate
#> <chr> <int> <int> <int> <dbl>
#> 1 Brazil 2000 80488 174504898 46.1
#> 2 Brazil 1999 37737 172006362 21.9
#> 3 China 2000 213766 1280428583 16.7
#> 4 China 1999 212258 1272915272 16.7
#> 5 Afghanistan 2000 2666 20595360 12.9
#> 6 Afghanistan 1999 745 19987071 3.73
# ascending order by year and rate
table1_rate %>%
arrange(year, rate)
#> # A tibble: 6 x 5
#> country year cases population rate
#> <chr> <int> <int> <int> <dbl>
#> 1 Afghanistan 1999 745 19987071 3.73
#> 2 China 1999 212258 1272915272 16.7
#> 3 Brazil 1999 37737 172006362 21.9
#> 4 Afghanistan 2000 2666 20595360 12.9
#> 5 China 2000 213766 1280428583 16.7
#> 6 Brazil 2000 80488 174504898 46.1
Time: 10 min
# filter the rows by two variables
# choose two numeric variables and two numbers
___ %>%
# format: variable >= number
# example: weight >= 50
filter(___ >= ___ & ___ <= ___)
# note the "&" above, compare with "|" below
# how do they differ?
___ %>%
filter(___ >= ___ | ___ <= ___)
# pipe the data and continue piping
___ %>%
# filter a variable, preferably a character variable,
filter(______ == ______) %>%
# sort by two variables
arrange(___, ___) %>%
# now keep only three variables
select(___, ___, ___)
If you are familiar with ggplot2 from the ggplot2 session, then you can optional try to visualize the dataset.
group_by()
, summarise()
# summarise on its own
# summarise only can output one (1) value, eg. a max or an average
table1_rate %>%
summarise(cum_cases = sum(cases),
max = max(cases))
#> # A tibble: 1 x 2
#> cum_cases max
#> <int> <dbl>
#> 1 547660 213766
# grouping by country, on its own, does nothing
table1_rate %>%
group_by(country)
#> # A tibble: 6 x 5
#> # Groups: country [3]
#> country year cases population rate
#> <chr> <int> <int> <int> <dbl>
#> 1 Afghanistan 1999 745 19987071 3.73
#> 2 Afghanistan 2000 2666 20595360 12.9
#> 3 Brazil 1999 37737 172006362 21.9
#> 4 Brazil 2000 80488 174504898 46.1
#> 5 China 1999 212258 1272915272 16.7
#> 6 China 2000 213766 1280428583 16.7
# group_by combined with summarise is powerful
table1_rate %>%
group_by(country) %>%
summarise(cum_cases = sum(cases),
max = max(cases))
#> # A tibble: 3 x 3
#> country cum_cases max
#> <chr> <int> <dbl>
#> 1 Afghanistan 3411 2666
#> 2 Brazil 118225 80488
#> 3 China 426024 213766
# grouping by year
table1_rate %>%
group_by(year) %>%
summarise(cum_cases = sum(cases),
max = max(cases))
#> # A tibble: 2 x 3
#> year cum_cases max
#> <int> <int> <dbl>
#> 1 1999 250740 212258
#> 2 2000 296920 213766
gather()
Compare how table1
looks normally and after converting to the long form with gather()
.
table1
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# convert to long form by stacking all cases with all population
table1 %>%
gather(variable, value, -country, -year)
#> # A tibble: 12 x 4
#> country year variable value
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 2000 cases 2666
#> 3 Brazil 1999 cases 37737
#> 4 Brazil 2000 cases 80488
#> 5 China 1999 cases 212258
#> 6 China 2000 cases 213766
#> 7 Afghanistan 1999 population 19987071
#> 8 Afghanistan 2000 population 20595360
#> 9 Brazil 1999 population 172006362
#> 10 Brazil 2000 population 174504898
#> 11 China 1999 population 1272915272
#> 12 China 2000 population 1280428583
# this does the same:
table1 %>%
gather(variable, value, cases, population)
#> # A tibble: 12 x 4
#> country year variable value
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 2000 cases 2666
#> 3 Brazil 1999 cases 37737
#> 4 Brazil 2000 cases 80488
#> 5 China 1999 cases 212258
#> 6 China 2000 cases 213766
#> 7 Afghanistan 1999 population 19987071
#> 8 Afghanistan 2000 population 20595360
#> 9 Brazil 1999 population 172006362
#> 10 Brazil 2000 population 174504898
#> 11 China 1999 population 1272915272
#> 12 China 2000 population 1280428583
Converting to long form makes other types of exploration much much easier. For instance, combined with group_by()
and summarise()
:
table1 %>%
gather(variables, values, cases, population) %>%
group_by(year, variables) %>%
summarise(cum_sum = sum(values),
max = sum(values))
#> # A tibble: 4 x 4
#> # Groups: year [?]
#> year variables cum_sum max
#> <int> <chr> <int> <int>
#> 1 1999 cases 250740 250740
#> 2 1999 population 1464908705 1464908705
#> 3 2000 cases 296920 296920
#> 4 2000 population 1475528841 1475528841
Time: Until end of the session.
Copy the code below and run it to get the dataset to use for this exercise. Using everything you learned in this session, try to recreate the table below from the dataset given.
library(ggplot2)
economics_data <- economics %>%
# get the year information
mutate(year = as.numeric(substr(date, 0, 4))) %>%
as_tibble()
economics_data
#> # A tibble: 574 x 7
#> date pce pop psavert uempmed unemploy year
#> <date> <dbl> <int> <dbl> <dbl> <int> <dbl>
#> 1 1967-07-01 507. 198712 12.5 4.5 2944 1967
#> 2 1967-08-01 510. 198911 12.5 4.7 2945 1967
#> 3 1967-09-01 516. 199113 11.7 4.6 2958 1967
#> 4 1967-10-01 513. 199311 12.5 4.9 3143 1967
#> 5 1967-11-01 518. 199498 12.5 4.7 3066 1967
#> 6 1967-12-01 526. 199657 12.1 4.8 3018 1967
#> 7 1968-01-01 532. 199808 11.7 5.1 2878 1968
#> 8 1968-02-01 534. 199920 12.2 4.5 3001 1968
#> 9 1968-03-01 545. 200056 11.6 4.1 2877 1968
#> 10 1968-04-01 545. 200208 12.2 4.6 2709 1968
#> # ... with 564 more rows
year | variable | Max | Median | Min |
---|---|---|---|---|
2000 | uempmed | 6.3000000 | 6.0000000 | 5.2000000 |
2000 | unemploy_rate | 0.0208329 | 0.0201725 | 0.0194601 |
1999 | uempmed | 6.9000000 | 6.2500000 | 5.8000000 |
1999 | unemploy_rate | 0.0219826 | 0.0210060 | 0.0201378 |
1998 | uempmed | 7.4000000 | 6.7500000 | 5.8000000 |
1998 | unemploy_rate | 0.0233487 | 0.0225944 | 0.0215798 |
1997 | uempmed | 8.3000000 | 7.9500000 | 7.5000000 |
1997 | unemploy_rate | 0.0263782 | 0.0244107 | 0.0230113 |
1996 | uempmed | 8.6000000 | 8.3000000 | 7.7000000 |
1996 | unemploy_rate | 0.0279358 | 0.0269796 | 0.0254912 |
1995 | uempmed | 9.1000000 | 8.2500000 | 7.9000000 |
1995 | unemploy_rate | 0.0287671 | 0.0278592 | 0.0269421 |
1994 | uempmed | 10.0000000 | 9.1500000 | 8.6000000 |
1994 | unemploy_rate | 0.0329491 | 0.0301162 | 0.0273032 |
1993 | uempmed | 8.6000000 | 8.3000000 | 8.0000000 |
1993 | unemploy_rate | 0.0360485 | 0.0346286 | 0.0323953 |
1992 | uempmed | 9.3000000 | 8.6500000 | 8.1000000 |
1992 | unemploy_rate | 0.0391287 | 0.0370310 | 0.0363734 |
1991 | uempmed | 7.8000000 | 6.8500000 | 6.0000000 |
1991 | unemploy_rate | 0.0360757 | 0.0342328 | 0.0318196 |
1990 | uempmed | 5.9000000 | 5.3500000 | 4.8000000 |
1990 | unemploy_rate | 0.0313998 | 0.0274685 | 0.0263759 |
spread()
To convert from long to wide, you use spread()
.
# original
table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
# after spread
table2 %>%
spread(type, count)
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
This work is licensed under a Creative Commons Attribution 4.0 International License. See the licensing page for more details about copyright information.