Required packages to install:
At the end of this session you will be able:
%>% operator.mutate().filter()and select().arrange().group_by() and summarise().gather() (and optionally spread()).For learning:
For help:
?tidyr, ?dplyrWe’ll be using a very simple dataset that is built-in to the tidyr package. This dataset contains data on tuberculosis cases in Afghanistan, Brazil, and China from 1999 and 2000.
library(dplyr)
library(tidyr)
names(table1)
#> [1] "country"    "year"       "cases"      "population"
str(table1)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  4 variables:
#>  $ country   : chr  "Afghanistan" "Afghanistan" "Brazil" "Brazil" ...
#>  $ year      : int  1999 2000 1999 2000 1999 2000
#>  $ cases     : int  745 2666 37737 80488 212258 213766
#>  $ population: int  19987071 20595360 172006362 174504898 1272915272 1280428583
summary(table1)
#>    country               year          cases          population       
#>  Length:6           Min.   :1999   Min.   :   745   Min.   :1.999e+07  
#>  Class :character   1st Qu.:1999   1st Qu.: 11434   1st Qu.:5.845e+07  
#>  Mode  :character   Median :2000   Median : 59112   Median :1.733e+08  
#>                     Mean   :2000   Mean   : 91277   Mean   :4.901e+08  
#>                     3rd Qu.:2000   3rd Qu.:179316   3rd Qu.:9.983e+08  
#>                     Max.   :2000   Max.   :213766   Max.   :1.280e+09
# a tibble is a modified data.frame, making it slightly easier to use
as_tibble(table1)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583Create an exercise.R file (we will have done that in class) and put these code into the file. Then choose one of the below datasets that you will use for only the exercises:
airquality (recommended for complete beginners)swiss (recommended for complete beginners)populationstormsstarwarsCopy the code below, replacing the ___ with the relevant dataset. This format is used throughout the session.
The dataset is tidy when:
Look at the structure of the toy tidyr datasets table1, table3, and table4a. Which is tidy?
table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
table3
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583
table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766More or less, table1 is tidy while the others are not. That’s because in table1, each column has specific and unique information and each row represents a single observation from a single location at a single point in time.
Benefits of tidy data and tidy code:
%>% pipe operator can help clarify complex data processing workflows.%>%%>% pipe operator can help clarify complex data processing workflows.# normal R way of nesting functions
head(glimpse(table1))
#> Observations: 6
#> Variables: 4
#> $ country    <chr> "Afghanistan", "Afghanistan", "Brazil", "Brazil", "...
#> $ year       <int> 1999, 2000, 1999, 2000, 1999, 2000
#> $ cases      <int> 745, 2666, 37737, 80488, 212258, 213766
#> $ population <int> 19987071, 20595360, 172006362, 174504898, 127291527...
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
# the pipe way of linking functions
table1 %>% glimpse() %>% head()
#> Observations: 6
#> Variables: 4
#> $ country    <chr> "Afghanistan", "Afghanistan", "Brazil", "Brazil", "...
#> $ year       <int> 1999, 2000, 1999, 2000, 1999, 2000
#> $ cases      <int> 745, 2666, 37737, 80488, 212258, 213766
#> $ population <int> 19987071, 20595360, 172006362, 174504898, 127291527...
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583mutate()The mutate() function adds or replaces a variable/column in a dataset. To add
# Replace an existing variable
table1 %>%
    mutate(population = population / 1000000)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <dbl>
#> 1 Afghanistan  1999    745       20.0
#> 2 Afghanistan  2000   2666       20.6
#> 3 Brazil       1999  37737      172. 
#> 4 Brazil       2000  80488      175. 
#> 5 China        1999 212258     1273. 
#> 6 China        2000 213766     1280.
# Or create a new variable based on a condition
table1 %>%
    mutate(after_2000 = if_else(year >= 2000, "yes", "no"))
#> # A tibble: 6 x 5
#>   country      year  cases population after_2000
#>   <chr>       <int>  <int>      <int> <chr>     
#> 1 Afghanistan  1999    745   19987071 no        
#> 2 Afghanistan  2000   2666   20595360 yes       
#> 3 Brazil       1999  37737  172006362 no        
#> 4 Brazil       2000  80488  174504898 yes       
#> 5 China        1999 212258 1272915272 no        
#> 6 China        2000 213766 1280428583 yes
# Create or replace multiple variables by using the ","
table1 %>%
    mutate(new_column = "only one value",
           population = population / 1000000)
#> # A tibble: 6 x 5
#>   country      year  cases population new_column    
#>   <chr>       <int>  <int>      <dbl> <chr>         
#> 1 Afghanistan  1999    745       20.0 only one value
#> 2 Afghanistan  2000   2666       20.6 only one value
#> 3 Brazil       1999  37737      172.  only one value
#> 4 Brazil       2000  80488      175.  only one value
#> 5 China        1999 212258     1273.  only one value
#> 6 China        2000 213766     1280.  only one value
# Create a new variable using existing variables and save a new dataframe
table1_rate <- table1 %>%
    mutate(rate = (cases / population) * 100000)
table1_rate
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Afghanistan  2000   2666   20595360 12.9 
#> 3 Brazil       1999  37737  172006362 21.9 
#> 4 Brazil       2000  80488  174504898 46.1 
#> 5 China        1999 212258 1272915272 16.7 
#> 6 China        2000 213766 1280428583 16.7Time: 10 min
# explore the structure of the data
names(___)
# pipe the data into mutate function and:
new_dataset <- ___ %>% # dataset
    mutate(
        # 1. create a new variable
        ___ = ___,
        # 2. transform/replace an existing variable
        ___ = ___,
        # 3. create a new variable using a conditional
        ___ = if_else(___, TRUE, FALSE)
    )
new_datasetselect()# select columns/variables by name, without quotes
table1_rate %>%
    select(country, year, rate)
#> # A tibble: 6 x 3
#>   country      year  rate
#>   <chr>       <int> <dbl>
#> 1 Afghanistan  1999  3.73
#> 2 Afghanistan  2000 12.9 
#> 3 Brazil       1999 21.9 
#> 4 Brazil       2000 46.1 
#> 5 China        1999 16.7 
#> 6 China        2000 16.7
# to *not* select a variable, us minus (-)
table1_rate %>%
    select(-country)
#> # A tibble: 6 x 4
#>    year  cases population  rate
#>   <int>  <int>      <int> <dbl>
#> 1  1999    745   19987071  3.73
#> 2  2000   2666   20595360 12.9 
#> 3  1999  37737  172006362 21.9 
#> 4  2000  80488  174504898 46.1 
#> 5  1999 212258 1272915272 16.7 
#> 6  2000 213766 1280428583 16.7
# when you have many variables with similar names, use "matching" functions
table1_rate %>%
    select(starts_with("c"), ends_with("e"), contains("pop"))
#> # A tibble: 6 x 4
#>   country      cases  rate population
#>   <chr>        <int> <dbl>      <int>
#> 1 Afghanistan    745  3.73   19987071
#> 2 Afghanistan   2666 12.9    20595360
#> 3 Brazil       37737 21.9   172006362
#> 4 Brazil       80488 46.1   174504898
#> 5 China       212258 16.7  1272915272
#> 6 China       213766 16.7  1280428583filter()Warning: Filter using logic… and humans are very bad at logic… Make sure to be very certain that you think your logic is what the code reads as logic… Lots of mistakes can be made at this stage! Especially with complex logic situations.
# when country is equal to
table1_rate %>%
    filter(country == "Brazil")
#> # A tibble: 2 x 5
#>   country  year cases population  rate
#>   <chr>   <int> <int>      <int> <dbl>
#> 1 Brazil   1999 37737  172006362  21.9
#> 2 Brazil   2000 80488  174504898  46.1
# when country is *not* equal to
table1_rate %>%
    filter(country != "Brazil")
#> # A tibble: 4 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Afghanistan  2000   2666   20595360 12.9 
#> 3 China        1999 212258 1272915272 16.7 
#> 4 China        2000 213766 1280428583 16.7
# when year is equal to
table1_rate %>%
    filter(year == 1999)
#> # A tibble: 3 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Brazil       1999  37737  172006362 21.9 
#> 3 China        1999 212258 1272915272 16.7
# Or when year is equal to or more than
table1_rate %>%
    filter(year >= 1999)
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Afghanistan  2000   2666   20595360 12.9 
#> 3 Brazil       1999  37737  172006362 21.9 
#> 4 Brazil       2000  80488  174504898 46.1 
#> 5 China        1999 212258 1272915272 16.7 
#> 6 China        2000 213766 1280428583 16.7
# when year is 1999 *and* country is Brazil
table1_rate %>%
    filter(year == 1999 & country == "Brazil")
#> # A tibble: 1 x 5
#>   country  year cases population  rate
#>   <chr>   <int> <int>      <int> <dbl>
#> 1 Brazil   1999 37737  172006362  21.9
# when year is 1999 *or* country is Brazil
table1_rate %>%
    filter(year == 1999 | country == "Brazil")
#> # A tibble: 4 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Brazil       1999  37737  172006362 21.9 
#> 3 Brazil       2000  80488  174504898 46.1 
#> 4 China        1999 212258 1272915272 16.7arrange()# ascending order by rate
table1_rate %>%
    arrange(rate)
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Afghanistan  2000   2666   20595360 12.9 
#> 3 China        1999 212258 1272915272 16.7 
#> 4 China        2000 213766 1280428583 16.7 
#> 5 Brazil       1999  37737  172006362 21.9 
#> 6 Brazil       2000  80488  174504898 46.1
# descending order by rate
table1_rate %>%
    arrange(desc(rate))
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Brazil       2000  80488  174504898 46.1 
#> 2 Brazil       1999  37737  172006362 21.9 
#> 3 China        2000 213766 1280428583 16.7 
#> 4 China        1999 212258 1272915272 16.7 
#> 5 Afghanistan  2000   2666   20595360 12.9 
#> 6 Afghanistan  1999    745   19987071  3.73
# ascending order by year and rate
table1_rate %>%
    arrange(year, rate)
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 China        1999 212258 1272915272 16.7 
#> 3 Brazil       1999  37737  172006362 21.9 
#> 4 Afghanistan  2000   2666   20595360 12.9 
#> 5 China        2000 213766 1280428583 16.7 
#> 6 Brazil       2000  80488  174504898 46.1Time: 10 min
# filter the rows by two variables
# choose two numeric variables and two numbers
___ %>%
    # format: variable >= number
    # example: weight >= 50
    filter(___ >= ___ & ___ <= ___)
# note the "&" above, compare with "|" below
# how do they differ?
___ %>%
    filter(___ >= ___ | ___ <= ___)
# pipe the data and continue piping
___ %>%
    # filter a variable, preferably a character variable,
    filter(______  ==  ______) %>%
    # sort by two variables
    arrange(___,   ___) %>%
    # now keep only three variables
    select(___,  ___,  ___) If you are familiar with ggplot2 from the ggplot2 session, then you can optional try to visualize the dataset.
group_by(), summarise()# summarise on its own
# summarise only can output one (1) value, eg. a max or an average
table1_rate %>%
    summarise(cum_cases = sum(cases),
              max = max(cases))
#> # A tibble: 1 x 2
#>   cum_cases    max
#>       <int>  <dbl>
#> 1    547660 213766
# grouping by country, on its own, does nothing
table1_rate %>%
    group_by(country) 
#> # A tibble: 6 x 5
#> # Groups:   country [3]
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Afghanistan  2000   2666   20595360 12.9 
#> 3 Brazil       1999  37737  172006362 21.9 
#> 4 Brazil       2000  80488  174504898 46.1 
#> 5 China        1999 212258 1272915272 16.7 
#> 6 China        2000 213766 1280428583 16.7
# group_by combined with summarise is powerful
table1_rate %>%
    group_by(country) %>%
    summarise(cum_cases = sum(cases),
              max = max(cases))
#> # A tibble: 3 x 3
#>   country     cum_cases    max
#>   <chr>           <int>  <dbl>
#> 1 Afghanistan      3411   2666
#> 2 Brazil         118225  80488
#> 3 China          426024 213766
# grouping by year
table1_rate %>%
    group_by(year) %>%
    summarise(cum_cases = sum(cases),
              max = max(cases))
#> # A tibble: 2 x 3
#>    year cum_cases    max
#>   <int>     <int>  <dbl>
#> 1  1999    250740 212258
#> 2  2000    296920 213766gather()Compare how table1 looks normally and after converting to the long form with gather().
table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
# convert to long form by stacking all cases with all population
table1 %>% 
    gather(variable, value, -country, -year)
#> # A tibble: 12 x 4
#>    country      year variable        value
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  2000 cases            2666
#>  3 Brazil       1999 cases           37737
#>  4 Brazil       2000 cases           80488
#>  5 China        1999 cases          212258
#>  6 China        2000 cases          213766
#>  7 Afghanistan  1999 population   19987071
#>  8 Afghanistan  2000 population   20595360
#>  9 Brazil       1999 population  172006362
#> 10 Brazil       2000 population  174504898
#> 11 China        1999 population 1272915272
#> 12 China        2000 population 1280428583
# this does the same:
table1 %>%
    gather(variable, value, cases, population)
#> # A tibble: 12 x 4
#>    country      year variable        value
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  2000 cases            2666
#>  3 Brazil       1999 cases           37737
#>  4 Brazil       2000 cases           80488
#>  5 China        1999 cases          212258
#>  6 China        2000 cases          213766
#>  7 Afghanistan  1999 population   19987071
#>  8 Afghanistan  2000 population   20595360
#>  9 Brazil       1999 population  172006362
#> 10 Brazil       2000 population  174504898
#> 11 China        1999 population 1272915272
#> 12 China        2000 population 1280428583Converting to long form makes other types of exploration much much easier. For instance, combined with group_by() and summarise():
table1 %>% 
    gather(variables, values, cases, population) %>% 
    group_by(year, variables) %>% 
    summarise(cum_sum = sum(values),
              max = sum(values))
#> # A tibble: 4 x 4
#> # Groups:   year [?]
#>    year variables     cum_sum        max
#>   <int> <chr>           <int>      <int>
#> 1  1999 cases          250740     250740
#> 2  1999 population 1464908705 1464908705
#> 3  2000 cases          296920     296920
#> 4  2000 population 1475528841 1475528841Time: Until end of the session.
Copy the code below and run it to get the dataset to use for this exercise. Using everything you learned in this session, try to recreate the table below from the dataset given.
library(ggplot2)
economics_data <- economics %>% 
    # get the year information
    mutate(year = as.numeric(substr(date, 0, 4))) %>% 
    as_tibble()
economics_data
#> # A tibble: 574 x 7
#>    date         pce    pop psavert uempmed unemploy  year
#>    <date>     <dbl>  <int>   <dbl>   <dbl>    <int> <dbl>
#>  1 1967-07-01  507. 198712    12.5     4.5     2944  1967
#>  2 1967-08-01  510. 198911    12.5     4.7     2945  1967
#>  3 1967-09-01  516. 199113    11.7     4.6     2958  1967
#>  4 1967-10-01  513. 199311    12.5     4.9     3143  1967
#>  5 1967-11-01  518. 199498    12.5     4.7     3066  1967
#>  6 1967-12-01  526. 199657    12.1     4.8     3018  1967
#>  7 1968-01-01  532. 199808    11.7     5.1     2878  1968
#>  8 1968-02-01  534. 199920    12.2     4.5     3001  1968
#>  9 1968-03-01  545. 200056    11.6     4.1     2877  1968
#> 10 1968-04-01  545. 200208    12.2     4.6     2709  1968
#> # ... with 564 more rows| year | variable | Max | Median | Min | 
|---|---|---|---|---|
| 2000 | uempmed | 6.3000000 | 6.0000000 | 5.2000000 | 
| 2000 | unemploy_rate | 0.0208329 | 0.0201725 | 0.0194601 | 
| 1999 | uempmed | 6.9000000 | 6.2500000 | 5.8000000 | 
| 1999 | unemploy_rate | 0.0219826 | 0.0210060 | 0.0201378 | 
| 1998 | uempmed | 7.4000000 | 6.7500000 | 5.8000000 | 
| 1998 | unemploy_rate | 0.0233487 | 0.0225944 | 0.0215798 | 
| 1997 | uempmed | 8.3000000 | 7.9500000 | 7.5000000 | 
| 1997 | unemploy_rate | 0.0263782 | 0.0244107 | 0.0230113 | 
| 1996 | uempmed | 8.6000000 | 8.3000000 | 7.7000000 | 
| 1996 | unemploy_rate | 0.0279358 | 0.0269796 | 0.0254912 | 
| 1995 | uempmed | 9.1000000 | 8.2500000 | 7.9000000 | 
| 1995 | unemploy_rate | 0.0287671 | 0.0278592 | 0.0269421 | 
| 1994 | uempmed | 10.0000000 | 9.1500000 | 8.6000000 | 
| 1994 | unemploy_rate | 0.0329491 | 0.0301162 | 0.0273032 | 
| 1993 | uempmed | 8.6000000 | 8.3000000 | 8.0000000 | 
| 1993 | unemploy_rate | 0.0360485 | 0.0346286 | 0.0323953 | 
| 1992 | uempmed | 9.3000000 | 8.6500000 | 8.1000000 | 
| 1992 | unemploy_rate | 0.0391287 | 0.0370310 | 0.0363734 | 
| 1991 | uempmed | 7.8000000 | 6.8500000 | 6.0000000 | 
| 1991 | unemploy_rate | 0.0360757 | 0.0342328 | 0.0318196 | 
| 1990 | uempmed | 5.9000000 | 5.3500000 | 4.8000000 | 
| 1990 | unemploy_rate | 0.0313998 | 0.0274685 | 0.0263759 | 
spread()To convert from long to wide, you use spread().
# original
table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583
# after spread
table2 %>% 
    spread(type, count)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583This work is licensed under a Creative Commons Attribution 4.0 International License. See the licensing page for more details about copyright information.