Work with and manage your data efficiently in R with dplyr and tidyr

Session details

Date of session: 26 Oct, 2018
Instructor: Elena Dudukina
Session level: Beginner
- Part of the “Beginner Series”

Required packages to install:

install.packages(c("dplyr", "tidyr"))
# Or just use install tidyverse
# install.packages("tidyverse")

Objectives

To learn the difference between “messy” and “tidy” data and see the advantages of using data that are consistently structured.
To perform simple data transformations and create summaries.
To get to know the tools that help tidying up and reshaping the data.

At the end of this session you will be able:

To write code using the %>% operator.
To perform simple data transformations using dplyr function mutate().
To select variables and observations to work with using dplyr functions filter()and select().
To order data by variable using dplyr function arrange().
To provide a simple data summary using dplyr functions group_by() and summarise().
To change the structure of the data using tidyr functions gather() (and optionally spread()).

Resources for learning and help

For learning:

“R for Data Science” book: Chapter on tidy data
tidyr vignette
dplyr vignette
DataCamp course “Working with Data in the Tidyverse”
- Chapter 1 “Explore your data” is free (the following chapters are available after purchasing the DataCamp subscription)

For help:

Data wrangling cheatsheet
StackOverflow for tidyr
StackOverflow for dplyr
Quick package information using RStudio: ?tidyr, ?dplyr

Quickly look over the data we will use

We’ll be using a very simple dataset that is built-in to the tidyr package. This dataset contains data on tuberculosis cases in Afghanistan, Brazil, and China from 1999 and 2000.

library(dplyr)
library(tidyr)

names(table1)
#> [1] "country"    "year"       "cases"      "population"

str(table1)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  4 variables:
#>  $ country   : chr  "Afghanistan" "Afghanistan" "Brazil" "Brazil" ...
#>  $ year      : int  1999 2000 1999 2000 1999 2000
#>  $ cases     : int  745 2666 37737 80488 212258 213766
#>  $ population: int  19987071 20595360 172006362 174504898 1272915272 1280428583

summary(table1)
#>    country               year          cases          population       
#>  Length:6           Min.   :1999   Min.   :   745   Min.   :1.999e+07  
#>  Class :character   1st Qu.:1999   1st Qu.: 11434   1st Qu.:5.845e+07  
#>  Mode  :character   Median :2000   Median : 59112   Median :1.733e+08  
#>                     Mean   :2000   Mean   : 91277   Mean   :4.901e+08  
#>                     3rd Qu.:2000   3rd Qu.:179316   3rd Qu.:9.983e+08  
#>                     Max.   :2000   Max.   :213766   Max.   :1.280e+09

# a tibble is a modified data.frame, making it slightly easier to use
as_tibble(table1)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

Exercise: Choose and look at a dataset

Create an exercise.R file (we will have done that in class) and put these code into the file. Then choose one of the below datasets that you will use for only the exercises:

airquality (recommended for complete beginners)
swiss (recommended for complete beginners)
population
storms
starwars

Copy the code below, replacing the ___ with the relevant dataset. This format is used throughout the session.

# load the packages
library(dplyr)
library(tidyr)

# variables
names(___)

# general contents
str(___)

# using the dplyr function
glimpse(___)

# quick summary
summary(___)

# compare a data.frame with a tibble
as.data.frame(___)
as_tibble(___)

# convert your dataset into a tibble
___ <- as_tibble(___)

“Messy” vs. “tidy” data

The dataset is tidy when:

Each variable has its own column.
Each observation has its own row
Each value has its own cell

library(tidyr)
library(dplyr)

Look at the structure of the toy tidyr datasets table1, table3, and table4a. Which is tidy?

table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
table3
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583
table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

More or less, table1 is tidy while the others are not. That’s because in table1, each column has specific and unique information and each row represents a single observation from a single location at a single point in time.

Benefits of tidy data and tidy code:

Time spent preparing your data at the beginning can save hours of frustration in the long run.
“Tidy data” provides a concept and the package tidyr provides the functions for this work.
Tibbles makes datasets efficient to work with.
dplyr provides fast and intuitive data processing functions.
- For very very large datasets, the data.table package is very fast and powerful
The %>% pipe operator can help clarify complex data processing workflows.

Pipe operator: `%>%`

This operator allows you to pipe the output from one function to the input of another function.
Instead of nesting functions (reading from the inside to the outside), the idea of piping is to read the functions from left to right.
The %>% pipe operator can help clarify complex data processing workflows.

# normal R way of nesting functions
head(glimpse(table1))
#> Observations: 6
#> Variables: 4
#> $ country    <chr> "Afghanistan", "Afghanistan", "Brazil", "Brazil", "...
#> $ year       <int> 1999, 2000, 1999, 2000, 1999, 2000
#> $ cases      <int> 745, 2666, 37737, 80488, 212258, 213766
#> $ population <int> 19987071, 20595360, 172006362, 174504898, 127291527...
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

# the pipe way of linking functions
table1 %>% glimpse() %>% head()
#> Observations: 6
#> Variables: 4
#> $ country    <chr> "Afghanistan", "Afghanistan", "Brazil", "Brazil", "...
#> $ year       <int> 1999, 2000, 1999, 2000, 1999, 2000
#> $ cases      <int> 745, 2666, 37737, 80488, 212258, 213766
#> $ population <int> 19987071, 20595360, 172006362, 174504898, 127291527...
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

Transforming or adding variables: `mutate()`

The mutate() function adds or replaces a variable/column in a dataset. To add

# Replace an existing variable
table1 %>%
    mutate(population = population / 1000000)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <dbl>
#> 1 Afghanistan  1999    745       20.0
#> 2 Afghanistan  2000   2666       20.6
#> 3 Brazil       1999  37737      172. 
#> 4 Brazil       2000  80488      175. 
#> 5 China        1999 212258     1273. 
#> 6 China        2000 213766     1280.

# Or create a new variable based on a condition
table1 %>%
    mutate(after_2000 = if_else(year >= 2000, "yes", "no"))
#> # A tibble: 6 x 5
#>   country      year  cases population after_2000
#>   <chr>       <int>  <int>      <int> <chr>     
#> 1 Afghanistan  1999    745   19987071 no        
#> 2 Afghanistan  2000   2666   20595360 yes       
#> 3 Brazil       1999  37737  172006362 no        
#> 4 Brazil       2000  80488  174504898 yes       
#> 5 China        1999 212258 1272915272 no        
#> 6 China        2000 213766 1280428583 yes

# Create or replace multiple variables by using the ","
table1 %>%
    mutate(new_column = "only one value",
           population = population / 1000000)
#> # A tibble: 6 x 5
#>   country      year  cases population new_column    
#>   <chr>       <int>  <int>      <dbl> <chr>         
#> 1 Afghanistan  1999    745       20.0 only one value
#> 2 Afghanistan  2000   2666       20.6 only one value
#> 3 Brazil       1999  37737      172.  only one value
#> 4 Brazil       2000  80488      175.  only one value
#> 5 China        1999 212258     1273.  only one value
#> 6 China        2000 213766     1280.  only one value

# Create a new variable using existing variables and save a new dataframe
table1_rate <- table1 %>%
    mutate(rate = (cases / population) * 100000)
table1_rate
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Afghanistan  2000   2666   20595360 12.9 
#> 3 Brazil       1999  37737  172006362 21.9 
#> 4 Brazil       2000  80488  174504898 46.1 
#> 5 China        1999 212258 1272915272 16.7 
#> 6 China        2000 213766 1280428583 16.7

Exercise: Piping, transforming, and adding

Time: 10 min

# explore the structure of the data
names(___)

# pipe the data into mutate function and:
new_dataset <- ___ %>% # dataset
    mutate(
        # 1. create a new variable
        ___ = ___,
        # 2. transform/replace an existing variable
        ___ = ___,
        # 3. create a new variable using a conditional
        ___ = if_else(___, TRUE, FALSE)
    )
new_dataset

Select specific data by the variables: `select()`

# select columns/variables by name, without quotes
table1_rate %>%
    select(country, year, rate)
#> # A tibble: 6 x 3
#>   country      year  rate
#>   <chr>       <int> <dbl>
#> 1 Afghanistan  1999  3.73
#> 2 Afghanistan  2000 12.9 
#> 3 Brazil       1999 21.9 
#> 4 Brazil       2000 46.1 
#> 5 China        1999 16.7 
#> 6 China        2000 16.7

# to *not* select a variable, us minus (-)
table1_rate %>%
    select(-country)
#> # A tibble: 6 x 4
#>    year  cases population  rate
#>   <int>  <int>      <int> <dbl>
#> 1  1999    745   19987071  3.73
#> 2  2000   2666   20595360 12.9 
#> 3  1999  37737  172006362 21.9 
#> 4  2000  80488  174504898 46.1 
#> 5  1999 212258 1272915272 16.7 
#> 6  2000 213766 1280428583 16.7

# when you have many variables with similar names, use "matching" functions
table1_rate %>%
    select(starts_with("c"), ends_with("e"), contains("pop"))
#> # A tibble: 6 x 4
#>   country      cases  rate population
#>   <chr>        <int> <dbl>      <int>
#> 1 Afghanistan    745  3.73   19987071
#> 2 Afghanistan   2666 12.9    20595360
#> 3 Brazil       37737 21.9   172006362
#> 4 Brazil       80488 46.1   174504898
#> 5 China       212258 16.7  1272915272
#> 6 China       213766 16.7  1280428583

Filter/subset the data by row based on values and conditions: `filter()`

Warning: Filter using logic… and humans are very bad at logic… Make sure to be very certain that you think your logic is what the code reads as logic… Lots of mistakes can be made at this stage! Especially with complex logic situations.

# when country is equal to
table1_rate %>%
    filter(country == "Brazil")
#> # A tibble: 2 x 5
#>   country  year cases population  rate
#>   <chr>   <int> <int>      <int> <dbl>
#> 1 Brazil   1999 37737  172006362  21.9
#> 2 Brazil   2000 80488  174504898  46.1

# when country is *not* equal to
table1_rate %>%
    filter(country != "Brazil")
#> # A tibble: 4 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Afghanistan  2000   2666   20595360 12.9 
#> 3 China        1999 212258 1272915272 16.7 
#> 4 China        2000 213766 1280428583 16.7

# when year is equal to
table1_rate %>%
    filter(year == 1999)
#> # A tibble: 3 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Brazil       1999  37737  172006362 21.9 
#> 3 China        1999 212258 1272915272 16.7

# Or when year is equal to or more than
table1_rate %>%
    filter(year >= 1999)
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Afghanistan  2000   2666   20595360 12.9 
#> 3 Brazil       1999  37737  172006362 21.9 
#> 4 Brazil       2000  80488  174504898 46.1 
#> 5 China        1999 212258 1272915272 16.7 
#> 6 China        2000 213766 1280428583 16.7

# when year is 1999 *and* country is Brazil
table1_rate %>%
    filter(year == 1999 & country == "Brazil")
#> # A tibble: 1 x 5
#>   country  year cases population  rate
#>   <chr>   <int> <int>      <int> <dbl>
#> 1 Brazil   1999 37737  172006362  21.9

# when year is 1999 *or* country is Brazil
table1_rate %>%
    filter(year == 1999 | country == "Brazil")
#> # A tibble: 4 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Brazil       1999  37737  172006362 21.9 
#> 3 Brazil       2000  80488  174504898 46.1 
#> 4 China        1999 212258 1272915272 16.7

Sorting/(re)arranging your data by column/variable: `arrange()`

# ascending order by rate
table1_rate %>%
    arrange(rate)
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Afghanistan  2000   2666   20595360 12.9 
#> 3 China        1999 212258 1272915272 16.7 
#> 4 China        2000 213766 1280428583 16.7 
#> 5 Brazil       1999  37737  172006362 21.9 
#> 6 Brazil       2000  80488  174504898 46.1

# descending order by rate
table1_rate %>%
    arrange(desc(rate))
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Brazil       2000  80488  174504898 46.1 
#> 2 Brazil       1999  37737  172006362 21.9 
#> 3 China        2000 213766 1280428583 16.7 
#> 4 China        1999 212258 1272915272 16.7 
#> 5 Afghanistan  2000   2666   20595360 12.9 
#> 6 Afghanistan  1999    745   19987071  3.73

# ascending order by year and rate
table1_rate %>%
    arrange(year, rate)
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 China        1999 212258 1272915272 16.7 
#> 3 Brazil       1999  37737  172006362 21.9 
#> 4 Afghanistan  2000   2666   20595360 12.9 
#> 5 China        2000 213766 1280428583 16.7 
#> 6 Brazil       2000  80488  174504898 46.1

Exercise: Filtering and logic, arranging, and selecting

Time: 10 min

# filter the rows by two variables
# choose two numeric variables and two numbers
___ %>%
    # format: variable >= number
    # example: weight >= 50
    filter(___ >= ___ & ___ <= ___)

# note the "&" above, compare with "|" below
# how do they differ?
___ %>%
    filter(___ >= ___ | ___ <= ___)

# pipe the data and continue piping
___ %>%
    # filter a variable, preferably a character variable,
    filter(______  ==  ______) %>%
    # sort by two variables
    arrange(___,   ___) %>%
    # now keep only three variables
    select(___,  ___,  ___)

If you are familiar with ggplot2 from the ggplot2 session, then you can optional try to visualize the dataset.

Create a summary of the data, alone or by a group(s): `group_by()`, `summarise()`

# summarise on its own
# summarise only can output one (1) value, eg. a max or an average
table1_rate %>%
    summarise(cum_cases = sum(cases),
              max = max(cases))
#> # A tibble: 1 x 2
#>   cum_cases    max
#>       <int>  <dbl>
#> 1    547660 213766

# grouping by country, on its own, does nothing
table1_rate %>%
    group_by(country) 
#> # A tibble: 6 x 5
#> # Groups:   country [3]
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Afghanistan  2000   2666   20595360 12.9 
#> 3 Brazil       1999  37737  172006362 21.9 
#> 4 Brazil       2000  80488  174504898 46.1 
#> 5 China        1999 212258 1272915272 16.7 
#> 6 China        2000 213766 1280428583 16.7

# group_by combined with summarise is powerful
table1_rate %>%
    group_by(country) %>%
    summarise(cum_cases = sum(cases),
              max = max(cases))
#> # A tibble: 3 x 3
#>   country     cum_cases    max
#>   <chr>           <int>  <dbl>
#> 1 Afghanistan      3411   2666
#> 2 Brazil         118225  80488
#> 3 China          426024 213766

# grouping by year
table1_rate %>%
    group_by(year) %>%
    summarise(cum_cases = sum(cases),
              max = max(cases))
#> # A tibble: 2 x 3
#>    year cum_cases    max
#>   <int>     <int>  <dbl>
#> 1  1999    250740 212258
#> 2  2000    296920 213766

Converting to long form: `gather()`

Compare how table1 looks normally and after converting to the long form with gather().

table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

# convert to long form by stacking all cases with all population
table1 %>% 
    gather(variable, value, -country, -year)
#> # A tibble: 12 x 4
#>    country      year variable        value
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  2000 cases            2666
#>  3 Brazil       1999 cases           37737
#>  4 Brazil       2000 cases           80488
#>  5 China        1999 cases          212258
#>  6 China        2000 cases          213766
#>  7 Afghanistan  1999 population   19987071
#>  8 Afghanistan  2000 population   20595360
#>  9 Brazil       1999 population  172006362
#> 10 Brazil       2000 population  174504898
#> 11 China        1999 population 1272915272
#> 12 China        2000 population 1280428583

# this does the same:
table1 %>%
    gather(variable, value, cases, population)
#> # A tibble: 12 x 4
#>    country      year variable        value
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  2000 cases            2666
#>  3 Brazil       1999 cases           37737
#>  4 Brazil       2000 cases           80488
#>  5 China        1999 cases          212258
#>  6 China        2000 cases          213766
#>  7 Afghanistan  1999 population   19987071
#>  8 Afghanistan  2000 population   20595360
#>  9 Brazil       1999 population  172006362
#> 10 Brazil       2000 population  174504898
#> 11 China        1999 population 1272915272
#> 12 China        2000 population 1280428583

Converting to long form makes other types of exploration much much easier. For instance, combined with group_by() and summarise():

table1 %>% 
    gather(variables, values, cases, population) %>% 
    group_by(year, variables) %>% 
    summarise(cum_sum = sum(values),
              max = sum(values))
#> # A tibble: 4 x 4
#> # Groups:   year [?]
#>    year variables     cum_sum        max
#>   <int> <chr>           <int>      <int>
#> 1  1999 cases          250740     250740
#> 2  1999 population 1464908705 1464908705
#> 3  2000 cases          296920     296920
#> 4  2000 population 1475528841 1475528841

Final exercise: Try to replicate an output.

Time: Until end of the session.

Copy the code below and run it to get the dataset to use for this exercise. Using everything you learned in this session, try to recreate the table below from the dataset given.

library(ggplot2)
economics_data <- economics %>% 
    # get the year information
    mutate(year = as.numeric(substr(date, 0, 4))) %>% 
    as_tibble()
economics_data
#> # A tibble: 574 x 7
#>    date         pce    pop psavert uempmed unemploy  year
#>    <date>     <dbl>  <int>   <dbl>   <dbl>    <int> <dbl>
#>  1 1967-07-01  507. 198712    12.5     4.5     2944  1967
#>  2 1967-08-01  510. 198911    12.5     4.7     2945  1967
#>  3 1967-09-01  516. 199113    11.7     4.6     2958  1967
#>  4 1967-10-01  513. 199311    12.5     4.9     3143  1967
#>  5 1967-11-01  518. 199498    12.5     4.7     3066  1967
#>  6 1967-12-01  526. 199657    12.1     4.8     3018  1967
#>  7 1968-01-01  532. 199808    11.7     5.1     2878  1968
#>  8 1968-02-01  534. 199920    12.2     4.5     3001  1968
#>  9 1968-03-01  545. 200056    11.6     4.1     2877  1968
#> 10 1968-04-01  545. 200208    12.2     4.6     2709  1968
#> # ... with 564 more rows

From the data above, try to recreate this data below in this table.
year	variable	Max	Median	Min
2000	uempmed	6.3000000	6.0000000	5.2000000
2000	unemploy_rate	0.0208329	0.0201725	0.0194601
1999	uempmed	6.9000000	6.2500000	5.8000000
1999	unemploy_rate	0.0219826	0.0210060	0.0201378
1998	uempmed	7.4000000	6.7500000	5.8000000
1998	unemploy_rate	0.0233487	0.0225944	0.0215798
1997	uempmed	8.3000000	7.9500000	7.5000000
1997	unemploy_rate	0.0263782	0.0244107	0.0230113
1996	uempmed	8.6000000	8.3000000	7.7000000
1996	unemploy_rate	0.0279358	0.0269796	0.0254912
1995	uempmed	9.1000000	8.2500000	7.9000000
1995	unemploy_rate	0.0287671	0.0278592	0.0269421
1994	uempmed	10.0000000	9.1500000	8.6000000
1994	unemploy_rate	0.0329491	0.0301162	0.0273032
1993	uempmed	8.6000000	8.3000000	8.0000000
1993	unemploy_rate	0.0360485	0.0346286	0.0323953
1992	uempmed	9.3000000	8.6500000	8.1000000
1992	unemploy_rate	0.0391287	0.0370310	0.0363734
1991	uempmed	7.8000000	6.8500000	6.0000000
1991	unemploy_rate	0.0360757	0.0342328	0.0318196
1990	uempmed	5.9000000	5.3500000	4.8000000
1990	unemploy_rate	0.0313998	0.0274685	0.0263759

(If time) Converting to wide form with `spread()`

To convert from long to wide, you use spread().

# original
table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583

# after spread
table2 %>% 
    spread(type, count)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

This work is licensed under a Creative Commons Attribution 4.0 International License. See the licensing page for more details about copyright information.

Work with and manage your data efficiently in R with dplyr and tidyr

Session details

Objectives

Resources for learning and help

Quickly look over the data we will use

Exercise: Choose and look at a dataset

“Messy” vs. “tidy” data

Pipe operator: %>%

Transforming or adding variables: mutate()

Exercise: Piping, transforming, and adding

Select specific data by the variables: select()

Filter/subset the data by row based on values and conditions: filter()

Sorting/(re)arranging your data by column/variable: arrange()