Session details

Objectives

  1. To learn the difference between “messy” and “tidy” data and see the advantages of using data that are consistently structured.
  2. To perform simple data transformations and create summaries.
  3. To get to know the tools that help tidying up and reshaping the data.

At the end of this session you will be able:

  1. To write code using the %>% operator.
  2. To perform simple data transformations using dplyr function mutate().
  3. To select variables and observations to work with using dplyr functions filter()and select().
  4. To order data by variable using dplyr function arrange().
  5. To provide a simple data summary using dplyr functions group_by() and summarise().
  6. To change the structure of the data using tidyr functions gather() (and optionally spread()).

Resources for learning and help

For learning:

  1. “R for Data Science” book: Chapter on tidy data
  2. tidyr vignette
  3. dplyr vignette
  4. DataCamp course “Working with Data in the Tidyverse”
    • Chapter 1 “Explore your data” is free (the following chapters are available after purchasing the DataCamp subscription)

For help:

  1. Data wrangling cheatsheet
  2. StackOverflow for tidyr
  3. StackOverflow for dplyr
  4. Quick package information using RStudio: ?tidyr, ?dplyr

Exercise: Choose and look at a dataset

Create an exercise.R file (we will have done that in class) and put these code into the file. Then choose one of the below datasets that you will use for only the exercises:

  • airquality (recommended for complete beginners)
  • swiss (recommended for complete beginners)
  • population
  • storms
  • starwars

Copy the code below, replacing the ___ with the relevant dataset. This format is used throughout the session.

“Messy” vs. “tidy” data

The dataset is tidy when:

  • Each variable has its own column.
  • Each observation has its own row
  • Each value has its own cell

Look at the structure of the toy tidyr datasets table1, table3, and table4a. Which is tidy?

More or less, table1 is tidy while the others are not. That’s because in table1, each column has specific and unique information and each row represents a single observation from a single location at a single point in time.

Benefits of tidy data and tidy code:

  1. Time spent preparing your data at the beginning can save hours of frustration in the long run.
  2. “Tidy data” provides a concept and the package tidyr provides the functions for this work.
  3. Tibbles makes datasets efficient to work with.
  4. dplyr provides fast and intuitive data processing functions.
    • For very very large datasets, the data.table package is very fast and powerful
  5. The %>% pipe operator can help clarify complex data processing workflows.

Pipe operator: %>%

  • This operator allows you to pipe the output from one function to the input of another function.
  • Instead of nesting functions (reading from the inside to the outside), the idea of piping is to read the functions from left to right.
  • The %>% pipe operator can help clarify complex data processing workflows.

Transforming or adding variables: mutate()

The mutate() function adds or replaces a variable/column in a dataset. To add

# Replace an existing variable
table1 %>%
    mutate(population = population / 1000000)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <dbl>
#> 1 Afghanistan  1999    745       20.0
#> 2 Afghanistan  2000   2666       20.6
#> 3 Brazil       1999  37737      172. 
#> 4 Brazil       2000  80488      175. 
#> 5 China        1999 212258     1273. 
#> 6 China        2000 213766     1280.

# Or create a new variable based on a condition
table1 %>%
    mutate(after_2000 = if_else(year >= 2000, "yes", "no"))
#> # A tibble: 6 x 5
#>   country      year  cases population after_2000
#>   <chr>       <int>  <int>      <int> <chr>     
#> 1 Afghanistan  1999    745   19987071 no        
#> 2 Afghanistan  2000   2666   20595360 yes       
#> 3 Brazil       1999  37737  172006362 no        
#> 4 Brazil       2000  80488  174504898 yes       
#> 5 China        1999 212258 1272915272 no        
#> 6 China        2000 213766 1280428583 yes

# Create or replace multiple variables by using the ","
table1 %>%
    mutate(new_column = "only one value",
           population = population / 1000000)
#> # A tibble: 6 x 5
#>   country      year  cases population new_column    
#>   <chr>       <int>  <int>      <dbl> <chr>         
#> 1 Afghanistan  1999    745       20.0 only one value
#> 2 Afghanistan  2000   2666       20.6 only one value
#> 3 Brazil       1999  37737      172.  only one value
#> 4 Brazil       2000  80488      175.  only one value
#> 5 China        1999 212258     1273.  only one value
#> 6 China        2000 213766     1280.  only one value

# Create a new variable using existing variables and save a new dataframe
table1_rate <- table1 %>%
    mutate(rate = (cases / population) * 100000)
table1_rate
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Afghanistan  2000   2666   20595360 12.9 
#> 3 Brazil       1999  37737  172006362 21.9 
#> 4 Brazil       2000  80488  174504898 46.1 
#> 5 China        1999 212258 1272915272 16.7 
#> 6 China        2000 213766 1280428583 16.7

Filter/subset the data by row based on values and conditions: filter()

Warning: Filter using logic… and humans are very bad at logic… Make sure to be very certain that you think your logic is what the code reads as logic… Lots of mistakes can be made at this stage! Especially with complex logic situations.

# when country is equal to
table1_rate %>%
    filter(country == "Brazil")
#> # A tibble: 2 x 5
#>   country  year cases population  rate
#>   <chr>   <int> <int>      <int> <dbl>
#> 1 Brazil   1999 37737  172006362  21.9
#> 2 Brazil   2000 80488  174504898  46.1

# when country is *not* equal to
table1_rate %>%
    filter(country != "Brazil")
#> # A tibble: 4 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Afghanistan  2000   2666   20595360 12.9 
#> 3 China        1999 212258 1272915272 16.7 
#> 4 China        2000 213766 1280428583 16.7

# when year is equal to
table1_rate %>%
    filter(year == 1999)
#> # A tibble: 3 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Brazil       1999  37737  172006362 21.9 
#> 3 China        1999 212258 1272915272 16.7

# Or when year is equal to or more than
table1_rate %>%
    filter(year >= 1999)
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Afghanistan  2000   2666   20595360 12.9 
#> 3 Brazil       1999  37737  172006362 21.9 
#> 4 Brazil       2000  80488  174504898 46.1 
#> 5 China        1999 212258 1272915272 16.7 
#> 6 China        2000 213766 1280428583 16.7

# when year is 1999 *and* country is Brazil
table1_rate %>%
    filter(year == 1999 & country == "Brazil")
#> # A tibble: 1 x 5
#>   country  year cases population  rate
#>   <chr>   <int> <int>      <int> <dbl>
#> 1 Brazil   1999 37737  172006362  21.9

# when year is 1999 *or* country is Brazil
table1_rate %>%
    filter(year == 1999 | country == "Brazil")
#> # A tibble: 4 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071  3.73
#> 2 Brazil       1999  37737  172006362 21.9 
#> 3 Brazil       2000  80488  174504898 46.1 
#> 4 China        1999 212258 1272915272 16.7

Converting to long form: gather()

Compare how table1 looks normally and after converting to the long form with gather().

Converting to long form makes other types of exploration much much easier. For instance, combined with group_by() and summarise():

Final exercise: Try to replicate an output.

Time: Until end of the session.

Copy the code below and run it to get the dataset to use for this exercise. Using everything you learned in this session, try to recreate the table below from the dataset given.

From the data above, try to recreate this data below in this table.
year variable Max Median Min
2000 uempmed 6.3000000 6.0000000 5.2000000
2000 unemploy_rate 0.0208329 0.0201725 0.0194601
1999 uempmed 6.9000000 6.2500000 5.8000000
1999 unemploy_rate 0.0219826 0.0210060 0.0201378
1998 uempmed 7.4000000 6.7500000 5.8000000
1998 unemploy_rate 0.0233487 0.0225944 0.0215798
1997 uempmed 8.3000000 7.9500000 7.5000000
1997 unemploy_rate 0.0263782 0.0244107 0.0230113
1996 uempmed 8.6000000 8.3000000 7.7000000
1996 unemploy_rate 0.0279358 0.0269796 0.0254912
1995 uempmed 9.1000000 8.2500000 7.9000000
1995 unemploy_rate 0.0287671 0.0278592 0.0269421
1994 uempmed 10.0000000 9.1500000 8.6000000
1994 unemploy_rate 0.0329491 0.0301162 0.0273032
1993 uempmed 8.6000000 8.3000000 8.0000000
1993 unemploy_rate 0.0360485 0.0346286 0.0323953
1992 uempmed 9.3000000 8.6500000 8.1000000
1992 unemploy_rate 0.0391287 0.0370310 0.0363734
1991 uempmed 7.8000000 6.8500000 6.0000000
1991 unemploy_rate 0.0360757 0.0342328 0.0318196
1990 uempmed 5.9000000 5.3500000 4.8000000
1990 unemploy_rate 0.0313998 0.0274685 0.0263759

This work is licensed under a Creative Commons Attribution 4.0 International License. See the licensing page for more details about copyright information.