Data Wrangling with tidyr

The tidyr package has some built in data-sets specifically for practicing tidying, so they’ll be used for the demonstration today.

Tidy Data

The first example data table in tidyr is the cleverly named “table1”. This table is tidy! The other tableNs in this dataset are untidy versions of this one to practice on.

table1
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

As you can see when you look at table1, each column is a variable and contains only values of the same type belonging to that variable. Going foward, we’ll be tidying messed up versions of table1

tidyr functions

spread()

If you look at table 2 in the chunk below, it’s not tidy because the column ‘type’ contains two variables, cases and population. Cases and population are separate measurements, so they should be their own columns. To make more columns from existing columns, use spread().

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table2 %>% spread(type, count)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

gather()

Looking at table4a in the chunk below, it’s not tidy because the columns 1999 and 2000 are not variable descriptions, they’re data (the year the measurement was made in). We need the column names to be values in a column. To do that we’ll use gather.

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
# you can gather by specifying column names
table4a %>% gather(year, cases, `1999`:`2000`)
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
# also with column numbers
table4a %>% gather(year, cases, 2:3)
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

separate()

In the chunk below, table3 is not tidy because it has multiple values separate by a backslash in the rate column. To split one column into two or more columns, use separate()

table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
# separate example
table3 %>% separate(rate, into = c('cases', 'population'), sep = '/')
## # A tibble: 6 x 4
##   country      year cases  population
## * <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

separate() also lets you specify an interger position to split at

# just so you know, you can also separate by specifying a position to separate after
table3 %>% separate(rate, into = c('cases', 'population'), sep = 3)
## # A tibble: 6 x 4
##   country      year cases population    
## * <chr>       <int> <chr> <chr>         
## 1 Afghanistan  1999 745   /19987071     
## 2 Afghanistan  2000 266   6/20595360    
## 3 Brazil       1999 377   37/172006362  
## 4 Brazil       2000 804   88/174504898  
## 5 China        1999 212   258/1272915272
## 6 China        2000 213   766/1280428583

unite()

In the chunk below, table5 is not tidy because the date is split between two columns, century and year. To combine columns, use the opposite of separate(), unite()

table5
## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
table5 %>% unite(year, c('century', 'year'), sep = '')
## # A tibble: 6 x 3
##   country     year  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

You may have noticed that the rate column in table5 is the same as the rate column in table3 and contains two values. Chain together some of the tidyr functions above to completely tidy table5.

table5 %>% 
  unite(year, c('century', 'year'), sep = '') %>% 
  separate(rate, into = c('cases', 'population'), sep = '/')
## # A tibble: 6 x 4
##   country     year  cases  population
##   <chr>       <chr> <chr>  <chr>     
## 1 Afghanistan 1999  745    19987071  
## 2 Afghanistan 2000  2666   20595360  
## 3 Brazil      1999  37737  172006362 
## 4 Brazil      2000  80488  174504898 
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

Combining Tables

Binds and joins are in the dplyr package, but fit in better with data wrangling, so they were saved for this week.

Binds

Attach tables together. In order to bind tables, the tables being bound together have to have the same number of rows OR the same number of columns (depending if you’re binding by rows or columns).

rbind() / bind_rows()

Attach tables together by adding additional rows. When you bind by rows, the second table is added to the bottom of the first table.

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583
# base R
rbind(table4a, table4b)
## # A tibble: 6 x 3
##   country         `1999`     `2000`
##   <chr>            <int>      <int>
## 1 Afghanistan        745       2666
## 2 Brazil           37737      80488
## 3 China           212258     213766
## 4 Afghanistan   19987071   20595360
## 5 Brazil       172006362  174504898
## 6 China       1272915272 1280428583
# tidyverse, dplyr
bind_rows(table4a, table4b)
## # A tibble: 6 x 3
##   country         `1999`     `2000`
##   <chr>            <int>      <int>
## 1 Afghanistan        745       2666
## 2 Brazil           37737      80488
## 3 China           212258     213766
## 4 Afghanistan   19987071   20595360
## 5 Brazil       172006362  174504898
## 6 China       1272915272 1280428583

The difference between rbind() and bind_rows() is that if you have a different number of columns, rbind() will refuse to work and return an error, while bind_rows() will keep all columns from all tables and add NAs where there are no values.

cbind() / bind_cols()

Attach tables by adding additional columns.

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583
# base R
cbind(table4a, table4b)
##       country   1999   2000     country       1999       2000
## 1 Afghanistan    745   2666 Afghanistan   19987071   20595360
## 2      Brazil  37737  80488      Brazil  172006362  174504898
## 3       China 212258 213766       China 1272915272 1280428583
# tidyverse, dplyr
bind_cols(table4a, table4b)
## # A tibble: 3 x 6
##   country     `1999` `2000` country1       `19991`    `20001`
##   <chr>        <int>  <int> <chr>            <int>      <int>
## 1 Afghanistan    745   2666 Afghanistan   19987071   20595360
## 2 Brazil       37737  80488 Brazil       172006362  174504898
## 3 China       212258 213766 China       1272915272 1280428583

Joins

Joins combine two tables together by common variables. The difference between binds and joins is that binds are intended to combine tables together that have exactly the same information, like for different samples, while joins are intended to combine together tables with complimentary but different information by common variables, like samples with their metadata. The *_join() functions combine either by the columns you specify or by auto-detecting columns with the same name. Mutating joins add additional columns, while filtering joins filter a table by another table.

We’ll use the band_* tables from the dplyr package to practice joins. As you can see when you run the tables in the chunk below, the tables have information in common, musician names, but complimentary information, what band they belong to in one table and what instruments they play in the others.

band_members
## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
band_instruments
## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
band_instruments2
## # A tibble: 3 x 2
##   artist plays 
##   <chr>  <chr> 
## 1 John   guitar
## 2 Paul   bass  
## 3 Keith  guitar

The descriptions below refer to x and y in the join. x is the first table listed in the join and y is the second. The order tables are given in joins is important!

Mutating Joins

Mutating joins add columns onto a table from another table.

inner_join()

Joins tables by returning all rows where x and y equal each other (in the join columns) and all columns from both x and y.

band_members
## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
band_instruments
## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
inner_join(band_members, band_instruments, by = c('name'))
## # A tibble: 2 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass
full_join()

The opposite of inner_join(), full_join() joins tables by returning all rows all columns from both x and y. If there is no matching value between x and y, an NA will be returned.

band_members
## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
band_instruments
## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
full_join(band_members, band_instruments, by = c('name'))
## # A tibble: 4 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar
left_join()

Joins tables by return all rows from x and all columns from x and y. If there is no matching value in y for x, an NA will be returned.

band_members
## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
band_instruments
## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
left_join(band_members, band_instruments, by = c('name'))
## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass
right_join()

Joins tables by the opposite of left_join(), returning all rows from y and all columns from x and y. If there is no matching value in x for y, an NA will be returned.

band_members
## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
band_instruments
## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
right_join(band_members, band_instruments, by = c('name'))
## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass  
## 3 Keith <NA>    guitar

Filtering Joins

Filtering joins filter a table by another table.

semi_join()

Filters x by returning all rows from x that match values in y.

band_members
## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
band_instruments
## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
semi_join(band_members, band_instruments, by = c('name'))
## # A tibble: 2 x 2
##   name  band   
##   <chr> <chr>  
## 1 John  Beatles
## 2 Paul  Beatles

anti_join()

The opposite of semi_join(), anti_join() returns rows from x that DON’T match a value in y

band_members
## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
band_instruments
## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
anti_join(band_members, band_instruments, by = c('name'))
## # A tibble: 1 x 2
##   name  band  
##   <chr> <chr> 
## 1 Mick  Stones

What if my column names don’t match?

Name and artist are the same variable, who the person is, in the tables below.

band_members
## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
band_instruments2
## # A tibble: 3 x 2
##   artist plays 
##   <chr>  <chr> 
## 1 John   guitar
## 2 Paul   bass  
## 3 Keith  guitar

To specify the join by two differently named columns, specify that they equal each other. You MUST list the column names in the order the tables are given to the join. ‘name’ comes first below, because that’s the column you want to join by in the first table.

full_join(band_members, band_instruments2, by = c('name' = 'artist'))
## # A tibble: 4 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar

Dealing with NAs

If there’s no value in a table, or R can’t figure out what’s supposed to be there, it will give you an NA as seen in the Star Wars table below

starwars
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

NAs are bad not only because you’re missing data, but also because many functions in R either refuse to run or return weird answers if there are NAs in the data.

mean(starwars$birth_year)
## [1] NA

na.omit()

One strategy is just to drop all rows that contain NAs

starwars
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>
starwars %>% na.omit()
## # A tibble: 29 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 Dart…    202   136 none       white      yellow          41.9 male  
##  3 Leia…    150    49 brown      light      brown           19   female
##  4 Owen…    178   120 brown, gr… light      blue            52   male  
##  5 Beru…    165    75 brown      light      blue            47   female
##  6 Bigg…    183    84 black      light      brown           24   male  
##  7 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
##  8 Anak…    188    84 blond      fair       blue            41.9 male  
##  9 Chew…    228   112 brown      unknown    blue           200   male  
## 10 Han …    180    80 brown      fair       brown           29   male  
## # ... with 19 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

replace_na()

Sometimes you might want to replace NAs with a value though, which you can do using tidyr

starwars
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>
starwars %>% replace_na(list(hair_color = 'none', birth_year = 'unknown', gender = 'n/a'))
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>     <chr>      <chr> 
##  1 Luke…    172    77 blond      fair       blue      19         male  
##  2 C-3PO    167    75 none       gold       yellow    112        n/a   
##  3 R2-D2     96    32 none       white, bl… red       33         n/a   
##  4 Dart…    202   136 none       white      yellow    41.9       male  
##  5 Leia…    150    49 brown      light      brown     19         female
##  6 Owen…    178   120 brown, gr… light      blue      52         male  
##  7 Beru…    165    75 brown      light      blue      47         female
##  8 R5-D4     97    32 none       white, red red       unknown    n/a   
##  9 Bigg…    183    84 black      light      brown     24         male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray 57         male  
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

Putting it together

# what the tidy table should look like
table1
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
# the tables we need to tidy into one table
table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583
# tidy individual tables
table4a %>% gather(year, cases, `1999`:`2000`) -> table4a_tidy
table4b %>% gather(year, population, `1999`:`2000`) -> table4b_tidy

# combine
left_join(table4a_tidy, table4b_tidy, by = c('country', 'year'))
## # A tibble: 6 x 4
##   country     year   cases population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Brazil      1999   37737  172006362
## 3 China       1999  212258 1272915272
## 4 Afghanistan 2000    2666   20595360
## 5 Brazil      2000   80488  174504898
## 6 China       2000  213766 1280428583