The tidyr package has some built in data-sets specifically for practicing tidying, so they’ll be used for the demonstration today.
The first example data table in tidyr is the cleverly named “table1”. This table is tidy! The other tableNs in this dataset are untidy versions of this one to practice on.
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
As you can see when you look at table1, each column is a variable and contains only values of the same type belonging to that variable. Going foward, we’ll be tidying messed up versions of table1
spread()
If you look at table 2 in the chunk below, it’s not tidy because the column ‘type’ contains two variables, cases and population. Cases and population are separate measurements, so they should be their own columns. To make more columns from existing columns, use spread()
.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table2 %>% spread(type, count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
gather()
Looking at table4a in the chunk below, it’s not tidy because the columns 1999 and 2000 are not variable descriptions, they’re data (the year the measurement was made in). We need the column names to be values in a column. To do that we’ll use gather.
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
# you can gather by specifying column names
table4a %>% gather(year, cases, `1999`:`2000`)
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
# also with column numbers
table4a %>% gather(year, cases, 2:3)
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
separate()
In the chunk below, table3 is not tidy because it has multiple values separate by a backslash in the rate column. To split one column into two or more columns, use separate()
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
# separate example
table3 %>% separate(rate, into = c('cases', 'population'), sep = '/')
## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
separate()
also lets you specify an interger position to split at
# just so you know, you can also separate by specifying a position to separate after
table3 %>% separate(rate, into = c('cases', 'population'), sep = 3)
## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 /19987071
## 2 Afghanistan 2000 266 6/20595360
## 3 Brazil 1999 377 37/172006362
## 4 Brazil 2000 804 88/174504898
## 5 China 1999 212 258/1272915272
## 6 China 2000 213 766/1280428583
unite()
In the chunk below, table5 is not tidy because the date is split between two columns, century and year. To combine columns, use the opposite of separate()
, unite()
table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
table5 %>% unite(year, c('century', 'year'), sep = '')
## # A tibble: 6 x 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
You may have noticed that the rate column in table5 is the same as the rate column in table3 and contains two values. Chain together some of the tidyr functions above to completely tidy table5.
table5 %>%
unite(year, c('century', 'year'), sep = '') %>%
separate(rate, into = c('cases', 'population'), sep = '/')
## # A tibble: 6 x 4
## country year cases population
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Binds and joins are in the dplyr package, but fit in better with data wrangling, so they were saved for this week.
Attach tables together. In order to bind tables, the tables being bound together have to have the same number of rows OR the same number of columns (depending if you’re binding by rows or columns).
rbind()
/ bind_rows()
Attach tables together by adding additional rows. When you bind by rows, the second table is added to the bottom of the first table.
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
# base R
rbind(table4a, table4b)
## # A tibble: 6 x 3
## country `1999` `2000`
## <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
## 4 Afghanistan 19987071 20595360
## 5 Brazil 172006362 174504898
## 6 China 1272915272 1280428583
# tidyverse, dplyr
bind_rows(table4a, table4b)
## # A tibble: 6 x 3
## country `1999` `2000`
## <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
## 4 Afghanistan 19987071 20595360
## 5 Brazil 172006362 174504898
## 6 China 1272915272 1280428583
The difference between rbind()
and bind_rows()
is that if you have a different number of columns, rbind()
will refuse to work and return an error, while bind_rows()
will keep all columns from all tables and add NAs where there are no values.
cbind()
/ bind_cols()
Attach tables by adding additional columns.
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
# base R
cbind(table4a, table4b)
## country 1999 2000 country 1999 2000
## 1 Afghanistan 745 2666 Afghanistan 19987071 20595360
## 2 Brazil 37737 80488 Brazil 172006362 174504898
## 3 China 212258 213766 China 1272915272 1280428583
# tidyverse, dplyr
bind_cols(table4a, table4b)
## # A tibble: 3 x 6
## country `1999` `2000` country1 `19991` `20001`
## <chr> <int> <int> <chr> <int> <int>
## 1 Afghanistan 745 2666 Afghanistan 19987071 20595360
## 2 Brazil 37737 80488 Brazil 172006362 174504898
## 3 China 212258 213766 China 1272915272 1280428583
Joins combine two tables together by common variables. The difference between binds and joins is that binds are intended to combine tables together that have exactly the same information, like for different samples, while joins are intended to combine together tables with complimentary but different information by common variables, like samples with their metadata. The *_join()
functions combine either by the columns you specify or by auto-detecting columns with the same name. Mutating joins add additional columns, while filtering joins filter a table by another table.
We’ll use the band_* tables from the dplyr package to practice joins. As you can see when you run the tables in the chunk below, the tables have information in common, musician names, but complimentary information, what band they belong to in one table and what instruments they play in the others.
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
band_instruments2
## # A tibble: 3 x 2
## artist plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
The descriptions below refer to x and y in the join. x is the first table listed in the join and y is the second. The order tables are given in joins is important!
Mutating joins add columns onto a table from another table.
inner_join()
Joins tables by returning all rows where x and y equal each other (in the join columns) and all columns from both x and y.
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
inner_join(band_members, band_instruments, by = c('name'))
## # A tibble: 2 x 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
full_join()
The opposite of inner_join()
, full_join()
joins tables by returning all rows all columns from both x and y. If there is no matching value between x and y, an NA will be returned.
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
full_join(band_members, band_instruments, by = c('name'))
## # A tibble: 4 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
left_join()
Joins tables by return all rows from x and all columns from x and y. If there is no matching value in y for x, an NA will be returned.
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
left_join(band_members, band_instruments, by = c('name'))
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
right_join()
Joins tables by the opposite of left_join()
, returning all rows from y and all columns from x and y. If there is no matching value in x for y, an NA will be returned.
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
right_join(band_members, band_instruments, by = c('name'))
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
## 3 Keith <NA> guitar
Filtering joins filter a table by another table.
semi_join()
Filters x by returning all rows from x that match values in y.
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
semi_join(band_members, band_instruments, by = c('name'))
## # A tibble: 2 x 2
## name band
## <chr> <chr>
## 1 John Beatles
## 2 Paul Beatles
anti_join()
The opposite of semi_join()
, anti_join()
returns rows from x that DON’T match a value in y
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
anti_join(band_members, band_instruments, by = c('name'))
## # A tibble: 1 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
Name and artist are the same variable, who the person is, in the tables below.
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments2
## # A tibble: 3 x 2
## artist plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
To specify the join by two differently named columns, specify that they equal each other. You MUST list the column names in the order the tables are given to the join. ‘name’ comes first below, because that’s the column you want to join by in the first table.
full_join(band_members, band_instruments2, by = c('name' = 'artist'))
## # A tibble: 4 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
If there’s no value in a table, or R can’t figure out what’s supposed to be there, it will give you an NA
as seen in the Star Wars table below
starwars
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
NAs are bad not only because you’re missing data, but also because many functions in R either refuse to run or return weird answers if there are NAs in the data.
mean(starwars$birth_year)
## [1] NA
na.omit()
One strategy is just to drop all rows that contain NAs
starwars
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
starwars %>% na.omit()
## # A tibble: 29 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 Dart… 202 136 none white yellow 41.9 male
## 3 Leia… 150 49 brown light brown 19 female
## 4 Owen… 178 120 brown, gr… light blue 52 male
## 5 Beru… 165 75 brown light blue 47 female
## 6 Bigg… 183 84 black light brown 24 male
## 7 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## 8 Anak… 188 84 blond fair blue 41.9 male
## 9 Chew… 228 112 brown unknown blue 200 male
## 10 Han … 180 80 brown fair brown 29 male
## # ... with 19 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
replace_na()
Sometimes you might want to replace NAs with a value though, which you can do using tidyr
starwars
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
starwars %>% replace_na(list(hair_color = 'none', birth_year = 'unknown', gender = 'n/a'))
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 none gold yellow 112 n/a
## 3 R2-D2 96 32 none white, bl… red 33 n/a
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 none white, red red unknown n/a
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
# what the tidy table should look like
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
# the tables we need to tidy into one table
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
# tidy individual tables
table4a %>% gather(year, cases, `1999`:`2000`) -> table4a_tidy
table4b %>% gather(year, population, `1999`:`2000`) -> table4b_tidy
# combine
left_join(table4a_tidy, table4b_tidy, by = c('country', 'year'))
## # A tibble: 6 x 4
## country year cases population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Brazil 1999 37737 172006362
## 3 China 1999 212258 1272915272
## 4 Afghanistan 2000 2666 20595360
## 5 Brazil 2000 80488 174504898
## 6 China 2000 213766 1280428583