Week 7 Practice: Combine all your skills

Pick a dataset

Download a dataset to work with from https://fels-bioinformatics.github.io/fels_bioinformatics_meetup/ (or, if there’s another dataset you’d like to use, go for it).

Provided datasets

sparrows.csv

Briefly, in 1898, Hermon Bumpus, an American biologist working at Brown University, collected data on one of the first examples of natural selection directly observed in nature. Immediately following a bad winter storm, he collected 136 English house sparrows, Passer domesticus, and brought them indoors. Of these birds, 64 had died during the storm, but 72 recovered and survived. By comparing measurements of physical traits, Bumpus claimed to detect substantial physical differences between the dead and living birds. The tidy sparrows dataset contains the following columns:

Sex = sex of the bird
Age = whether the bird was adult or young
Survival = whether the bird survived
Length = body length of the bird (mm)
Wingspread = length of the bird’s wings from wingtip to wingtip (mm)
Weight = weight of the bird (g)
Skull_Length = length of the bird’s skull (mm)
Humerus_Length = length of the bird’s long arm bone (cm)
Femur_Length = length of the bird’s long leg bone (cm)
Tarsus_Length = length of the bird’s ankle bones (cm)
Sternum_Length = length of the bird’s breastbone (cm)
Skull_Width = width of the bird’s skull (mm)

wine.tsv

The wine dataset contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample. The Type variable (named Cultivar below) has been transformed into a categorical variable. The tidy wine dataset contains the following columns:

Cultivar = a numbered factor indicating the grape cultivar the wine was made from
Alcohol = the alcohol concentration in the wine sample (g/L)
MalicAcid = the malic acid concentration in the wine sample (g/L)
Ash = the ash concentration in the wine sample (g/L)
Magnesium = the magnesium concentration in the wine sample (g/L)
TotalPhenol = the total amount of all phenol compounds in the wine sample (g/L)
Flavanoids = the concentration of all flavanoids in the wine sample (g/L)
NonflavPhenols = the concentration of all non-flavanoid phenols in the wine sample (g/L)
Color = wine color (spectrophotometric measure?)

rowan.csv

This dataset is from a field experiment studying the diversity of Chinese Rowan, or Mountain Ash, trees from the genus Sorbus. Researchers randomly sampled and recorded characteristics of leaves from three different Rowan species, and they also noted whether birds were actively nesting in each tree (recorded as y/n for yes/no). Altitude is recorded in meters (m), respiration rate (resp.rate) is recorded per unit leaf mass, and leaf length (leaf.len) is recorded in centimeters (cm). The tidy rowan dataset contains the following columns:

altitude = the altitude the rowan was found at (m)
resp.rate = the rowan’s respiration rate (nmol/s)
species = the rowan species
leaf.len = the rowan’s leaf length (cm)
nesting = yes/no flag for whether a bird was nesting in the rowan (recorded as y/n)

Note: This is the answer key, so I’ve provided example workflows for all three of the datasets above, but you should only have done one.

sparrows example

Wrangle

Import

Read in your dataset of choice (either from the list above or your own dataset) in the chunk below!

library(tidyverse)  # read_csv(), separate(), ggplot()
library(broom)      # tidy()
# sparrows.csv
sparrows <- read_csv('practice_files/sparrows2.csv')

## Parsed with column specification:
## cols(
##   Sex = col_character(),
##   Age = col_character(),
##   Survival = col_character(),
##   Length = col_integer(),
##   Wingspread = col_integer(),
##   Weight = col_double(),
##   skull_width_length = col_character(),
##   Humerus_Length = col_double(),
##   Femur_Length = col_double(),
##   Tarsus_Length = col_double(),
##   Sternum_Length = col_double()
## )

Tidy

As ever, the first thing to do is look at your data. Use the chunk below.

sparrows

## # A tibble: 136 x 11
##    Sex   Age   Survival Length Wingspread Weight skull_width_len…
##    <chr> <chr> <chr>     <int>      <int>  <dbl> <chr>           
##  1 Male  Adult Alive       154        241   24.5 14.9;31.2       
##  2 Male  Adult Alive       160        252   26.9 15.3;30.8       
##  3 Male  Adult Alive       155        243   26.9 15.3;30.6       
##  4 Male  Adult Alive       154        245   24.3 14.8;31.7       
##  5 Male  Adult Alive       156        247   24.1 14.6;31.5       
##  6 Male  Adult Alive       161        253   26.5 15.4;31.8       
##  7 Male  Adult Alive       157        251   24.6 15.5;31.1       
##  8 Male  Adult Alive       159        247   24.2 15.5;31.4       
##  9 Male  Adult Alive       158        247   23.6 15.3;29.8       
## 10 Male  Adult Alive       158        252   26.2 15.6;32         
## # ... with 126 more rows, and 4 more variables: Humerus_Length <dbl>,
## #   Femur_Length <dbl>, Tarsus_Length <dbl>, Sternum_Length <dbl>

Do you see any odd features that need to be tidied before continuing? If yes, tidy the table in the chunk below. Don’t forget to save your tidied table to another variable/object before continuing.

sparrows %>% separate(skull_width_length, into = c('Skull_Width', 'Skull_Length'), sep = ';', convert = TRUE) -> sparrows_tidy
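
One detail worth checking after a separate() call is the type of the new columns: by default they stay character, which gets in the way of plotting or testing them as numbers. The sketch below uses a small made-up data frame (toy, with an illustrative id column) in the same 'width;length' format to show what separate()'s convert argument changes. It is an illustration under those assumptions, not part of the original workflow.

```r
library(tidyr)

# Toy data for illustration only; same 'width;length' layout as the table above
toy <- data.frame(
  id = 1:2,
  skull_width_length = c('14.9;31.2', '15.3;30.8'),
  stringsAsFactors = FALSE
)

# Default behaviour: the two new columns are character
as_chr <- separate(toy, skull_width_length,
                   into = c('Skull_Width', 'Skull_Length'), sep = ';')
class(as_chr$Skull_Width)

# convert = TRUE asks separate() to guess a more specific type for each piece
as_num <- separate(toy, skull_width_length,
                   into = c('Skull_Width', 'Skull_Length'), sep = ';',
                   convert = TRUE)
class(as_num$Skull_Width)
```

If the columns come out as character in your own table, mutate() with as.numeric() is another way to get the same result.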

Understand the Data

Ask a Question

Looking at your dataset, what questions come to mind? For example, in everyone’s favorite dataset iris, you might ask if petal width is different between the three iris species. Look at your dataset and come up with a question and write it down below.

Write your question here: Does the age of sparrows affect their survival?

Visualize

Think about your question. How can you visually represent the relevant data columns? Plot your data in the chunk below.

ggplot(sparrows_tidy, aes(x = Age, fill = Survival)) + 
  geom_bar(position = 'dodge') + 
  scale_fill_manual(values = c('dodgerblue3', 'gray60')) + 
  theme_classic()

Test

Use the appropriate hypothesis test (ex: t.test() or chisq.test()) to test your question.

sparrows_tidy %>% 
  group_by(Age, Survival) %>% 
  count() %>% 
  ungroup() %>% 
  spread(Age, n) %>%
  as.data.frame() %>%  # plain data frame, so row names can be set without a warning
  column_to_rownames('Survival') %>%
  as.matrix() %>% 
  chisq.test(.) %>%
  tidy()

## # A tibble: 1 x 4
##   statistic p.value parameter method                                      
##       <dbl>   <dbl>     <int> <chr>                                       
## 1      1.12   0.290         1 Pearson's Chi-squared test with Yates' cont…
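
A contingency table like the one fed to chisq.test() above can also be built with base R's table(). The sketch below shows that route on made-up counts (the toy data frame and its numbers are invented for illustration and are not the sparrows data), so treat it as one possible alternative rather than the expected answer.

```r
# Made-up observations for illustration only (not the sparrows data):
# 20 adults (12 alive, 8 dead) and 16 young birds (6 alive, 10 dead)
toy <- data.frame(
  Age      = rep(c('Adult', 'Young'), times = c(20, 16)),
  Survival = c(rep(c('Alive', 'Dead'), times = c(12, 8)),
               rep(c('Alive', 'Dead'), times = c(6, 10)))
)

# table() cross-tabulates two categorical columns into a contingency table
counts <- table(toy$Survival, toy$Age)
counts

# chisq.test() accepts the table directly; for a 2 x 2 table it applies
# Yates' continuity correction by default (correct = TRUE)
result <- chisq.test(counts)
result$parameter  # degrees of freedom: (rows - 1) * (columns - 1)
result$p.value
```

Passing the result to tidy() gives the same one-row summary format as the output above.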

Ask a Question

Ask another question about your data!

Write your question here: Does the weight of sparrows affect their survival?

Visualize

Make another figure in the chunk below visualizing the variables you asked your second question about.

ggplot(sparrows_tidy, aes(x = Weight, fill = Survival)) + 
  geom_density(alpha = 0.5) + 
  scale_fill_manual(values = c('darkorchid4', 'gray60')) + 
  theme_classic()

Test

And again, use an appropriate hypothesis test to test your idea.

t.test(Weight ~ Survival, data = sparrows_tidy) %>% tidy()

## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
## 1   -0.648      25.2      25.9     -2.57  0.0114      118.    -1.15
## # ... with 3 more variables: conf.high <dbl>, method <chr>,
## #   alternative <chr>
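
When reading this output, it helps to know that estimate1 and estimate2 are the two group means (groups are ordered by level, which is alphabetical for character data) and that t.test() runs Welch's test by default, which is why the degrees of freedom in the parameter column need not be a whole number. The sketch below checks this on a made-up data frame (toy, with invented weights); it is an illustration, not a re-analysis of the sparrows.

```r
# Made-up weights for illustration only (not the sparrows data)
toy <- data.frame(
  Survival = rep(c('Alive', 'Dead'), each = 5),
  Weight   = c(24.5, 26.9, 24.3, 24.1, 26.5,
               25.6, 27.1, 26.0, 25.9, 26.8)
)

# Formula interface: numeric response ~ two-level grouping variable.
# The default is Welch's t-test (var.equal = FALSE).
res <- t.test(Weight ~ Survival, data = toy)

# Group means as reported by the test, in level order (Alive, then Dead)
res$estimate

# The same means computed directly, as a sanity check
group_means <- tapply(toy$Weight, toy$Survival, mean)
group_means

# Welch degrees of freedom are usually not a whole number
res$parameter
```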

Communicate

Write a few sentences here that explain what you tested and why.

Erase me and put sentences here

wine example

Wrangle

Import

Read in your dataset of choice (either from the list above or your own dataset) in the chunk below!

# wine.tsv
wine <- read_tsv('practice_files/wine2.tsv')

## Parsed with column specification:
## cols(
##   Cultivar = col_integer(),
##   Alcohol = col_double(),
##   MalicAcid = col_double(),
##   Ash = col_double(),
##   Magnesium = col_integer(),
##   Color = col_double(),
##   `phenol/flav` = col_character(),
##   value = col_double()
## )

Tidy

As ever, the first thing to do is look at your data. Use the chunk below.

wine

## # A tibble: 534 x 8
##    Cultivar Alcohol MalicAcid   Ash Magnesium Color `phenol/flav` value
##       <int>   <dbl>     <dbl> <dbl>     <int> <dbl> <chr>         <dbl>
##  1        1    14.2      1.71  2.43       127  5.64 TotalPhenol    2.8 
##  2        1    13.2      1.78  2.14       100  4.38 TotalPhenol    2.65
##  3        1    13.2      2.36  2.67       101  5.68 TotalPhenol    2.8 
##  4        1    14.4      1.95  2.5        113  7.8  TotalPhenol    3.85
##  5        1    13.2      2.59  2.87       118  4.32 TotalPhenol    2.8 
##  6        1    14.2      1.76  2.45       112  6.75 TotalPhenol    3.27
##  7        1    14.4      1.87  2.45        96  5.25 TotalPhenol    2.5 
##  8        1    14.1      2.15  2.61       121  5.05 TotalPhenol    2.6 
##  9        1    14.8      1.64  2.17        97  5.2  TotalPhenol    2.8 
## 10        1    13.9      1.35  2.27        98  7.22 TotalPhenol    2.98
## # ... with 524 more rows

Do you see any odd features that need to be tidied before continuing? If yes, tidy the table in the chunk below. Don’t forget to save your tidied table to another variable/object before continuing.

wine %>% spread(`phenol/flav`, value) %>% mutate(Cultivar = as.factor(Cultivar)) -> wine_tidy

Understand the Data

Ask a Question

Write your question here: Does the amount of alcohol in the wine differ between cultivars?

Visualize

Think about your question. How can you visually represent the relevant data columns? Plot your data in the chunk below.

library(viridis)  # provides scale_fill_viridis()
ggplot(wine_tidy, aes(x = Cultivar, y = Alcohol, fill = Cultivar)) + 
  scale_fill_viridis(discrete = TRUE, option = 'cividis') +
  geom_boxplot()

Test

Use the appropriate hypothesis test (ex: t.test() or chisq.test()) to test your question.

pairwise.t.test(wine_tidy$Alcohol, wine_tidy$Cultivar) %>% tidy()

## # A tibble: 3 x 3
##   group1 group2  p.value
## * <chr>  <chr>     <dbl>
## 1 2      1      2.47e-36
## 2 3      1      1.51e- 8
## 3 3      2      2.96e-16
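
pairwise.t.test() compares every pair of groups and adjusts the p-values for multiple testing (Holm's method unless told otherwise). A common companion step, which the workflow above does not use, is a one-way ANOVA to ask whether the groups differ at all. The sketch below shows both on made-up values (the toy data frame is invented for illustration), so read it as an optional extra rather than the required answer.

```r
# Made-up alcohol values for illustration only (not the wine data)
toy <- data.frame(
  Cultivar = factor(rep(c('1', '2', '3'), each = 4)),
  Alcohol  = c(14.2, 13.2, 13.2, 14.4,
               12.4, 12.3, 12.6, 12.0,
               13.1, 13.4, 12.9, 13.2)
)

# One-way ANOVA: does mean Alcohol differ between any of the cultivars?
fit <- aov(Alcohol ~ Cultivar, data = toy)
summary(fit)

# Pairwise comparisons between cultivars, as in the chunk above
pw <- pairwise.t.test(toy$Alcohol, toy$Cultivar)
pw$p.adjust.method  # the adjustment that was applied
pw$p.value          # lower-triangular matrix of adjusted p-values
```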

Ask a Question

Ask another question about your data!

Write your question here: Is the color of the wine different between different cultivars?

Visualize

Make another figure in the chunk below visualizing the variables you asked your second question about.

ggplot(wine_tidy, aes(x = Cultivar, y = Color, fill = Cultivar)) + 
  geom_boxplot() +
  geom_jitter(width = 0.2) + 
  theme_classic()

Test

And again, use an appropriate hypothesis test to test your idea.

pairwise.t.test(wine_tidy$Color, wine_tidy$Cultivar, p.adjust.method = 'fdr') %>% tidy()

## # A tibble: 3 x 3
##   group1 group2  p.value
## * <chr>  <chr>     <dbl>
## 1 2      1      1.93e-16
## 2 3      1      1.73e- 9
## 3 3      2      1.70e-33
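
The 'fdr' setting in the call above switches the p-value adjustment from the default ('holm') to the Benjamini-Hochberg false discovery rate method ('fdr' is an alias for 'BH'). To see what the two adjustments do, here is a minimal sketch using base R's p.adjust() on three made-up p-values (raw_p is invented for illustration and unrelated to the wine results).

```r
# Three made-up raw p-values for illustration only
raw_p <- c(0.01, 0.02, 0.03)

# Holm: controls the family-wise error rate; the more conservative of the two
holm <- p.adjust(raw_p, method = 'holm')
holm  # 0.03 0.04 0.04

# FDR (Benjamini-Hochberg): controls the expected share of false discoveries
fdr <- p.adjust(raw_p, method = 'fdr')
fdr   # 0.03 0.03 0.03
```

Holm never gives smaller adjusted p-values than FDR, so the choice matters most when many comparisons sit close to the significance threshold.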

Communicate

Write a few sentences here that explain what you tested and why.

rowan example

Wrangle

Import

Read in your dataset of choice (either from the list above or your own dataset) in the chunk below!

# rowan.csv
rowan <- read_csv('practice_files/rowan2.csv')

## Parsed with column specification:
## cols(
##   `altitude  resp.rate   nesting microphylla oligodonta  sargentiana` = col_character()
## )

Tidy

As ever, the first thing to do is look at your data. Use the chunk below.

rowan

## # A tibble: 300 x 1
##    `altitude\tresp.rate\tnesting\tmicrophylla\toligodonta\tsargentiana`
##    <chr>                                                               
##  1 "90\t0.041\ty\tNA\tNA\t28.6"                                        
##  2 "93\t0.116\ty\t8.8\tNA\tNA"                                         
##  3 "152\t0.105\ty\tNA\tNA\t30.2"                                       
##  4 "167\t0.074\tn\tNA\t11\tNA"                                         
##  5 "184\t0.181\tn\tNA\tNA\t21.7"                                       
##  6 "193\t0.043\tn\t7.3\tNA\tNA"                                        
##  7 "199\t0.068\tn\tNA\tNA\t39.3"                                       
##  8 "208\t0.062\tn\tNA\t17.8\tNA"                                       
##  9 "218\t0.048\tn\t8.4\tNA\tNA"                                        
## 10 "224\t0.247\tn\tNA\t10.5\tNA"                                       
## # ... with 290 more rows

Do you see any odd features that need to be tidied before continuing? If yes, tidy the table in the chunk below. Don’t forget to save your tidied table to another variable/object before continuing.

rowan %>% separate(`altitude\tresp.rate\tnesting\tmicrophylla\toligodonta\tsargentiana`, 
                   into = c('altitude', 'resp.rate', 'nesting', 'microphylla', 'oligodonta', 'sargentiana'),
                   sep = '\t', convert = TRUE) %>%
  gather(species, leaf.len, microphylla:sargentiana) %>%
  filter(!is.na(leaf.len)) -> rowan_tidy
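
Gathering the three species columns leaves one row per tree and species, and most of those rows are empty because each tree belongs to a single species. As a possible alternative to filtering afterwards, gather() can drop the missing values itself through its na.rm argument. The sketch below shows this on a three-row made-up data frame (toy) laid out like the rowan table; it is an illustration, not the original solution.

```r
library(tidyr)

# Toy data for illustration only; one leaf length per row, NA elsewhere
toy <- data.frame(
  altitude    = c(90, 93, 167),
  microphylla = c(NA, 8.8, NA),
  oligodonta  = c(NA, NA, 11),
  sargentiana = c(28.6, NA, NA)
)

# Column names become the 'species' key and cell values become 'leaf.len';
# na.rm = TRUE drops the rows whose leaf.len would be NA
long <- gather(toy, species, leaf.len, microphylla:sargentiana, na.rm = TRUE)
long
```

Each tree ends up on exactly one row, paired with the species whose column held its measurement.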

Understand the Data

Ask a Question

Write your question here: Do different species live at different altitudes?

Visualize

Think about your question. How can you visually represent the relevant data columns? Plot your data in the chunk below.

ggplot(rowan_tidy, aes(x = species, y = altitude, fill = species)) + 
  geom_violin(alpha = 0.8) +
  scale_fill_manual(values = c('darkorange1', 'deepskyblue', 'firebrick'))

Test

Use the appropriate hypothesis test (ex: t.test() or chisq.test()) to test your question.

pairwise.t.test(rowan_tidy$altitude, rowan_tidy$species) %>% tidy()

## # A tibble: 3 x 3
##   group1      group2      p.value
## * <chr>       <chr>         <dbl>
## 1 oligodonta  microphylla  0.176 
## 2 sargentiana microphylla  0.483 
## 3 sargentiana oligodonta   0.0492

Ask a Question

Ask another question about your data!

Write your question here: Does nesting happen more often in one species vs another?

Visualize

Make another figure in the chunk below visualizing the variables you asked your second question about.

ggplot(rowan_tidy, aes(x = species, fill = nesting)) + 
  geom_bar(position = 'dodge') +
  scale_fill_manual(values = c('firebrick4', 'turquoise4')) +
  theme_classic()

Test

And again, use an appropriate hypothesis test to test your idea.

rowan_tidy %>%
  group_by(nesting, species) %>%
  count() %>%
  ungroup() %>%
  spread(species, n) %>%
  as.data.frame() %>%  # plain data frame, so row names can be set without a warning
  column_to_rownames('nesting') %>%
  chisq.test(.) %>% 
  tidy()

## # A tibble: 1 x 4
##   statistic  p.value parameter method                    
##       <dbl>    <dbl>     <int> <chr>                     
## 1      89.1 4.42e-20         2 Pearson's Chi-squared test

Communicate

Write a few sentences here that explain what you tested and why.