Pick a dataset to work with and download it from the website https://fels-bioinformatics.github.io/fels_bioinformatics_meetup/ (or, if there’s another dataset you’d like to use, go for it).
Briefly, in 1898, Hermon Bumpus, an American biologist working at Brown University, collected data on one of the first examples of natural selection directly observed in nature. Immediately following a bad winter storm, he collected 136 English house sparrows, Passer domesticus, and brought them indoors. Of these birds, 64 had died during the storm, but 72 recovered and survived. By comparing measurements of physical traits, Bumpus claimed to detect substantial physical differences between the dead and living birds. The tidy sparrows dataset contains the following columns:
The wine dataset contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample. The Type variable has been transformed into a categorical variable. The tidy wine dataset contains the following columns:
This dataset is from a field experiment studying the diversity of Chinese Rowan, or Mountain Ash, trees from the genus Sorbus. Researchers randomly sampled and recorded characteristics of leaves from three different Rowan species, and they further noted whether birds were actively nesting in each tree (recorded as y/n for yes/no). Altitude is recorded in meters (m), respiration rate (resp.rate) is recorded per unit leaf mass, and leaf length (leaf.len) is recorded in centimeters (cm). The tidy rowan dataset contains the following columns:
Note: This is the answer key, so I’ve provided example workflows for all three of the datasets above, but you should only have done one.
Read in your dataset of choice (either from the list above or your own dataset) in the chunk below!
library(tidyverse)  # read_csv(), separate(), ggplot(), ...
library(broom)      # tidy()
# sparrows.csv
sparrows <- read_csv('practice_files/sparrows2.csv')
## Parsed with column specification:
## cols(
## Sex = col_character(),
## Age = col_character(),
## Survival = col_character(),
## Length = col_integer(),
## Wingspread = col_integer(),
## Weight = col_double(),
## skull_width_length = col_character(),
## Humerus_Length = col_double(),
## Femur_Length = col_double(),
## Tarsus_Length = col_double(),
## Sternum_Length = col_double()
## )
As ever, first thing is to look at your data. Use the chunk below.
sparrows
## # A tibble: 136 x 11
## Sex Age Survival Length Wingspread Weight skull_width_len…
## <chr> <chr> <chr> <int> <int> <dbl> <chr>
## 1 Male Adult Alive 154 241 24.5 14.9;31.2
## 2 Male Adult Alive 160 252 26.9 15.3;30.8
## 3 Male Adult Alive 155 243 26.9 15.3;30.6
## 4 Male Adult Alive 154 245 24.3 14.8;31.7
## 5 Male Adult Alive 156 247 24.1 14.6;31.5
## 6 Male Adult Alive 161 253 26.5 15.4;31.8
## 7 Male Adult Alive 157 251 24.6 15.5;31.1
## 8 Male Adult Alive 159 247 24.2 15.5;31.4
## 9 Male Adult Alive 158 247 23.6 15.3;29.8
## 10 Male Adult Alive 158 252 26.2 15.6;32
## # ... with 126 more rows, and 4 more variables: Humerus_Length <dbl>,
## # Femur_Length <dbl>, Tarsus_Length <dbl>, Sternum_Length <dbl>
Do you see any odd features that need to be tidied before continuing? If yes, tidy the table in the chunk below. Don’t forget to save your tidied table to another variable/object before continuing.
sparrows %>% separate(skull_width_length, into = c('Skull_Width', 'Skull_Length'), sep = ';') -> sparrows_tidy
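By default, separate() returns the new columns as character. The sketch below uses a hypothetical two-row stand-in for the sparrows table (the name toy is made up; the values are copied from the printout above) and shows how convert = TRUE asks separate() to re-type the pieces as numbers, which is useful if you want to plot or test the skull measurements later.

```r
library(tibble)
library(tidyr)

# Hypothetical two-row stand-in for the sparrows table
toy <- tibble(
  Sex = c('Male', 'Male'),
  skull_width_length = c('14.9;31.2', '15.3;30.8')
)

toy_tidy <- toy %>%
  separate(skull_width_length,
           into = c('Skull_Width', 'Skull_Length'),
           sep = ';',
           convert = TRUE)  # re-type the new columns instead of leaving them as character

str(toy_tidy)
```

Without convert = TRUE, Skull_Width and Skull_Length would stay character, and numeric functions such as mean() would not work on them.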
Looking at your dataset, what questions come to mind? For example, in everyone’s favorite dataset iris, you might ask if petal width is different between the three iris species. Look at your dataset and come up with a question and write it down below.
Write your question here: Does the age of sparrows affect their survival?
Think about your question. How can you visually represent the relevant data columns? Plot your data in the chunk below.
ggplot(sparrows_tidy, aes(x = Age, fill = Survival)) +
  geom_bar(position = 'dodge') +
  scale_fill_manual(values = c('dodgerblue3', 'gray60')) +
  theme_classic()
Use the appropriate hypothesis test (ex: t.test() or chisq.test()) to test your question.
sparrows_tidy %>%
  group_by(Age, Survival) %>%
  count() %>%
  ungroup() %>%
  spread(Age, n) %>%
  column_to_rownames('Survival') %>%
  as.matrix() %>%
  chisq.test(.) %>%
  tidy()
## Warning: Setting row names on a tibble is deprecated.
## # A tibble: 1 x 4
## statistic p.value parameter method
## <dbl> <dbl> <int> <chr>
## 1 1.12 0.290 1 Pearson's Chi-squared test with Yates' cont…
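chisq.test() also accepts a contingency table built with table(), which can be a shorter route than reshaping by hand. The sketch below uses hypothetical toy vectors (age, survival, and their counts are made up, NOT the real sparrow numbers) to show the idea, and also pulls out the expected counts, which are worth checking because the chi-squared approximation gets unreliable when they are small.

```r
# Hypothetical toy data, NOT the real sparrow counts
age      <- c(rep('Adult', 30), rep('Juvenile', 20))
survival <- c(rep('Alive', 18), rep('Dead', 12),   # the 30 adults
              rep('Alive', 8),  rep('Dead', 12))   # the 20 juveniles

counts <- table(survival, age)   # 2 x 2 contingency table
res <- chisq.test(counts)        # Yates' continuity correction is the default for 2 x 2 tables

counts
res$expected   # expected counts under independence
res$p.value
```

With the real data, the equivalent call would be along the lines of chisq.test(table(sparrows_tidy$Survival, sparrows_tidy$Age)).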
Ask another question about your data!
Write your question here: Does the weight of sparrows affect their survival?
Make another figure in the chunk below visualizing the variables you asked your second question about.
ggplot(sparrows_tidy, aes(x = Weight, fill = Survival)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c('darkorchid4', 'gray60')) +
  theme_classic()
And again, use an appropriate hypothesis test to test your idea.
t.test(Weight ~ Survival, data = sparrows_tidy) %>% tidy()
## # A tibble: 1 x 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.648 25.2 25.9 -2.57 0.0114 118. -1.15
## # ... with 3 more variables: conf.high <dbl>, method <chr>,
## # alternative <chr>
Write a few sentences here that explain what you tested and why.
Erase me and put sentences here
Read in your dataset of choice (either from the list above or your own dataset) in the chunk below!
# wine.tsv
wine <- read_tsv('practice_files/wine2.tsv')
## Parsed with column specification:
## cols(
## Cultivar = col_integer(),
## Alcohol = col_double(),
## MalicAcid = col_double(),
## Ash = col_double(),
## Magnesium = col_integer(),
## Color = col_double(),
## `phenol/flav` = col_character(),
## value = col_double()
## )
As ever, first thing is to look at your data. Use the chunk below.
wine
## # A tibble: 534 x 8
## Cultivar Alcohol MalicAcid Ash Magnesium Color `phenol/flav` value
## <int> <dbl> <dbl> <dbl> <int> <dbl> <chr> <dbl>
## 1 1 14.2 1.71 2.43 127 5.64 TotalPhenol 2.8
## 2 1 13.2 1.78 2.14 100 4.38 TotalPhenol 2.65
## 3 1 13.2 2.36 2.67 101 5.68 TotalPhenol 2.8
## 4 1 14.4 1.95 2.5 113 7.8 TotalPhenol 3.85
## 5 1 13.2 2.59 2.87 118 4.32 TotalPhenol 2.8
## 6 1 14.2 1.76 2.45 112 6.75 TotalPhenol 3.27
## 7 1 14.4 1.87 2.45 96 5.25 TotalPhenol 2.5
## 8 1 14.1 2.15 2.61 121 5.05 TotalPhenol 2.6
## 9 1 14.8 1.64 2.17 97 5.2 TotalPhenol 2.8
## 10 1 13.9 1.35 2.27 98 7.22 TotalPhenol 2.98
## # ... with 524 more rows
Do you see any odd features that need to be tidied before continuing? If yes, tidy the table in the chunk below. Don’t forget to save your tidied table to another variable/object before continuing.
wine %>% spread(`phenol/flav`, value) %>% mutate(Cultivar = as.factor(Cultivar)) -> wine_tidy
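In tidyr 1.0.0 and later, pivot_wider() is the recommended replacement for spread(). The sketch below compares the two on a hypothetical miniature of the long-format wine table (the id column, the OtherMeasure key, and all values are made up for illustration).

```r
library(tibble)
library(tidyr)

# Hypothetical miniature of the long-format table: two wines, two measurements each
toy <- tibble(
  id            = c(1, 2, 1, 2),
  Cultivar      = c(1L, 1L, 1L, 1L),
  `phenol/flav` = c('TotalPhenol', 'TotalPhenol', 'OtherMeasure', 'OtherMeasure'),
  value         = c(2.80, 2.65, 3.06, 2.76)
)

# Older interface, as used in this answer key
wide_spread <- toy %>% spread(`phenol/flav`, value)

# Newer interface (assumes tidyr >= 1.0.0)
wide_pivot <- toy %>% pivot_wider(names_from = `phenol/flav`, values_from = value)

wide_spread
wide_pivot
```

Both calls give one row per wine with one column per measurement; the order of the new columns may differ between the two.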
Looking at your dataset, what questions come to mind? For example, in everyone’s favorite dataset iris, you might ask if petal width is different between the three iris species. Look at your dataset and come up with a question and write it down below.
Write your question here: Does the amount of alcohol in the wine differ between cultivars?
Think about your question. How can you visually represent the relevant data columns? Plot your data in the chunk below.
library(viridis)  # provides scale_fill_viridis()
ggplot(wine_tidy, aes(x = Cultivar, y = Alcohol, fill = Cultivar)) +
  scale_fill_viridis(discrete = TRUE, option = 'cividis') +
  geom_boxplot()
Use the appropriate hypothesis test (ex: t.test() or chisq.test()) to test your question.
pairwise.t.test(wine_tidy$Alcohol, wine_tidy$Cultivar) %>% tidy()
## # A tibble: 3 x 3
## group1 group2 p.value
## * <chr> <chr> <dbl>
## 1 2 1 2.47e-36
## 2 3 1 1.51e- 8
## 3 3 2 2.96e-16
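With three groups there are three pairwise comparisons, so the p-values need a multiple-testing adjustment; pairwise.t.test() applies Holm's method unless told otherwise. A common companion is an omnibus one-way ANOVA, which asks whether any group mean differs at all. The sketch below illustrates both on the built-in iris data (the petal-width question from the prompt above), not on the wine data.

```r
# iris ships with R; Petal.Width by Species mirrors the example question in the prompt
fit <- aov(Petal.Width ~ Species, data = iris)
summary(fit)   # omnibus F test: is any species mean different?

# Pairwise follow-up; p-values are Holm-adjusted by default
pw <- pairwise.t.test(iris$Petal.Width, iris$Species)
pw$p.adjust.method
pw$p.value     # lower-triangular matrix of adjusted p-values
```

The same pattern would apply to the wine data by swapping in Alcohol ~ Cultivar and wine_tidy.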
Ask another question about your data!
Write your question here: Is the color of the wine different between different cultivars?
Make another figure in the chunk below visualizing the variables you asked your second question about.
ggplot(wine_tidy, aes(x = Cultivar, y = Color, fill = Cultivar)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  theme_classic()
And again, use an appropriate hypothesis test to test your idea.
pairwise.t.test(wine_tidy$Color, wine_tidy$Cultivar, p.adjust.method = 'fdr') %>% tidy()
## # A tibble: 3 x 3
## group1 group2 p.value
## * <chr> <chr> <dbl>
## 1 2 1 1.93e-16
## 2 3 1 1.73e- 9
## 3 3 2 1.70e-33
Write a few sentences here that explain what you tested and why.
Read in your dataset of choice (either from the list above or your own dataset) in the chunk below!
# rowan.csv
rowan <- read_csv('practice_files/rowan2.csv')
## Parsed with column specification:
## cols(
## `altitude resp.rate nesting microphylla oligodonta sargentiana` = col_character()
## )
As ever, first thing is to look at your data. Use the chunk below.
rowan
## # A tibble: 300 x 1
## `altitude\tresp.rate\tnesting\tmicrophylla\toligodonta\tsargentiana`
## <chr>
## 1 "90\t0.041\ty\tNA\tNA\t28.6"
## 2 "93\t0.116\ty\t8.8\tNA\tNA"
## 3 "152\t0.105\ty\tNA\tNA\t30.2"
## 4 "167\t0.074\tn\tNA\t11\tNA"
## 5 "184\t0.181\tn\tNA\tNA\t21.7"
## 6 "193\t0.043\tn\t7.3\tNA\tNA"
## 7 "199\t0.068\tn\tNA\tNA\t39.3"
## 8 "208\t0.062\tn\tNA\t17.8\tNA"
## 9 "218\t0.048\tn\t8.4\tNA\tNA"
## 10 "224\t0.247\tn\tNA\t10.5\tNA"
## # ... with 290 more rows
Do you see any odd features that need to be tidied before continuing? If yes, tidy the table in the chunk below. Don’t forget to save your tidied table to another variable/object before continuing.
rowan %>%
  separate(`altitude\tresp.rate\tnesting\tmicrophylla\toligodonta\tsargentiana`,
           into = c('altitude', 'resp.rate', 'nesting', 'microphylla', 'oligodonta', 'sargentiana'),
           sep = '\t', convert = TRUE) %>%
  gather(species, leaf.len, microphylla:sargentiana) %>%
  # leaf.len is numeric after convert = TRUE, so test for missing values with is.na()
  filter(!is.na(leaf.len)) -> rowan_tidy
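The single column full of \t characters suggests the file is tab-delimited rather than comma-delimited. An alternative sketch, assuming that is the case, is to parse it with read_tsv() and reshape with pivot_longer() (tidyr 1.0.0 or later). The three-row excerpt below was typed in by hand from the printout above; raw, rowan_toy, and rowan_long are hypothetical names.

```r
library(readr)
library(tidyr)

# Hand-typed three-row excerpt of the printout above, as literal tab-delimited text
raw <- "altitude\tresp.rate\tnesting\tmicrophylla\toligodonta\tsargentiana
90\t0.041\ty\tNA\tNA\t28.6
93\t0.116\ty\t8.8\tNA\tNA
167\t0.074\tn\tNA\t11\tNA
"

# I() marks the string as literal data rather than a file path
rowan_toy <- read_tsv(I(raw))

rowan_long <- rowan_toy %>%
  pivot_longer(microphylla:sargentiana,
               names_to = 'species',
               values_to = 'leaf.len',
               values_drop_na = TRUE)  # drop the empty species slots while reshaping

rowan_long
```

With the real file, the first step would be something like read_tsv('practice_files/rowan2.csv'), which makes the separate() call unnecessary.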
Looking at your dataset, what questions come to mind? For example, in everyone’s favorite dataset iris, you might ask if petal width is different between the three iris species. Look at your dataset and come up with a question and write it down below.
Write your question here: Do different species live at different altitudes?
Think about your question. How can you visually represent the relevant data columns? Plot your data in the chunk below.
ggplot(rowan_tidy, aes(x = species, y = altitude, fill = species)) +
  geom_violin(alpha = 0.8) +
  scale_fill_manual(values = c('darkorange1', 'deepskyblue', 'firebrick'))
Use the appropriate hypothesis test (ex: t.test() or chisq.test()) to test your question.
pairwise.t.test(rowan_tidy$altitude, rowan_tidy$species) %>% tidy()
## # A tibble: 3 x 3
## group1 group2 p.value
## * <chr> <chr> <dbl>
## 1 oligodonta microphylla 0.176
## 2 sargentiana microphylla 0.483
## 3 sargentiana oligodonta 0.0492
Ask another question about your data!
Write your question here: Does nesting happen more often in one species vs another?
Make another figure in the chunk below visualizing the variables you asked your second question about.
ggplot(rowan_tidy, aes(x = species, fill = nesting)) +
  geom_bar(position = 'dodge') +
  scale_fill_manual(values = c('firebrick4', 'turquoise4')) +
  theme_classic()
And again, use an appropriate hypothesis test to test your idea.
rowan_tidy %>%
  group_by(nesting, species) %>%
  count() %>%
  ungroup() %>%
  spread(species, n) %>%
  column_to_rownames('nesting') %>%
  chisq.test(.) %>%
  tidy()
## Warning: Setting row names on a tibble is deprecated.
## # A tibble: 1 x 4
## statistic p.value parameter method
## <dbl> <dbl> <int> <chr>
## 1 89.1 4.42e-20 2 Pearson's Chi-squared test
Write a few sentences here that explain what you tested and why.