Practice: Statistical Tests in R

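Before doing anything else, install the broom package in the console below. Once you’ve done that, load the library. The chunks in this practice also use the tidyverse and nycflights13 packages, so make sure those are loaded too.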

library(broom)

Practice Running the Code

Summary Statistics

Remember the flights dataset from the nycflights13 package? It has information on flights that departed in 2013 from the New York City metropolitan area airports: Newark (EWR), LaGuardia (LGA), and JFK. In the chunk below, find the mean, median, and standard deviation of the departure delay for each airport in the flights table, in a SINGLE line of code.

flights %>% 
  group_by(origin) %>% 
  na.omit() %>%
  summarize(avg_delay = mean(dep_delay),
            median_delay = median(dep_delay),
            sd_delay = sd(dep_delay))

## # A tibble: 3 x 4
##   origin avg_delay median_delay sd_delay
##   <chr>      <dbl>        <dbl>    <dbl>
## 1 EWR         15.0           -1     41.2
## 2 JFK         12.0           -1     38.8
## 3 LGA         10.3           -3     39.9
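One caveat about the chunk above: na.omit() drops every row that has an NA in any column, not only in dep_delay, so a flight with a known departure delay but a missing value elsewhere is excluded from the summary. Here is a minimal base R sketch of the difference. The table toy and its values are made up for illustration and are not taken from flights.

```r
# Hypothetical toy data: row 2 has a departure delay but no arrival delay.
toy <- data.frame(dep_delay = c(5, 10, NA, 45),
                  arr_delay = c(0, NA, NA, 50))

# na.omit() keeps only fully complete rows (rows 1 and 4).
mean_complete_rows <- mean(na.omit(toy)$dep_delay)

# na.rm = TRUE drops only the missing dep_delay value (row 3).
mean_dep_only <- mean(toy$dep_delay, na.rm = TRUE)

print(c(complete_rows = mean_complete_rows, dep_only = mean_dep_only))
```

Which behavior you want depends on the question; for a summary of departure delays alone, dropping only the rows where dep_delay is missing keeps more data.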

Why is there such a large difference between the mean and the median? Plotting a histogram of the departure delay answers the question. Can you explain it now?

ggplot(flights, aes(x = dep_delay)) + geom_histogram() + xlab('Departure Delay (mins)')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 8255 rows containing non-finite values (stat_bin).

Write your answer here (so you don’t forget later): The handful of very large delays pulls the mean well above the median, so the distribution is skewed to the right.
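To see the effect in isolation, here is a tiny sketch with six made-up delays (the vector delays is hypothetical, not drawn from flights). One extreme value moves the mean a long way while the median barely notices.

```r
# Hypothetical delays in minutes: mostly early or on time, one very late flight.
delays <- c(-3, -2, -1, 0, 2, 300)

delay_mean <- mean(delays)      # pulled upward by the single 300-minute delay
delay_median <- median(delays)  # depends only on the middle values

print(c(mean = delay_mean, median = delay_median))
```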

t Test

Suppose that on Thanksgiving, the average departure delay at NYC airports is 45 mins. Is this unusual compared with the rest of 2013? Test it with a one-sample t test.

t.test(flights$dep_delay, mu = 45)

## 
##  One Sample t-test
## 
## data:  flights$dep_delay
## t = -461.28, df = 328520, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 45
## 95 percent confidence interval:
##  12.50157 12.77657
## sample estimates:
## mean of x 
##  12.63907

Chi-Square Test

The Titanic dataset is built into R and gives survival information for passengers on the Titanic. You’ll need to practice your data wrangling skills to get the table into the correct form. Don’t forget to end by tidying the test with broom::tidy()! There’s a bit of code in the chunk below to get you started:

Titanic %>% 
  as_tibble() %>% 
  spread(Class, n) %>% 
  filter(Survived == 'Yes') %>% 
  select(-(Sex:Survived)) %>%
  chisq.test(.)

## Warning in chisq.test(.): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 290.84, df = 9, p-value < 2.2e-16
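The chunk above stops at chisq.test() and never reaches the broom::tidy() step the instructions ask for. The following is one possible sketch of that last step, built with base R indexing instead of the pipe; the names survivors, survivor_counts, and tidy_result are chosen here for illustration. The table is laid out with classes as rows rather than columns, which should not matter because the chi-squared statistic does not depend on orientation.

```r
library(broom)

# Survivors only: a Class x Sex x Age array of counts from the built-in table.
survivors <- Titanic[, , , "Yes"]

# Flatten to 4 classes (rows) by 4 Sex/Age groups (columns).
survivor_counts <- matrix(survivors, nrow = 4)

# chisq.test() may warn that the approximation is poor (some cells are small);
# tidy() turns the htest result into a one-row tibble.
tidy_result <- tidy(chisq.test(survivor_counts))
print(tidy_result)
```

The tidied result holds the statistic, the p-value, and the degrees of freedom (in the parameter column) as ordinary columns, which makes it easy to combine results from several tests.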

Correct for Multiple Testing

The chickwts dataset is built into R and looks at how different feed types affect baby chicks’ weight. Use a pairwise t test to test the differences between feed types, and don’t forget to correct for multiple testing.

pairwise.t.test(chickwts$weight, chickwts$feed, p.adjust.method = 'fdr') %>% tidy()

## # A tibble: 15 x 3
##    group1    group2         p.value
##  * <chr>     <chr>            <dbl>
##  1 horsebean casein    0.0000000155
##  2 linseed   casein    0.0000448   
##  3 meatmeal  casein    0.0570      
##  4 soybean   casein    0.00125     
##  5 sunflower casein    0.812       
##  6 linseed   horsebean 0.0228      
##  7 meatmeal  horsebean 0.0000280   
##  8 soybean   horsebean 0.000696    
##  9 sunflower horsebean 0.0000000123
## 10 meatmeal  linseed   0.0225      
## 11 soybean   linseed   0.219       
## 12 sunflower linseed   0.0000280   
## 13 soybean   meatmeal  0.199       
## 14 sunflower meatmeal  0.0360      
## 15 sunflower soybean   0.000696
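To get a feel for what the correction does, here is a small sketch that applies p.adjust() directly to three made-up p-values (raw_p is hypothetical). The 'fdr' option is the Benjamini-Hochberg method; 'holm' is what pairwise.t.test() uses when no method is given.

```r
# Three hypothetical raw p-values from three separate tests.
raw_p <- c(0.01, 0.02, 0.03)

adj_bonferroni <- p.adjust(raw_p, method = "bonferroni")  # multiply by the number of tests
adj_holm <- p.adjust(raw_p, method = "holm")              # step-down version of Bonferroni
adj_fdr <- p.adjust(raw_p, method = "fdr")                # Benjamini-Hochberg

print(rbind(raw = raw_p, bonferroni = adj_bonferroni,
            holm = adj_holm, fdr = adj_fdr))
```

Bonferroni is the most conservative of the three, and the fdr adjustment the least, which is why the choice of method can change which comparisons stay below 0.05.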

Pick the Test for the Data

Each question in this section needs a statistical test. Pick the appropriate test for the data and run it.

Question 1: In the flights dataset again, is the average departure delay the same as the average arrival delay?

flights %>% 
  na.omit() %>%
  summarize(avg_dep_delay = mean(dep_delay), avg_arr_delay = mean(arr_delay))

## # A tibble: 1 x 2
##   avg_dep_delay avg_arr_delay
##           <dbl>         <dbl>
## 1          12.6          6.90
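The chunk above compares the two means by eye but does not run a test. Because every flight has both a departure delay and an arrival delay, one reasonable choice is a paired t test. This is a sketch of that choice, not the only valid answer; the names both_delays and paired_result are introduced here for illustration.

```r
library(nycflights13)

# Keep flights where both delays were recorded, so the pairs line up.
both_delays <- na.omit(flights[, c("dep_delay", "arr_delay")])

# Paired test: each flight contributes one departure and one arrival delay.
paired_result <- t.test(both_delays$dep_delay, both_delays$arr_delay,
                        paired = TRUE)
print(paired_result)
```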

Question 2: The weather dataset is also in the nycflights13 package, and gives weather information at the three NYC airports from 2013. In the weather dataset, is wind speed significantly different between the airports?

pairwise.t.test(weather$wind_speed, weather$origin) %>% tidy()

## # A tibble: 3 x 3
##   group1 group2  p.value
## * <chr>  <chr>     <dbl>
## 1 JFK    EWR    5.42e-54
## 2 LGA    EWR    4.13e-19
## 3 LGA    JFK    5.38e-11

Question 3: Do flights from different carriers fly about the same distances?

pairwise.t.test(flights$distance, flights$carrier, p.adjust.method = 'fdr') %>% tidy()

## # A tibble: 120 x 3
##    group1 group2  p.value
##  * <chr>  <chr>     <dbl>
##  1 AA     9E     0.      
##  2 AS     9E     0.      
##  3 B6     9E     0.      
##  4 DL     9E     0.      
##  5 EV     9E     6.29e-11
##  6 F9     9E     0.      
##  7 FL     9E     1.66e-33
##  8 HA     9E     0.      
##  9 MQ     9E     3.24e-12
## 10 OO     9E     7.77e- 1
## # ... with 110 more rows

Question 4: In the built-in HairEyeColor dataset, are there equal numbers of students across the four hair colors (black, blond, brown, and red)?

HairEyeColor %>% 
  as_tibble() %>% 
  spread(Hair, n) %>%
  select(-Eye, -Sex) %>%
  chisq.test(.)

## Warning in chisq.test(.): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 163.56, df = 21, p-value < 2.2e-16
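Note that the test above treats the data as an 8 x 4 contingency table, so it asks whether hair color is independent of the eye color and sex groups. If the question is read literally (are the four hair color totals equal?), a goodness-of-fit test on the totals is a closer match. A minimal sketch under that reading, with hair_totals and hair_gof as illustrative names:

```r
# Total students per hair color, summed over eye color and sex.
hair_totals <- margin.table(HairEyeColor, 1)

# Given a single vector of counts, chisq.test() tests against equal
# expected proportions by default.
hair_gof <- chisq.test(hair_totals)
print(hair_gof)
```

With four categories this test has 3 degrees of freedom, compared with 21 for the contingency-table version.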

Question 5: Is the variation in departure delays the same at all the airports in the flights table?

flights %>% 
  group_by(origin) %>% 
  na.omit() %>% 
  summarize(sd_dep_delay = sd(dep_delay))

## # A tibble: 3 x 2
##   origin sd_dep_delay
##   <chr>         <dbl>
## 1 EWR            41.2
## 2 JFK            38.8
## 3 LGA            39.9
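The summary above shows the three standard deviations but does not test whether they differ. One option is a test for equal variances across groups. The sketch below uses fligner.test() from base R because departure delays are strongly skewed; bartlett.test() is the classic alternative but assumes roughly normal data. The choice of test and the name variance_result are assumptions made here, not part of the original exercise.

```r
library(nycflights13)

# Fligner-Killeen test: the null hypothesis is that dep_delay has the same
# variance at every origin airport. Flights with a missing dep_delay are
# dropped by the function itself.
variance_result <- fligner.test(flights$dep_delay, flights$origin)
print(variance_result)
```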