broom
To practice the new tests we’ve learned, we’ll use the built-in mtcars dataset.
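The chunks below assume the packages from earlier in the lesson are already loaded; a minimal setup sketch would be:
library(tidyverse)   # ggplot2, dplyr, etc. for plotting and wrangling
library(magrittr)    # provides the %$% exposition pipe used below
library(broom)       # tidy(), glance(), and augment()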
What is the correlation between horsepower and weight? Is it significant?
mtcars %$% cor.test(hp, wt) %>% tidy()
## # A tibble: 1 x 8
## estimate statistic p.value parameter conf.low conf.high method
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr>
## 1 0.659 4.80 4.15e-5 30 0.403 0.819 Pears…
## # ... with 1 more variable: alternative <chr>
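If the %$% exposition pipe isn’t available, with() gives an equivalent way to run the same test; this is just an alternative sketch of the call above:
with(mtcars, cor.test(hp, wt)) %>% tidy()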
Plot a scatterplot of the rear axle ratio vs. engine displacement. Does the relationship look linear?
ggplot(mtcars, aes(x = drat, y = disp)) +
geom_point(size = 3) +
labs(x = 'Rear axle ratio', y = 'Displacement (cu. in.)') +
theme_classic()
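One optional way to judge linearity by eye is to overlay a least-squares line on the same scatterplot; a sketch (not part of the original exercise):
ggplot(mtcars, aes(x = drat, y = disp)) +
geom_point(size = 3) +
geom_smooth(method = 'lm', se = FALSE) +   # straight-line fit for reference
labs(x = 'Rear axle ratio', y = 'Displacement (cu. in.)') +
theme_classic()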
Since the relationship looks approximately linear, test it with lm().
lm(disp ~ drat, data = mtcars) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 823. 108. 7.60 0.0000000179
## 2 drat -165. 29.8 -5.53 0.00000528
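To see how much of the variation in displacement the rear axle ratio actually explains, glance() on the same model reports R-squared and other model-level statistics; a quick sketch:
lm(disp ~ drat, data = mtcars) %>% glance()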
Do an ANOVA of quarter mile time over the number of cylinders in the engine.
aov(qsec ~ cyl, data = mtcars) %>% tidy()
## # A tibble: 2 x 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 cyl 1 34.6 34.6 16.1 0.000366
## 2 Residuals 30 64.4 2.15 NA NA
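Because tidy() returns an ordinary tibble, single values are easy to extract with dplyr; for example, pulling out the p-value for the cyl term (a sketch):
aov(qsec ~ cyl, data = mtcars) %>%
tidy() %>%
filter(term == 'cyl') %>%   # keep the row for the cyl term
pull(p.value)               # extract the p-value as a plain number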
Now take the same ANOVA from the chunk above and add a post-hoc Tukey test to it. Note that TukeyHSD() needs cyl converted to a factor.
aov(qsec ~ as.factor(cyl), data = mtcars) %>% TukeyHSD() %>% tidy()
## # A tibble: 3 x 6
## term comparison estimate conf.low conf.high adj.p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 as.factor(cyl) 6-4 -1.16 -2.94 0.619 0.257
## 2 as.factor(cyl) 8-4 -2.37 -3.85 -0.883 0.00133
## 3 as.factor(cyl) 8-6 -1.20 -2.91 0.498 0.205
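Since the tidied Tukey output is also a tibble, the pairwise differences and their confidence intervals can be plotted directly; a sketch, assuming the same broom version as above (where the pair labels live in the comparison column):
aov(qsec ~ as.factor(cyl), data = mtcars) %>%
TukeyHSD() %>%
tidy() %>%
ggplot(aes(x = comparison, y = estimate)) +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +   # difference with its 95% CI
geom_hline(yintercept = 0, linetype = 'dashed') +           # zero means no difference
labs(x = 'Cylinder comparison', y = 'Difference in quarter mile time (s)')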
Pick one of the three tests you did above and practice tidying it up with the three broom functions!
tidy()
Add tidy() onto the end of your test.
aov(qsec ~ cyl, data = mtcars) %>% tidy()
## # A tibble: 2 x 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 cyl 1 34.6 34.6 16.1 0.000366
## 2 Residuals 30 64.4 2.15 NA NA
glance()
Add glance() onto the end of your test to look at the model-level statistics.
aov(cyl ~ qsec, data = mtcars) %>% glance()
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## * <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.350 0.328 1.46 16.1 3.66e-4 2 -56.6 119. 124.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
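As with tidy(), the one-row tibble from glance() makes it easy to grab a single model statistic, e.g. the R-squared (a sketch):
aov(cyl ~ qsec, data = mtcars) %>%
glance() %>%
pull(r.squared)   # proportion of the variance in cyl explained by qsec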
augment()
Add the results of the test back to mtcars using augment().
aov(cyl ~ qsec, data = mtcars) %>% augment(mtcars)
## # A tibble: 32 x 19
## .rownames mpg cyl disp hp drat wt qsec vs am gear
## * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4
## 2 Mazda RX… 21 6 160 110 3.9 2.88 17.0 0 1 4
## 3 Datsun 7… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
## 4 Hornet 4… 21.4 6 258 110 3.08 3.22 19.4 1 0 3
## 5 Hornet S… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
## 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3
## 7 Duster 3… 14.3 8 360 245 3.21 3.57 15.8 0 0 3
## 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4
## 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
## 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
## # ... with 22 more rows, and 8 more variables: carb <dbl>, .fitted <dbl>,
## # .se.fit <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## # .std.resid <dbl>
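Passing mtcars to augment() keeps every original column alongside the new .fitted, .resid, and related columns; calling augment() on the model alone returns only the variables used in the formula plus those diagnostics. A sketch of the difference:
aov(cyl ~ qsec, data = mtcars) %>% augment()   # only cyl, qsec, and the diagnostic columns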
Use the augmented table from the chunk above to plot something.
aov(cyl ~ qsec, data = mtcars) %>% augment(mtcars) %>%
ggplot(aes(x = qsec, y = .resid, color = cyl)) +
geom_point()
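A horizontal reference line at zero can make patterns in the residuals easier to spot; an optional sketch:
aov(cyl ~ qsec, data = mtcars) %>% augment(mtcars) %>%
ggplot(aes(x = qsec, y = .resid, color = cyl)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 'dashed')   # residuals should scatter evenly around zero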
Use the tests talked about this week, cor(), cor.test(), lm(), aov(), and/or TukeyHSD(), to ask two more questions about the mtcars dataset.
Look at mtcars again. What’s another question you could ask about it? Plot the variable(s) you’re interested in below.
ggplot(mtcars, aes(x = mpg, color = as.factor(vs))) +
geom_density(size = 1) +
labs(x = 'miles per gallon', y = '') +
scale_color_discrete(name = 'engine shape', labels = c('v-shaped', 'straight')) +
theme_classic()
Based on the variables you just plotted, what’s your question?
Does engine shape affect the number of miles per gallon a car gets?
Pick one of the tests to answer your question and run it in the chunk below.
aov(mpg ~ as.factor(vs), data = mtcars) %>% tidy()
## # A tibble: 2 x 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 as.factor(vs) 1 497. 497. 23.7 0.0000342
## 2 Residuals 30 630. 21.0 NA NA
What do you conclude from your test?
With a p-value of 0.0000342, it’s unlikely that the difference in mpg between the two engine shapes is due to chance.
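With only two engine shapes, a two-sample t-test is an equivalent way to ask the same question; a sketch for comparison:
t.test(mpg ~ vs, data = mtcars) %>% tidy()   # Welch two-sample t-test of mpg by engine shape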
Look at mtcars again. What’s another question you could ask about it? Plot the variable(s) you’re interested in below.
ggplot(mtcars, aes(x = disp, y = hp)) +
geom_point(size = 3) +
labs(x = 'displacement (cu. in.)', y = 'gross horsepower') +
theme_classic()
Based on the variables you just plotted, what’s your question?
Can displacement predict horsepower?
Pick one of the tests to answer your question and run it in the chunk below.
lm(hp ~ disp, data = mtcars) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 45.7 16.1 2.84 0.00811
## 2 disp 0.438 0.0618 7.08 0.0000000714
What do you conclude from your test?
With a p-value of 0.00811 for the intercept and 7.14e-8 for the slope, displacement is a significant predictor of a car’s horsepower.
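To quantify how well it predicts, glance() on the same model reports the R-squared (a sketch):
lm(hp ~ disp, data = mtcars) %>%
glance() %>%
pull(r.squared)   # share of the variance in horsepower explained by displacement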