DESEq2
We will be conducting differential expression analysis with the goal of identifying genes that have changed their activity in response to some perturbation (in our example data, treatment with DAC) using the R Bioconductor package DESeq2
. DESeq2
is a popular and user-friendly package to perform differential expression using negative binomical generalized linear models. This lesson will cover the basics, but for more details you can check the DESeq2
Vignette https://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#annotating-and-exporting-results
The data we have been using for this RNA-seq tutorial is from a former PhD student in the lab, Sandra Deliard. We have two sets of triplicates both in YB5 cells, control (siC) and treated with DAC (dac). DAC is a DNA methyltransferase inhibitor (DNMT) and treatment with it results in global hypomethylation and increased expression. Details for the sample names are given in the table below.
Sample ID | treatment | replicate |
---|---|---|
dac1 | 5-aza-2’-deoxyctyidine, aka decitabine, DAC | 1 |
dac2 | 5-aza-2’-deoxyctyidine, aka decitabine, DAC | 2 |
dac3 | 5-aza-2’-deoxyctyidine, aka decitabine, DAC | 3 |
siCl | control | 1 |
siC2 | control | 2 |
siC3 | control | 3 |
If you don’t have any of the necessary packages installed, uncomment the install lines in the chunk below and install them.
### tidyverse
#install.packages('tidyverse')
### conflicted
# Conflicted allows you to explictly manage namespace conflicts (happens when
# two functions have the same name)
#install.packages('conflicted')
### vroom
# The vroom package has one function, vroom, for fast and lazy data read in.
# Vroom works like readr, but is much faster with a 55-60x speedup.
#install.pacakges('vroom')
### DESeq2
# Package to do differential expression
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#
#BiocManager::install("DESeq2")
### Fastcluster
# Faster implementation of the hclust() function for hierarchical clustering
#install.packages('fastcluster')
### ggdendro
# This is an extension to ggplot2 to allow you to easily plot dendrograms
#install.packages('ggdendro')
### AnnotationDbi
# Bioconductor package that interfaces between R and Bioconductor's SQLite-based
# biological data annotation packages
#BiocManager::install("AnnotationDbi")
### org.Hs.eg.db
# Bioconductor package that contains the current human (Hs) annotations; if you
# want annotations for a different version of the reference genome, you need a
# different package
#BiocManager::install("org.Hs.eg.db")
# get list of files
list.files(path = '../03_count',
pattern = '.txt$',
full.names = T) -> files
# read files in recursively
tibble(filename = files) %>%
# loop aka map over the file paths and read the files in
mutate(file_contents =
purrr::map(filename,
# use vroom::vroom to read the files in fast and lazily. Each file will be
# placed in the table as a nested subtable
~ vroom::vroom(.,
# delim = '\t' tells vroom that the files are tab separated
# comment = '#' and skip = 2 tells vroom to skip the comment lines and the
# column name line and start reading in the files in where the data starts at
# line 2. This is necessary because the files do not have the same column names,
# so we will tell vroom what they are manually.
delim = '\t', comment = '#', skip = 2,
# Give vroom the column names so the files all have the same column names
col_names = c('ensembl_gene_id', 'chr',
'start', 'end', 'strand',
'length', 'count')),),
# Add the sample ID as an additional column using str_extract(), which pulls a
# string based on the regular expression you give it. Here I asked for strings
# starting with either d or s, followed by any two letters of the alphabet
# either upper or lower case followed by any number
sample_id = str_extract(filename, '[d,s][A-z]{2}[0-9]'),
# Add the treatment by checking whether the sample ID starts with dac; if it
# does that sample is labelled as treatment, otherwise its labelled as control
treatment = ifelse(str_detect(sample_id, 'dac'),
'treatment', 'control')) %>%
# drop the file path because we don't need it anymore.
dplyr::select(-filename) %>%
# Unpack the subtables so we have one giant rectangular table.
tidyr::unnest() -> data
## Rows: 835
## Columns: 7
## Delimiter: "\t"
## chr [5]: ensembl_gene_id, chr, start, end, strand
## dbl [2]: length, count
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
## Rows: 835
## Columns: 7
## Delimiter: "\t"
## chr [5]: ensembl_gene_id, chr, start, end, strand
## dbl [2]: length, count
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
## Rows: 835
## Columns: 7
## Delimiter: "\t"
## chr [5]: ensembl_gene_id, chr, start, end, strand
## dbl [2]: length, count
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
## Rows: 835
## Columns: 7
## Delimiter: "\t"
## chr [5]: ensembl_gene_id, chr, start, end, strand
## dbl [2]: length, count
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
## Rows: 835
## Columns: 7
## Delimiter: "\t"
## chr [5]: ensembl_gene_id, chr, start, end, strand
## dbl [2]: length, count
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
## Rows: 835
## Columns: 7
## Delimiter: "\t"
## chr [5]: ensembl_gene_id, chr, start, end, strand
## dbl [2]: length, count
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
## Warning: `cols` is now required.
## Please use `cols = c(file_contents)`
DESeq2
DESeq2
requires that the data be in an S4 object of class DESeqDataSet
before you can use its functions to conduct differential expression. DESeq2
and Bioconductor pacakges in general like to use S4 objects because S4s have very strict definitions, which prevent users from making naive mistakes in creating and modifying them. S4 objects have slots that can be filled with whatever information the creator specifies. For a DESeqDataSet
, a custom S4 class, this includes, but is not limited to, the raw RNA-seq counts, metadata (ex: treatement, batch) from the experiment, experimental design, and the results of the differential expression analysis.
TL;DR DESeq2
requires data to be in a specific and semi-custom format before you can calculate differential expression, so we’ll put the data into that format in this section.
Here we’re going to make two tables; a count matrix with genes as rows, samples as columns and read counts as the data, and a metadata table with the sample IDs and any other relevant information. Here the information is just treatment, but it could include other covariates like sex or batch. NOTE: The columns in the count matrix and the rows in the metadata table MUST BE THE SAME and IN THE SAME ORDER or the information in the count matrix and the metadata tables will not be matched up properly.
data %>%
dplyr::select(ensembl_gene_id, sample_id, count) %>%
spread(sample_id, count) %>%
as.data.frame() -> count_matrix
data %>%
dplyr::select(sample_id, treatment) %>%
distinct() -> metadata
Use the helper function DESeqDataSetFromMatrix()
to put our data into a DESeqDAtaSet
object.
# given the function the count and metadata tables
dds <- DESeqDataSetFromMatrix(countData = count_matrix,
colData = metadata,
# tidy = T says that our data is in tidy format and the first columns of the
# tables should be taken as row.names
tidy = TRUE,
# Give the design of the experiment, here just taking treatment into account
design = ~ treatment)
## converting counts to integer mode
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
It’s standard to pre-filter data for hypothesis tests because you don’t want to include genes that for example, have no counts detected in any samples or have extrememly low counts in many samples and have to have an increased multiple testing correction.
### filter out rows that contain only zero counts
keep <- rowSums(counts(dds)) > 1
dds <- dds[keep, ]
### NOT DOING, but example
# Filter for genes where at least three samples have a count of 10 or higher
# keep <- rowSums(counts(dds) >= 10) >= 3
# dds <- dds[keep, ]
For whatever reason, sometimes one or a few samples can have divergent measurements from the rest of the data. We want to check for that before doing differential expression so we can correct for it, for example by including batch as a covariate in our model or using a package like RUVSeq
to perform further specialized normalizations.
The first way we’re going to visualize the similarity between our samples is by creating a tree illustrating it using hierarchical clustering.
# calculate the distances and clustering for the tree
counts(dds) %>%
# Whatever you want to visualize on the branches needs to be on the rows of the
# table, so swap rows and columns
t() %>%
# hclust() requires the distance between samples to be precalculated
dist(.) %>%
# calculate the clusterin
fastcluster::hclust() %>%
# transforms the hclust object into the dendrogram class for tree structures
as.dendrogram() %>%
# From the ggdendro package, a helper function that transforms the dendrogram
# object into a group of tables that can be used to plot the tree with ggplot
dendro_data() -> clust_dds
# with the ggdendro plotting function
ggdendrogram(clust_dds)
Plot it with modifications using ggplot2
. We’re going to add color to the sample names and play with scale
# rewrite the label table from the ggdendro object so you can add a column to
# color by
clust_dds$labels %>%
mutate(treatment = ifelse(str_detect(label, 'dac'),
'DAC', 'control'),
y = -25) -> clust_labels
# plot with ggplot2
ggplot() +
geom_segment(data = clust_dds$segments,
aes(x = x, y = y, xend = xend, yend = yend)) +
geom_text(data = clust_labels,
aes(x = x, y = y, label = label, color = treatment),
vjust = 1.2, size = 6) +
scale_color_manual(values = c('gold3', 'hotpink3')) +
coord_cartesian(ylim = c(-100, max(clust_dds$segments$yend))) +
theme_classic() +
theme(legend.position = 'none',
axis.title = element_blank(),
axis.line = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank())
For our second check, we’ll use Principal Component Analysis (PCA) to see how well our observed variation corresponds with our variables of interest. We’ll do this in 2 ways, first using the built-in functions that DESeq2
has for PCA analysis and second using standard R functions.
DESEq2
FunctionsDESeq2
stores raw (non-normalized) counts in the DESeqDataSet
object, but requires that those counts be normalized before calculating the PCA. RNA-seq counts follow a negative binomial distribution instead of a normal distribution, so they need to be normalized before man
### must use normalized counts for the DESeq PCA function
# Why? See the negative binomial distribution; most genes have low counts with
# a few genes having extremely high counts
ggplot(data, aes(x = count)) +
geom_density() +
theme_classic(base_size = 20)
# normalize the counts
vsd <- varianceStabilizingTransformation(dds, blind = FALSE)
## -- note: fitType='parametric', but the dispersion trend was not well captured by the
## function: y = a/x + b, and a local regression fit was automatically substituted.
## specify fitType='local' or 'mean' to avoid this message next time.
DESeq2
doesn’t have a separate function for calculating the PCA, it calculates and plots it in a single step.
# use the DESeq function to plot PCA
plotPCA(vsd, intgroup = 'treatment')
# this is a ggplot object, so we can modify it like a normal ggplot
plotPCA(vsd, intgroup = 'treatment') +
geom_point(size = 6) +
scale_color_manual(values = c('gold3', 'hotpink4')) +
theme_classic()
While technically data should be normally distributed before calculating the PCA, it can be done without normalization. If you’re curious about the differences (usually minimal and definitely neglible for this data), you can can install the glmpca
package and calculate the PCA using glmpca::glmpca()
which explicitly corrects for non-normally distributed data.
# Calculate the PCA
prcomp(t(counts(dds))) -> dds_prcomp
# plot
dds_prcomp$x %>%
as.data.frame() %>%
rownames_to_column('sample_id') %>%
mutate(treatment = ifelse(str_detect(sample_id, 'dac'), 'DAC', 'control')) %>%
ggplot(aes(x = PC1, y = PC2)) +
geom_point(aes(color = treatment), size = 6) +
scale_color_manual(values = c('gold3', 'hotpink4')) +
theme_classic(base_size = 20) +
theme(legend.position = 'top')
DESeq2
calculates differential expression by fitting a negative bionomial generalized linear model with the design you gave the object in the treatment argument. Corrects for multiple testing using the Benjamini-Hotchberg (BH) correction.
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## -- note: fitType='parametric', but the dispersion trend was not well captured by the
## function: y = a/x + b, and a local regression fit was automatically substituted.
## specify fitType='local' or 'mean' to avoid this message next time.
## final dispersion estimates
## fitting model and testing
Before extracting the differential expression results from the DESeqDataSet
object, we want to get human readable gene names. Our gene names are currently given as Ensembl IDs. However, the HUGO Gene Nomenclature Committee sets standard human readable gene names that we’re going to add to our data, so we can look at the differentally expressed genes and get an idea of what’s affected. Bioconductor has the AnnotationDbi
package, which is an interface to query Bioconductor’s SQLite-based annotation data packages. Here we’ll use AnnotationDbi
functions to query the org.Hs.eg.db
package, which has the human genome annotations
# check which annotations are available.
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GO" "GOALL" "IPI" "MAP" "OMIM"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" "UNIGENE"
## [26] "UNIPROT"
# use the Ensembl IDs to find the corresponding HGNC IDs
mapIds(org.Hs.eg.db,
keys = rownames(results(dds)),
column = 'SYMBOL',
keytype = 'ENSEMBL',
multiVals = 'first') %>%
enframe(name = 'ensembl_gene_id', value = 'gene') -> gene_names
## 'select()' returned 1:many mapping between keys and columns
Wrangle the differential expression result object to get a rectangular table.
results(dds) %>%
# Convert to a rectangular, tidy table with no rownames
as.data.frame() %>%
rownames_to_column('ensembl_gene_id') %>%
# Join in the HGNC gene names
left_join(gene_names, by = 'ensembl_gene_id') %>%
# Reorder for readability
dplyr::select(gene, ensembl_gene_id, everything()) %>%
# Add columns indicating statistical significance and logging p-values; this
# will be useful in a minute when we plot the results.
mutate(sig = ifelse(padj < 0.05 & log2FoldChange >= 1,
'sig', 'notsig'),
log_qvalue = -log10(padj)) %>%
# NA is assigned to p-values that are outliers, replace in the log_qvalue
# column with the nonsignificant 0
replace_na(list(sig = 'notsig', log_qvalue = 0)) -> diff_exp_tbl
Save the results of the differential expression test.
#write_tsv(diff_exp_tbl, 'diff_exp.tsv')
An MA plot is a type of scatterplot with average values plotted on the x-axis and the difference in values plotted on the y-axis. As standard for RNA-seq, here we’ll plot the mean expression count on the x-axis and the log2 fold change in expression on the y-axis for each gene. This lets us examine how gene expression changed between groups relative to their overall expression.
# Create labels for the number and percentage of significantly up- and down-
# regulated genes
diff_exp_tbl %>%
mutate(direction = ifelse(log2FoldChange < 0, 'down', 'up')) %>%
group_by(direction, sig) %>%
dplyr::count() %>%
ungroup() %>%
complete(direction, sig, fill = list(n = 0)) %>%
na.omit() %>%
filter(sig == 'sig') %>%
mutate(label = paste0(n, ', ', round((n / nrow(diff_exp_tbl)), 1), '%'),
baseMean = 1750,
log2FoldChange = c(-3, 4)) -> ma_labels
# plot
ggplot(diff_exp_tbl, aes(x = baseMean, y = log2FoldChange)) +
geom_point(aes(color = sig)) +
scale_color_manual(values = c('gray30', 'firebrick3')) +
geom_hline(yintercept = 0, color = 'gray60', linetype = 'dashed') +
geom_text(data = ma_labels, aes(label = label), size = 8) +
labs(x = 'Mean Expression (Counts)', y = 'Log2 Fold Change') +
theme_classic(base_size = 20) +
theme(legend.position = 'none')
A volcano plot is a type of scatterplot with the difference in values plotted on the x-axis and the significance of those values plotted on the y-axis. As standard for RNA-seq, here we’ll plot the log2 fold change in expression on the x-axis and the negative log10 corrected p-values (q-values) on the y-axis for each gene. We plot negative log10 significance values for better visualization because now the most signficant values appear at the top of the plot. The volcano plot highlights the most significant most change values.
# Create labels for the number and percentage of significantly up- and down-
# regulated genes
diff_exp_tbl %>%
mutate(direction = ifelse(log2FoldChange < 0, 'down', 'up')) %>%
group_by(direction, sig) %>%
dplyr::count() %>%
ungroup() %>%
complete(direction, sig, fill = list(n = 0)) %>%
na.omit() %>%
filter(sig == 'sig') %>%
mutate(label = paste0(n, ', ', round((n / nrow(diff_exp_tbl)), 1), '%'),
log2FoldChange = c(-3.5, 3.5),
log_qvalue = 45) -> volc_labels
# plot
ggplot(diff_exp_tbl, aes(x = log2FoldChange, y = log_qvalue)) +
geom_point(aes(color = sig)) +
scale_color_manual(values = c('gray30', 'firebrick3')) +
geom_hline(yintercept = -log10(0.05), color = 'gray60', linetype = 'dashed') +
geom_vline(xintercept = c(-1, 1), color = 'gray60', linetype = 'dashed') +
geom_text(data = volc_labels, aes(label = label), size = 8) +
labs(x = 'Log2 Fold Change', y = '-Log10 QValue') +
coord_cartesian(xlim = c(-5, 5)) +
theme_classic(base_size = 20) +
theme(legend.position = 'none')