Hierarchical Clustering


Prep

To make ggplot plots of hierarchical clusters, we’ll need the ggplot extension package ggdendro. If you don’t already have it installed, uncomment the code in the chunk below and install it now.

#install.packages('ggdendro')


We’re going to use some simulated data from a normal distribution to demo hierarchical clustering. Run the code in the chunk below before continuing.

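# load the packages used throughout: tidyverse supplies tibble(), %>%,
# column_to_rownames(), and ggplot(); ggdendro supplies the dendrogram helpers
library(tidyverse)
library(ggdendro)
# fix the random seed so the simulated data is the same on every run
set.seed(42)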
tibble(x = rnorm(6, mean = 5, sd = 2), 
       y = x + runif(6),
       label = c('A', 'B', 'C', 'D', 'E', 'F')) -> clust_demo



Getting the data in the correct format with dist()

The clustering algorithm requires a distance matrix, which holds the distance between every pair of rows in the table. By default, dist() calculates the Euclidean distance.

We need to convert our data into a distance matrix using the function dist().

clust_demo %>% 
# make the tibble into a dataframe because we need to make the labels
# into rownames and tibbles don't allow rownames
  as.data.frame() %>%
# turn the data labels into rownames so they're carried through the distance
# matrix and hierarchical clustering calculations
  column_to_rownames('label') %>%
# use dist() to calculate a distance matrix
  dist(.) -> clust_demo_dist
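
To make concrete what dist() computes, here is a minimal sketch using two made-up points. The names pts, P, and Q are illustrative and not part of the demo data. It checks the default Euclidean distance against the same formula worked out by hand.

```r
# Two hypothetical points chosen so the distance is easy to verify
# (the sides of a 3-4-5 right triangle)
pts <- data.frame(x = c(0, 3), y = c(0, 4), row.names = c("P", "Q"))

# dist() uses method = "euclidean" by default and keeps the row names
d <- dist(pts)

# the same distance by hand: square root of the summed squared differences
m <- as.matrix(pts)
manual <- sqrt(sum((m["P", ] - m["Q", ])^2))

# as.matrix() expands the compact dist object into a full square matrix,
# which makes it easy to look up the distance between two named rows
as.matrix(d)["P", "Q"]
manual
```

Both values should agree, which is a handy sanity check that the row names were carried through correctly.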


Calculate the clustering with hclust()

The hclust() function does the hierarchical clustering calculations.

hclust(clust_demo_dist) -> clust_demo_hclust
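
By default, hclust() uses complete linkage (method = "complete"), so the call above is relying on that default. The sketch below uses a small made-up data set (toy is a hypothetical name, not part of the demo) to show how the method argument changes the merge heights, and how cutree() turns the tree into cluster assignments.

```r
# Five hypothetical 1-D values: two tight pairs and one outlier
toy <- data.frame(value = c(0, 1, 5, 6, 20),
                  row.names = c("a", "b", "c", "d", "e"))
toy_dist <- dist(toy)

# complete linkage (the default): distance between clusters is the
# largest distance between any two of their members
hc_complete <- hclust(toy_dist)

# average linkage: distance between clusters is the mean of all
# pairwise distances between their members
hc_average <- hclust(toy_dist, method = "average")

# $height stores the distance at which each merge happened
hc_complete$height
hc_average$height

# cutree() cuts the tree into k groups and returns a named vector
# giving the cluster each row belongs to
cutree(hc_complete, k = 3)
```

The two linkage methods agree on the early merges here but report different heights for the later ones, which is why the same data can produce differently shaped dendrograms.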


Plot the hierarchical clustering with base R

You can use the base R plot() to directly plot the hclust object.

plot(clust_demo_hclust)


Plot hierarchical clustering in ggplot using ggdendro

It requires more wrangling to plot the hclust clustering with ggplot, but the ggdendro package will do most of the wrangling for us.

# start with the saved hclust object
clust_demo_hclust %>% 
# as.dendrogram() turns the hclust results into a special dendrogram class 
# that R uses for representing any kind of tree
  as.dendrogram() %>% 
# dendro_data() turns the dendrogram class data into numbers that are plottable
  dendro_data() -> clust_demo_ggdendro


ggdendrogram()

Plot using ggdendro’s helper function, ggdendrogram(). It returns the dendrogram as a ggplot object.

ggdendrogram(clust_demo_ggdendro)

You can modify this like any other ggplot plot. For example, I don’t like the y axis labels, so I’ll remove them.

ggdendrogram(clust_demo_ggdendro) +
  theme(axis.text.y = element_blank())
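
ggdendrogram() also has a few arguments of its own. As a small sketch, the example below uses the rotate argument to draw the tree sideways. It uses the first six rows of R's built-in USArrests data as a stand-in for the demo data, and the names hc and p are illustrative.

```r
library(ggplot2)
library(ggdendro)

# stand-in data: the first six states from the built-in USArrests table
hc <- hclust(dist(USArrests[1:6, ]))

# ggdendrogram() accepts the hclust object directly;
# rotate = TRUE flips the tree so the leaf labels run down the side
p <- ggdendrogram(hc, rotate = TRUE)
p
```

A rotated tree is often easier to read when the leaf labels are long, as state names are.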


Plotting the dendrogram with straight ggplot()

The catch is that, unless you read the package source, you can’t see exactly which ggplot commands ggdendrogram() runs, which is super annoying when you want to make adjustments to the plot. Fortunately, as you can see in the dendro_data() output below, all the numbers are there, so we can just plot it ourselves.

clust_demo_ggdendro
## $segments
##         x         y  xend      yend
## 1  2.6875 5.9744985 1.500 5.9744985
## 2  1.5000 5.9744985 1.500 1.2034621
## 3  1.5000 1.2034621 1.000 1.2034621
## 4  1.0000 1.2034621 1.000 0.0000000
## 5  1.5000 1.2034621 2.000 1.2034621
## 6  2.0000 1.2034621 2.000 0.0000000
## 7  2.6875 5.9744985 3.875 5.9744985
## 8  3.8750 5.9744985 3.875 3.2020659
## 9  3.8750 3.2020659 3.000 3.2020659
## 10 3.0000 3.2020659 3.000 0.0000000
## 11 3.8750 3.2020659 4.750 3.2020659
## 12 4.7500 3.2020659 4.750 1.1513913
## 13 4.7500 1.1513913 4.000 1.1513913
## 14 4.0000 1.1513913 4.000 0.0000000
## 15 4.7500 1.1513913 5.500 1.1513913
## 16 5.5000 1.1513913 5.500 0.6038454
## 17 5.5000 0.6038454 5.000 0.6038454
## 18 5.0000 0.6038454 5.000 0.0000000
## 19 5.5000 0.6038454 6.000 0.6038454
## 20 6.0000 0.6038454 6.000 0.0000000
## 
## $labels
##   x y label
## 1 1 0     B
## 2 2 0     F
## 3 3 0     A
## 4 4 0     D
## 5 5 0     C
## 6 6 0     E
## 
## $leaf_labels
## NULL
## 
## $class
## [1] "dendrogram"
## 
## attr(,"class")
## [1] "dendro"


Now we can plot it with ggplot().

# for once, don't put any data in ggplot() !!!
# the stem and label information is in separate tables, so we want to supply
# separate data to separate geoms
ggplot() +
# the segments table contains the information for plotting branches, so supply
# that to geom_segment() to plot the branches of the tree
  geom_segment(data = clust_demo_ggdendro$segments, aes(x = x, y = y, xend = xend, yend = yend)) +
# the labels table has the labels for the ends of the branches, so supply that
# to geom_text() to label the ends of the branches
  geom_text(data = clust_demo_ggdendro$labels, aes(x = x, y = y, label = label), vjust = 1.25) +
  theme_dendro()
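
Once the plot is built from ordinary geoms, adjustments are just the usual ggplot arguments. The sketch below is one possible variation, not the only way to do it: it greys out the branches and rotates the leaf labels. It again uses the first six rows of the built-in USArrests data as a stand-in, and hc, dd, and p are illustrative names.

```r
library(ggplot2)
library(ggdendro)

# stand-in data, converted to plottable tables the same way as above
hc <- hclust(dist(USArrests[1:6, ]))
dd <- dendro_data(as.dendrogram(hc))

p <- ggplot() +
  # branches: same mapping as before, drawn in a lighter colour
  geom_segment(data = dd$segments,
               aes(x = x, y = y, xend = xend, yend = yend),
               colour = "grey40") +
  # labels: rotated 90 degrees and right-aligned so they hang below each leaf
  geom_text(data = dd$labels,
            aes(x = x, y = y, label = label),
            angle = 90, hjust = 1.1, size = 3) +
  # add extra space below y = 0 so the rotated labels are not clipped
  scale_y_continuous(expand = expansion(mult = c(0.4, 0.05))) +
  theme_dendro()
p
```

Because the segments and labels are supplied to separate geoms, each one can be styled independently without touching the other.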