splSetToGet <- "PBMMC,ETV6-RUNX1"
splSetVec <- unlist(strsplit(splSetToGet, ","))

nbPcToComp <- 50
figSize <- 7
library(tidyverse) # data wrangling and plotting (ggplot2)
library(scater) # scRNA-seq QC
library(scran) # scRNA-seq normalisation
library(bluster) # scRNA-seq clustering
library(dynamicTreeCut) # tree cutting in clustering
library(cluster) # for silhouette
library(igraph) # for graph-based clustering and plotting networks 
library(reticulate) # for graph-based clustering with leiden
library(leiden) # for community detection (clustering)
library(pheatmap) # for heatmap plotting
library(patchwork) # to combine plots
fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))

1 Clustering

Source: the clustering chapter in the Hemberg group material and its ‘biocellgen variant’, from which some text is copied with only minor edits. Also see the OSCA chapter on clustering and a benchmark study.

Once we have normalized the data and removed confounders we can carry out analyses that are relevant to the biological questions at hand. The exact nature of the analysis depends on the data set. One of the most promising applications of scRNA-seq is de novo discovery and annotation of cell types based on transcription profiles. This requires identifying groups of cells with similar transcriptomes without any prior knowledge of the labels, i.e. unsupervised clustering. To avoid the challenges caused by the noise and high dimensionality of scRNA-seq data, clustering is performed after feature selection and dimensionality reduction, usually on the PCA output.

We will introduce three widely used clustering methods: 1) hierarchical, 2) k-means and 3) graph-based clustering. The first two were developed earlier and are faster for small data sets, while the third is more recent and better suited to scRNA-seq, especially large data sets. All three identify non-overlapping clusters.

We will apply them to the denoised log-expression values on the data set studied and measure clustering quality.

1.1 Load data

# Read mnn.out object in:
sce <- readRDS("../Robjects/DataIntegration_mnn.out.Rds") # auto.merge

Number of cells: 3500.

Number of genes: 8149.

1.2 Clustering cells into putative subpopulations

1.2.1 Hierarchical clustering

Hierarchical clustering builds a hierarchy of clusters yielding a dendrogram that groups together cells with similar expression patterns across the chosen genes.

There are two types of strategies:

  • Agglomerative (bottom-up): each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • Divisive (top-down): all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
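To make the two strategies concrete, the sketch below contrasts them on a small made-up matrix, using base R’s hclust() for the agglomerative approach and diana() from the cluster package (loaded above) for the divisive one; the toy data are invented for illustration only.

# toy illustration of the two strategies (made-up data):
set.seed(42)
toy <- matrix(rnorm(60), ncol = 2) # 30 points in 2 dimensions
toy.dist <- dist(toy)
hc.aggl <- hclust(toy.dist, method = "ward.D2") # agglomerative (bottom-up)
hc.divi <- cluster::diana(toy.dist)             # divisive (top-down)
par(mfrow = c(1, 2))
plot(hc.aggl, main = "agglomerative", labels = FALSE, xlab = "", sub = "")
cluster::pltree(hc.divi, main = "divisive")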

The raw data:

knitr::include_graphics("../Images/bioCellGenHierar1.png")

The hierarchical clustering dendrogram:

knitr::include_graphics("../Images/bioCellGenHierar2.png")

  • Pros:
    • deterministic method
    • returns partitions at all levels along the dendrogram
  • Cons:
    • computationally expensive: time and memory requirements increase with the square of the number of data points

1.2.1.1 Clustering

Here we will apply hierarchical clustering on the Euclidean distances between cells, using the Ward D2 criterion to minimize the total variance within each cluster.

# get PCs
pcs <- reducedDim(sce, "corrected")
# compute distance
pc.dist <- dist(pcs)
# derive tree
hc.tree <- hclust(pc.dist, method="ward.D2")
hcd <- as.dendrogram(hc.tree)

The dendrogram below shows each cell as a leaf.

plot(hcd, type = "rectangle", ylab = "Height", leaflab = "none")

Clusters are identified in the dendrogram using the shape of branches as well as their height (‘dynamic tree cut’, Langfelder et al. 2008, doi:10.1093/bioinformatics/btm563).

# identify clusters by cutting branches, requesting a minimum cluster size of 20 cells.
hc.clusters <- unname(cutreeDynamic(hc.tree,
                    distM = as.matrix(pc.dist),
                    minClusterSize = 20,
                    verbose = 0))

Cell counts for each cluster (rows) and each sample group (columns):

# per sample group
table(hc.clusters, sce$SampleGroup)
##            
## hc.clusters ETV6-RUNX1 HHD PBMMC PRE-T
##          1         675   0    16     0
##          2         442   0   210     0
##          3         221   0   424     0
##          4         235   0   122     0
##          5         169   0   165     0
##          6         143   0   138     0
##          7           6   0   185     0
##          8          68   0   117     0
##          9          15   0    76     0
##          10         26   0    47     0

Cell counts for each cluster (rows) and each sample (columns):

# per sample
table(hc.clusters, sce$SampleName)
##            
## hc.clusters ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 PBMMC_1b
##          1           268          190           56          161        6
##          2            38          214          136           54       70
##          3             4           16          179           22      190
##          4            61           55           28           91       83
##          5           128           10            7           24       81
##          6             0            1           23          119        0
##          7             0            2            0            4       43
##          8             0            9           49           10        8
##          9             1            3            7            4       17
##          10            0            0           15           11        2
##            
## hc.clusters PBMMC_2 PBMMC_3
##          1        6       4
##          2       27     113
##          3      118     116
##          4        9      30
##          5       44      40
##          6      130       8
##          7       49      93
##          8       47      62
##          9       31      28
##          10      39       6

These counts may also be shown as a heatmap.

tmpTab <- table(hc.clusters, sce$SampleName)
rownames(tmpTab) = paste("hc", rownames(tmpTab), sep = "_")

# columns annotation with cell name:
mat_col <- colData(sce) %>% data.frame() %>% select(SampleName, SampleGroup) %>% unique
rownames(mat_col) <- mat_col$SampleName
mat_col$SampleName <- NULL

# Prepare colours for clusters:
colourCount = length(unique(mat_col$SampleGroup))
getPalette = grDevices::colorRampPalette(RColorBrewer::brewer.pal(9, "Set1"))

mat_colors <- list(SampleGroup = getPalette(colourCount))
names(mat_colors$SampleGroup) <- unique(mat_col$SampleGroup)

Heatmap, with samples ordered as in sample sheet:

# without column clustering
pheatmap(tmpTab,
           cluster_cols = FALSE,
           annotation_col    = mat_col,
           annotation_colors = mat_colors)

Heatmap, with samples ordered by similarity of cell distribution across clusters:

# with column clustering
pheatmap(tmpTab,
           annotation_col    = mat_col,
           annotation_colors = mat_colors)

If clusters mostly comprise cells from a single sample, this suggests that the samples differ and/or that a batch effect is present.

Let us show cluster assignments on the t-SNE, with cells colored by cluster and shaped by sample group.

# store cluster assignment in SCE object:
sce$cluster <- factor(hc.clusters)
# make, store and show TSNE plot:
g <- plotTSNE(sce, colour_by = "cluster", shape_by="SampleGroup")
g

Split by sample group:

# split by sample and show:
g <- g + facet_wrap(. ~ sce$SampleGroup)
g

In some areas cells are not all assigned to the same cluster.

We can also check the contribution of each sample by coloring cells by ‘SampleName’:

g <- plotTSNE(sce, colour_by = "SampleName", shape_by="SampleGroup") +
  facet_wrap(. ~ sce$SampleGroup)
g

1.2.1.2 Separatedness

The congruence of clusters may be assessed by computing the silhouette width for each cell. For cell i, with a(i) the average distance to the other cells in its cluster and b(i) the average distance to the cells in the nearest other cluster, the silhouette width is s(i) = (b(i) − a(i)) / max(a(i), b(i)). The larger the value, the closer the cell is to cells in its own cluster than to cells in other clusters; cells closer to cells in another cluster have a negative value. Good cluster separation is indicated by clusters whose cells have large silhouette widths.

We first compute silhouette values.

sil <- silhouette(hc.clusters, dist = pc.dist)

We then plot silhouettes with one color per cluster, showing cells with a negative silhouette in the color of their closest other cluster. We also add the average silhouette for each cluster and across all cells.

# prepare colours:
clust.col <- scater:::.get_palette("tableau10medium") # hidden scater colours
sil.cols <- clust.col[ifelse(sil[,3] > 0, sil[,1], sil[,2])]
sil.cols <- sil.cols[order(-sil[,1], sil[,3])]
# plot:
plot(sil,
     main = paste(length(unique(hc.clusters)), "clusters"),
     border = sil.cols,
     col = sil.cols,
     do.col.sort = FALSE) 

The plot shows cells with negative silhouette widths, indicating that too many clusters may have been defined. The method and parameters used may define clusters with properties that do not fit the data set, e.g. clusters with the same diameter.

1.2.2 k-means

1.2.2.1 Description

In k-means clustering, the goal is to partition N cells into k different clusters. In an iterative manner, cluster centers are defined and each cell is assigned to its nearest cluster.

The aim is to minimise within-cluster variation and maximise between-cluster variation, using the following steps (sketched in code after the pros and cons below):

  • randomly select k data points to serve as initial cluster centroids,
  • assign each point to the cluster with the closest centroid,
  • recompute each centroid as the mean of the points assigned to it (the ‘means’ in ‘k-means’),
  • repeat the assignment and update steps until cluster assignments no longer change or the maximum number of iterations is reached.

  • Pros:
    • fast
  • Cons:
    • assumes a pre-determined number of clusters
    • sensitive to outliers
    • tends to define equally-sized clusters
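To make the steps above concrete, here is a naive k-means sketch in plain R. It is an illustration only: it assumes no cluster ends up empty, and in practice one would use stats::kmeans() as below.

# naive k-means, directly following the steps above (illustration only):
naive_kmeans <- function(x, k, max.iter = 100) {
  # step 1: pick k points at random as initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(max.iter)) {
    # step 2: assign each point to its nearest centroid
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]
    cl <- max.col(-d)
    # step 3: recompute each centroid as the mean of its cluster
    # (assumes every cluster keeps at least one point)
    new.centers <- apply(x, 2, function(v) tapply(v, cl, mean))
    if (all(abs(new.centers - centers) < 1e-8)) break # converged
    centers <- new.centers
  }
  list(cluster = cl, centers = centers)
}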

1.2.2.2 Example

The dendrogram built above suggests there may be ten large cell populations.

Let us define ten clusters.

# define clusters with kmeans()
# because results depend on the initial cluster centers,
# it is usually best to try several times,
# by setting 'nstart' to say 20,
# kmeans() will then retain the run with the lowest within cluster variation.
kclust <- kmeans(pcs, centers = 10, nstart = 20) 

# compute silhouette
sil <- silhouette(kclust$cluster, dist(pcs))

# plot silhouette
clust.col <- scater:::.get_palette("tableau10medium") # hidden scater colours
sil.cols <- clust.col[ifelse(sil[,3] > 0, sil[,1], sil[,2])]
sil.cols <- sil.cols[order(-sil[,1], sil[,3])]
plot(sil, main = paste(length(unique(kclust$cluster)), "clusters"), 
    border=sil.cols, col=sil.cols, do.col.sort=FALSE) 

The t-SNE plot below shows cells color-coded by cluster membership:

sce$k10 <- factor(kclust$cluster)
# make, store and show TSNE plot:
g <- plotTSNE(sce, colour_by = "k10", shape_by="SampleGroup")
g

Split by sample type:

g + facet_wrap(~ sce$SampleGroup)

To find the most appropriate number of clusters, one repeats the analysis for a series of k values and, for each k, computes a ‘measure of fit’ of the clusters defined. Several metrics exist. We will look at:

  • within-cluster sum-of-squares
  • silhouette
  • gap statistic

1.2.2.3 Within-cluster sum-of-squares

The within-cluster sum-of-squares is the sum of the squared deviations between each observation and its cluster centroid. This metric measures the variability of observations within a cluster. It decreases as k increases, by an amount that itself decreases with k. Indeed, for low k values increasing k usually improves clustering, as shown by a sharp drop in the within-cluster sum-of-squares. For higher k values, increasing k does not reduce the within-cluster sum-of-squares much: dividing clusters further only slightly reduces the variability inside clusters. On a plot of the within-cluster sum-of-squares against k, the curve shows two parts: the first with a steep negative slope and the second with a shallow slope. The ‘elbow’ of the curve indicates the most appropriate number of clusters.

library(broom)
library(tibble)
library(tidyr)
library(purrr)

# get the corrected low-dimensional matrix (used as 'PCs')
points <- as_tibble(pcs)

# define clusters for different number of clusters
# from 1 to 20:
kclusts <- tibble(k = 1:20) %>%
  mutate(
    kclust = map(k, ~kmeans(points, .x)), # define clusters
    tidied = map(kclust, tidy), # 'flatten' kmeans() output into tibble
    glanced = map(kclust, glance), # summarise the model fit as a single-row data frame
    augmented = map(kclust, augment, points) # extract per-observation information
  )

# get cluster assignments
# unnest a list column with unnest(),
# i.e. make each element of the list its own row.
clusters <- kclusts %>%
  unnest(tidied)

# get assignments
assignments <- kclusts %>% 
  unnest(augmented)

# get clustering outcome
clusterings <- kclusts %>%
  unnest(glanced)

We now plot the total within-cluster sum-of-squares and decide on k.

ggplot(clusterings, aes(k, tot.withinss)) +
  geom_point() +
  geom_line()

clusterings %>% 
  mutate(tot.withinss.diff = tot.withinss - lag(tot.withinss)) %>%
  arrange(desc(tot.withinss.diff))
# get the smallest negative drop
k_neg <- clusterings %>% 
  mutate(tot.withinss.diff = tot.withinss - lag(tot.withinss)) %>%
  filter(tot.withinss.diff < 0) %>%
  slice_max(tot.withinss.diff, n = 1) %>%
  pull(k)
k_neg <- k_neg -1
# get the first positive diff
k_pos <- clusterings %>% 
  mutate(tot.withinss.diff = tot.withinss - lag(tot.withinss)) %>%
  filter(tot.withinss.diff >= 0) %>%
  slice_min(k, n = 1) %>%
  pull(k)
k_pos <- k_pos -1

The plot above suggests reasonable values for k may be:

  • 11 (where the drop in total within-cluster sum-of-squares is the smallest)
  • 10 (lowest cluster number where the difference in consecutive total within-cluster sum-of-squares is positive)

1.2.2.4 Silhouette

We now compute the average silhouette width for each set of clusters defined above, for the series of values of k.

# compute pairwise distance between cells in the PC space,
# as it is needed to compute silhouette:
pcDist <- dist(pcs)
# compute silhouette for each set of clusters defined above for a series of values of k:
Ks=sapply(2:20,
        function(i) {
            tmpClu <- as.numeric(kclusts$augmented[[i]]$.cluster)
            #table(tmpClu)
            sil <- silhouette(tmpClu, pcDist)
            summary(sil)$avg.width
        }
)
# plot average width against k:
plot(2:20, Ks,
     xlab="k",
     ylab="av. silhouette",
     type="b",
     pch=19)

1.2.2.5 Gap statistic

Because within-cluster variation necessarily decreases as the number of clusters increases, a more reliable approach is to compare the observed variation to that expected under a null distribution. With the gap statistic, the expected variation is computed:

  • by generating several reference data sets using a uniform distribution (no clusters),
  • by clustering each of them for the series of k values and computing the within-cluster variation.

For a given k, the gap statistic measures the difference between the expected and observed (log) within-cluster variation: Gap(k) = E[log(Wk)] − log(Wk), where Wk is the pooled within-cluster sum of squares. The most appropriate number of clusters is the one with the highest gap value.

We will use cluster::clusGap() to compute the gap statistic for k between 1 and 20.

set.seed(123)
gaps <- cluster::clusGap(
  x = pcs,
  FUNcluster = kmeans,
  K.max = 20,
  nstart = 5, # low for expediency here but should use higher
  B = 10 # low for expediency here but should use higher
)

## find the "best" k
best.k <- cluster::maxSE(gaps$Tab[, "gap"], gaps$Tab[, "SE.sim"])
# in case the the best k was '1',
# skip first row of output table:
gapsTab <- gaps$Tab[2:nrow(gaps$Tab),]
best.k <- cluster::maxSE(gapsTab[, "gap"], gapsTab[, "SE.sim"])

The “optimal” k value is 9.

We next copy the cluster assignment to the SCE object.

df <- as.data.frame(assignments)
# here only copy outcome for k == 10
sce$kmeans10 <- as.numeric(df[df$k == 10, ".cluster"])
clust.col <- scater:::.get_palette("tableau10medium") # hidden scater colours
sil <- silhouette(sce$kmeans10, dist = pc.dist)
sil.cols <- clust.col[ifelse(sil[,3] > 0, sil[,1], sil[,2])]
sil.cols <- sil.cols[order(-sil[,1], sil[,3])]
plot(sil, main = paste(length(unique(sce$kmeans10)), "clusters"), 
    border=sil.cols, col=sil.cols, do.col.sort=FALSE) 
# copy kbest
df <- as.data.frame(assignments)
sce$kmeans.best <- as.numeric(df[df$k == best.k, ".cluster"])

# plot sihouette
clust.col <- colorRampPalette(RColorBrewer::brewer.pal(9,"RdBu"))(best.k)
sil <- silhouette(sce$kmeans.best, dist = pc.dist)
sil.cols <- clust.col[ifelse(sil[,3] > 0, sil[,1], sil[,2])]
sil.cols <- sil.cols[order(-sil[,1], sil[,3])]
plot(sil, main = paste(length(unique(sce$kmeans.best)), "clusters"), 
    border=sil.cols, col=sil.cols, do.col.sort=FALSE) 

For the same number of clusters (10), fewer cells have negative values with k-means than with hierarchical clustering. The average silhouette is similar though, and all clusters show a wide range.

1.2.2.6 Multi-metric

One may also use multiple metrics rather than only one and check how they concord, using NbClust::NbClust() (see manual for a description of metrics computed):

# run only once and save outcome to file to load later.
tmpFn <- "../Robjects/caron_postDeconv_5hCellPerSpl_500CellK2to20NbClust.Rds"
if(file.exists(tmpFn))
{
    nb <- readRDS(file=tmpFn)
} else {
    library(NbClust)
    tmpInd <- sample(nrow(pcs), 500)
    nb = NbClust(data=pcs[tmpInd,],
                 distance = "euclidean",
                 min.nc = 2,
                 max.nc = 20,
                 method = "kmeans",
                 index=c("kl","ch","cindex","db","silhouette","duda","pseudot2",
                         "beale","ratkowsky","gap","gamma","mcclain","gplus","tau",
                         "sdindex","sdbw"))
    saveRDS(nb, file=tmpFn)
    rm(tmpInd)
}
rm(tmpFn)

Table showing, for each value of k, the number of methods selecting it as the best choice:

table(nb$Best.nc[1,])
## 
##  2  3  8  9 13 17 20 
##  7  4  1  1  1  1  1
rm(nb)

No strong support for a single sensible value of k is observed.

1.2.3 Graph-based clustering

Graph-based clustering entails building a nearest-neighbour (NN) graph using cells as nodes and their similarity as edges, then identifying ‘communities’ of cells within the network (see the OSCA chapter for more detail). In this k-NN graph, two nodes (cells), say A and B, are connected by an edge if the distance between them is amongst the k smallest distances from A to other cells (‘KNN’), and from B to other cells (shared-NN, ‘SNN’). An edge may be weighted based on the number or closeness of shared nearest neighbours. Clusters are identified using metrics related to the number of neighbours (‘connections’) to find groups of highly interconnected cells. “The value of ‘k’ can be roughly interpreted as the anticipated size of the smallest subpopulation” (see scran’s buildSNNGraph() manual).

  • Pros:
    • fast and memory efficient (avoids the need to construct a distance matrix for all pairs of cells)
    • no assumptions on the shape of the clusters or the distribution of cells within each cluster
    • no need to specify a number of clusters to identify (but the size of the neighbourhood used affects the size of clusters)
  • Cons:
    • loss of information beyond neighboring cells, which can affect community detection in regions with many cells.

The plot below (from the source material) shows the same data set as a network built using three different numbers of neighbours: 5, 15 and 25.
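A quick way to see the effect of ‘k’ on this data set is to build the graph with several values and count the communities found. A sketch only: it reuses the SCE object and the ‘corrected’ reduced dimensions introduced below, and may take a while to run.

# effect of the number of neighbours 'k' on the clusters found
# (larger k usually yields fewer, larger clusters):
sapply(c(5, 15, 25), function(k) {
  gr <- buildSNNGraph(sce, k = k, use.dimred = "corrected")
  length(unique(igraph::cluster_walktrap(gr)$membership))
})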

We will:

  • build the graph,
  • define clusters,
  • check membership across samples,
  • show membership on a t-SNE plot,
  • assess clustering quality.

We now build the shared nearest neighbour (SNN) graph, using scran’s buildSNNGraph() with:

  • the reduced and denoised data set (PCs)
  • the default number of neighbours (k=10)
  • the default type of edge weight (type=“rank”)

# compute graph using buildSNNGraph
# check manual with: ?buildSNNGraph
snn.gr <- buildSNNGraph(sce, use.dimred="corrected")

The number of cells in the data set is large so we randomly choose 1000 nodes (cells) in the network before plotting the resulting smaller network. Cells are color-coded by sample type:

# subset graph down to 1000 cells
# https://igraph.org/r/doc/subgraph.html

# Add vertex (node, i.e. cell) annotation
V(snn.gr)$SampleName <- colData(sce)$SampleName
V(snn.gr)$SampleGroup <- as.character(colData(sce)$SampleGroup)

# pick 1000 nodes (cells) at random
# (the variable is named 'edgesToGet' but holds vertex indices)
edgesToGet <- sample(nrow(snn.gr[]), 1000)

# subset graph for these 1000 randomly chosen nodes
snn.gr.subset <- subgraph(snn.gr, edgesToGet)
g <- snn.gr.subset

# set colors for sample groups
if(length(unique(V(g)$SampleGroup)) <= 2)
{
  cols <- colorspace::rainbow_hcl(length(unique(V(g)$SampleGroup)))
} else {
  cols <- RColorBrewer::brewer.pal(n = length(unique(V(g)$SampleGroup)), name = "RdYlBu")
}
names(cols) <- unique(V(g)$SampleGroup)

# plot graph
plot.igraph(
  g, layout = layout_with_fr(g),
  vertex.size = 3, vertex.label = NA,
  vertex.color = cols[V(g)$SampleGroup],
  frame.color = cols[V(g)$SampleGroup],
  main = "default parameters"
)

# add legend
legend('bottomright',
       legend=names(cols),
       pch=21,
       col=cols, # "#777777",
       pt.bg=cols,
       pt.cex=1, cex=.6, bty="n", ncol=1)

1.2.3.1 Modularity

Several methods to detect clusters (‘communities’) in networks rely on the ‘modularity’ metric. For a given partition of cells into clusters, modularity measures how separated clusters are from each other, based on the difference between the observed and expected weight of edges between nodes. For the whole graph, the closer to 1 the better.
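As a toy illustration (made-up graph; igraph::modularity() computes the score for a given membership vector):

# two 5-node cliques joined by a single edge: a clear community structure
toy.gr <- make_full_graph(5) %du% make_full_graph(5)
toy.gr <- add_edges(toy.gr, c(1, 6))
modularity(toy.gr, rep(1:2, each = 5)) # high (about 0.45): communities match the cliques
modularity(toy.gr, rep(1:2, 5))        # poor partition: near zero or negative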

1.2.3.2 Walktrap method

The walktrap method relies on short random walks (a few steps) through the network. These walks tend to be ‘trapped’ in highly-connected regions of the network, so node similarity can be measured from them. Nodes are first each assigned their own community. Pairwise distances are then computed and the two closest communities are merged; these steps are repeated to produce a dendrogram. The best partition is that with the highest modularity.

# identify clusters with walktrap
# default number of steps: 4
cluster.out <- cluster_walktrap(snn.gr)

We will now count cells in each cluster for each sample type and each sample.

We first retrieve membership.

# cluster assignments are stored in the membership slot
wt.clusters <- cluster.out$membership

Cluster number and sizes (number of cells):

table(wt.clusters)
## wt.clusters
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##   87   47  445 1037  236  135   75   26   17   45   60  177   79   81  182   14 
##   17   18 
##  142  615

The table below shows the cluster distribution across sample types. Several clusters are observed mostly, or only, in one sample type.

table(wt.clusters, sce$SampleGroup)
##            
## wt.clusters ETV6-RUNX1 HHD PBMMC PRE-T
##          1           0   0    87     0
##          2           1   0    46     0
##          3         362   0    83     0
##          4         987   0    50     0
##          5         150   0    86     0
##          6          33   0   102     0
##          7           3   0    72     0
##          8          10   0    16     0
##          9           3   0    14     0
##          10         14   0    31     0
##          11         27   0    33     0
##          12         68   0   109     0
##          13         39   0    40     0
##          14         12   0    69     0
##          15          5   0   177     0
##          16          0   0    14     0
##          17         79   0    63     0
##          18        207   0   408     0

The table below shows the cluster distribution across samples. Most clusters comprise cells from several replicates of the same sample type, while a few are observed in only one sample.

tmpTab <- table(wt.clusters, sce$SampleName)
rownames(tmpTab) = paste("4-step", rownames(tmpTab), sep = "_")

# columns annotation with cell name:
mat_col <- colData(sce) %>% data.frame() %>% select(SampleName, SampleGroup) %>% unique
rownames(mat_col) <- mat_col$SampleName
mat_col$SampleName <- NULL

# Prepare colours for clusters:
colourCount = length(unique(mat_col$SampleGroup))
getPalette = grDevices::colorRampPalette(RColorBrewer::brewer.pal(9, "Set1"))

mat_colors <- list(SampleGroup = getPalette(colourCount))
names(mat_colors$SampleGroup) <- unique(mat_col$SampleGroup)

pheatmap(tmpTab,
           annotation_col    = mat_col,
           annotation_colors = mat_colors)

We next check the effect of increasing the number of steps on the clusters detected (default: 4 steps).

With 15-step walks:

Cluster sizes:

cluster.out.s15 <- cluster_walktrap(snn.gr, steps = 15)
wt.clusters.s15 <- cluster.out.s15$membership
table(wt.clusters.s15)
## wt.clusters.s15
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##   71  502  143   47  654   75  190 1131   17  163   85   93  138   14  177

Comparison with 4-step clusters:

tmpTab <- table(wt.clusters, wt.clusters.s15)
rownames(tmpTab) = paste("4-step", rownames(tmpTab), sep = "_")
colnames(tmpTab) = paste("15-step", colnames(tmpTab) , sep = "_")
pheatmap(tmpTab)

The network may also be built and clusters identified in a single call with bluster::clusterRows(), which for large data sets can first group cells into many fine k-means clusters and then perform graph-based clustering on the k-means cluster centroids.

mat <- reducedDim(sce, "corrected")
dim(mat)
## [1] 3500   50
set.seed(100) # for the k-means
two.out <- clusterRows(mat, TwoStepParam(
  first = KmeansParam(centers=100),
  second = NNGraphParam(k=4, cluster.fun = "walktrap")),
  full = TRUE)
plotTSNE(sce, colour_by=I(two.out$clusters))

plotUMAP(sce, colour_by=I(two.out$clusters))

Distribution of the size of the 100 k-means clusters:

# 'kmeans': assignment of each cell to a k-means cluster
summary(data.frame(table(two.out$objects$first$kmeans$cluster))$Freq)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0    19.0    33.5    35.0    47.0   108.0

Graph-based cluster sizes in cell number:

# 'igraph': assignment of each cell to a graph-based cluster operating on the k-means clusters
table(two.out$clusters)
## 
##   1   2   3   4   5   6   7   8   9  10 
## 354 136 997 620 194 116 363 160 173 387

The heatmap below displays the contingency table for the graph-based clusters (rows) and the k-means clusters (columns) used to build the network (colour-coded by size, i.e. number of cells).

tmpTab <- table(
  two.out$clusters,
    two.out$objects$first$kmeans$cluster)
rownames(tmpTab) = paste("igraph", rownames(tmpTab), sep = "_")
colnames(tmpTab) = paste("kmeans", colnames(tmpTab) , sep = "_")
pheatmap(tmpTab, show_colnames = FALSE, cluster_rows = FALSE, cluster_cols = TRUE)

Graph-based cluster sizes, in number of k-means clusters:

# ‘membership’, the graph-based cluster to which each node is assigned
table(two.out$objects$second$communities$membership)
## 
##  1  2  3  4  5  6  7  8  9 10 
## 14  6 21 17  6  6  9  5  7  9

The plot below shows the network built with the 100 k-means clusters used to identify the graph-based clusters.

g <- two.out$objects$second$graph
V(g)$membership <- two.out$objects$second$communities$membership

# set colors for clusters
rgb.palette <- colorRampPalette(c("purple","yellow"), space="rgb") # Seurat-like
cols <- rgb.palette(length(unique(V(g)$membership)))
names(cols) <- as.character(1:length(unique(V(g)$membership)))

# plot graph
plot.igraph(
  g, layout = layout_with_fr(g),
  vertex.size = 3, vertex.label = NA,
  vertex.color = cols[V(g)$membership],
  frame.color = cols[V(g)$membership],
  main = "with 100 k-means clusters"
)

# add legend
legend('bottomright',
       legend=names(cols),
       pch=21,
       col=cols, # "#777777",
       pt.bg=cols,
       pt.cex=1, cex=.6, bty="n", ncol=2)

1.2.3.3 Louvain method

With the Louvain method, nodes are also first assigned their own community. This hierarchical agglomerative method then progresses in two-step iterations: 1) nodes are re-assigned one at a time to the community for which they increase modularity the most, if at all, 2) a new, ‘aggregate’ network is built where nodes are the communities formed in the previous step. This is repeated until modularity stops increasing. The diagram below is copied from this article.

We now apply the Louvain approach, store its outcome in the SCE object and show cluster sizes.

ig.louvain <- igraph::cluster_louvain(snn.gr)
cl <- ig.louvain$membership
cl <- factor(cl)
# store membership
sce$louvain <- cl
# show cluster sizes:
table(sce$louvain)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 518 437 634 184 141 165 607 282 161 245 109  17

The plot below displays the network with nodes (the 1000 randomly chosen cells) color-coded by cluster membership:

V(snn.gr)$SampleName <- colData(sce)$SampleName
V(snn.gr)$SampleGroup <- as.character(colData(sce)$SampleGroup)
V(snn.gr)$louvain <- as.character(colData(sce)$louvain)

# once only # edgesToGet <- sample(nrow(snn.gr[]), 1000)

snn.gr.subset <- subgraph(snn.gr, edgesToGet)
g <- snn.gr.subset

tmpLayout <- layout_with_fr(g)

rgb.palette <- colorRampPalette(c("purple","yellow"), space="rgb") # Seurat-like
cols <- rgb.palette(length(unique(V(g)$louvain)))
names(cols) <- as.character(1:length(unique(V(g)$louvain)))

# Plot the graph, color by cluster assignment
igraph::plot.igraph(
  g, layout = tmpLayout,
  vertex.color = cols[V(g)$louvain],
  frame.color = cols[V(g)$louvain],
  vertex.size = 3, vertex.label = NA, main = "Louvain"
)

# add legend
legend('bottomright',
       legend=names(cols),
       pch=21,
       col=cols, # "#777777",
       pt.bg=cols,
       pt.cex=1, cex=.6, bty="n", ncol=2)

The t-SNE plot shows cells color-coded by cluster membership:

fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
# show clusters on TSNE
p <- plotTSNE(sce, colour_by="louvain") + fontsize
p

Split by sample type:

# facet by sample group:
p + facet_wrap(~ sce$SampleGroup)

1.2.3.4 Leiden method

The Leiden method improves on the Louvain method by guaranteeing that at each iteration clusters are connected and well-separated. The method includes an extra step in the iterations: after nodes are moved (step 1), the resulting partition is refined (step 2), and only then is the new, aggregate network built and refined (step 3). The diagram below is copied from this article.

# logical flag checking whether the required python modules are available:
leidenOk <- FALSE
# use the reticulate R package to use python modules
library(reticulate)
# set path to python
use_python("/home/ubuntu/.local/share/r-miniconda/envs/r-reticulate/bin/python")
# check availability of python modules and set logical accordingly
if(reticulate::py_module_available("igraph")
& reticulate::py_module_available("leidenalg"))
{
    leidenOk <- TRUE
}

We now apply the Leiden approach, store its outcome in the SCE object and show cluster sizes.

# note: the chunk's 'eval' option is set to the logical above,
# so this code only runs if the required python modules were found.
library("leiden")
adjacency_matrix <- igraph::as_adjacency_matrix(snn.gr)
partition <- leiden(adjacency_matrix)
# store membership
sce$leiden <- factor(partition)
# show cluster sizes:
table(sce$leiden)
## 
##   1   2   3   4   5   6   7   8   9  10 
## 675 611 548 498 282 272 184 160 156 114

The plot below displays the network with nodes (the 1000 randomly chosen cells) color-coded by cluster membership:

V(snn.gr)$leiden <- as.character(colData(sce)$leiden)

# once only # edgesToGet <- sample(nrow(snn.gr[]), 1000)

snn.gr.subset <- subgraph(snn.gr, edgesToGet)
g <- snn.gr.subset

rgb.palette <- colorRampPalette(c("purple","yellow"), space="rgb") # Seurat-like
cols <- rgb.palette(length(unique(V(g)$leiden)))
names(cols) <- as.character(1:length(unique(V(g)$leiden)))

## Plot the graph, color by cluster assignment
igraph::plot.igraph(
  g, layout = tmpLayout,
  vertex.color = cols[V(g)$leiden],
  frame.color = cols[V(g)$leiden],
  vertex.size = 3, vertex.label = NA, main = "Leiden"
)
legend('bottomright',
       legend=names(cols),
       pch=21,
       col=cols, # "#777777",
       pt.bg=cols,
       pt.cex=2, cex=.8, bty="n", ncol=2)

The t-SNE plot below shows cells color-coded by cluster membership:

# show clusters on TSNE
p <- plotTSNE(sce, colour_by="leiden") + fontsize
p

Split by sample type:

# facet by sample group:
p + facet_wrap(~ sce$SampleGroup)

1.2.3.5 Modularity to assess cluster quality

We now compute modularity to assess cluster quality (the observed partition of cells into clusters is compared to that expected in an equivalent random network with no structure).

The overall modularity obtained with the Louvain method is 0.71.
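One way to compute it (a sketch; igraph’s modularity() takes the graph and the membership vector):

# overall modularity of the Louvain partition on the SNN graph:
igraph::modularity(snn.gr, ig.louvain$membership)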

Modularity can also be computed for each cluster, with scran’s clusterModularity() (now pairwiseModularity() in the bluster package). This function returns a matrix with a value for each pair of clusters (here, with get.weights=TRUE, the observed and expected edge weights for each pair).

# compute cluster-wise modularities,
# with clusterModularity(),
# which returns a matrix keeping values for each pair of clusters
mod.out <- bluster::pairwiseModularity(snn.gr,
                                       #wt.clusters,
                                       sce$louvain,
                                       get.weights=TRUE)
# Modularity is proportional to the cluster size,
# so we compute the ratio of the observed to expected weights
# for each cluster pair
ratio <- mod.out$observed/mod.out$expected
lratio <- log10(ratio + 1) # on log scale to improve colour range

We display below the cluster-wise modularity on a heatmap. Clusters that are well separated mostly comprise intra-cluster edges and harbour a high modularity score on the diagonal and low scores off that diagonal. Two poorly separated clusters will share edges and the pair will have a high score.

pheatmap(lratio, cluster_rows=FALSE, cluster_cols=FALSE, 
    color=colorRampPalette(c("white", "blue"))(100))

The cluster-wise modularity matrix may also be shown as a network, with clusters as nodes and edges weighted by modularity:

cluster.gr <- igraph::graph_from_adjacency_matrix(
    ratio, 
    mode="undirected",
    weighted=TRUE,
    diag=FALSE)
plot(cluster.gr,
     vertex.label.cex = 0.2, # shrink cluster labels
     edge.width = igraph::E(cluster.gr)$weight)

1.2.3.6 Choose the number of clusters

Some community detection methods are hierarchical, e.g. walktrap. With these methods, one can choose the number of clusters.

We can now retrieve a given number of clusters, e.g. 10 here to match the k-means clusters defined above:

# remember cluster.out keeps the walktrap outcome
cluster.out.10 <- igraph::cut_at(cluster.out, no = 10)
cluster.out.10 <- factor(cluster.out.10)
table(cluster.out.10)
## cluster.out.10
##    1    2    3    4    5    6    7    8    9   10 
##  162 2731  139   47   26   17   45  177   14  142

The plot below displays the network with nodes (the 1000 randomly chosen cells) color-coded by cluster membership, split by sample type:

# annotate nodes
V(snn.gr)$walkCutAt10 <- as.character(cluster.out.10)

# subset graph
snn.gr.subset <- subgraph(snn.gr, edgesToGet)
g <- snn.gr.subset

# make colors:
cols <- rgb.palette(length(unique(V(g)$walkCutAt10)))
names(cols) <- as.character(1:length(unique(V(g)$walkCutAt10)))

# Plot the graph, color by cluster assignment
igraph::plot.igraph(
  g, layout = tmpLayout,
  vertex.color = cols[V(g)$walkCutAt10],
  vertex.size = 3, vertex.label = NA, main = "walkCutAt10"
)
legend('bottomright',
      legend=names(cols),
       pch=21,
       col="#777777",
       pt.bg=cols,
       pt.cex=2, cex=.8, bty="n", ncol=2)

The t-SNE plot below shows cells color-coded by cluster membership split by sample type:

sce$cluster.out.10 <- cluster.out.10
sce$cluster.out.10.col <- cols[cluster.out.10]

# show clusters on TSNE
#p <- plotTSNE(sce, colour_by="cluster.out.10") + fontsize
tSneCoord <- as.data.frame(reducedDim(sce, "TSNE"))
colnames(tSneCoord) <- c("TSNE1", "TSNE2")
# add sample type:
tSneCoord$SampleGroup <- sce$SampleGroup
# add cluster info (as a factor, to colour by):
tSneCoord$cluster.out.10 <- sce$cluster.out.10
# draw plot
p2 <- ggplot(tSneCoord, aes(x=TSNE1, y=TSNE2,
                color=cluster.out.10,
                shape=SampleGroup)) +
    geom_point()
p2 <- p2 + scale_color_manual(values = cols)
# facet by sample group:
p2 + facet_wrap(~ sce$SampleGroup) +
      theme(legend.text=element_text(size=rel(0.7)))

1.3 Comparing two sets of clusters

The contingency table below shows the concordance between the walk-trap (rows) and k-means (columns) clustering methods:

# rows: walktrap
# columns: kmeans
tmpTab <- table(sce$cluster.out.10, sce$kmeans10)
tmpTab
##     
##        1   2   3   4   5   6   7   8   9  10
##   1  130   1   0   0   0   0   0   2  29   0
##   2    1 746 399  11   1 194   0 482 259 638
##   3    0   0   0   0 132   0   7   0   0   0
##   4    0   0   0   0   0   0  10   0  37   0
##   5    0   0   0   0   1   0  25   0   0   0
##   6    0   0   0   9   0   4   0   0   4   0
##   7    0   0   0   0   0   0  45   0   0   0
##   8    0   0   0 177   0   0   0   0   0   0
##   9    0   0   0   3   0   0   0   1  10   0
##   10   0   0   0   0 140   0   1   1   0   0

Concordance is moderate, with some clusters defined with one method comprising cells assigned to several clusters defined with the other method.

Concordance can also be displayed on a heatmap:

rownames(tmpTab) = paste("walk", rownames(tmpTab), sep = "_")
colnames(tmpTab) = paste("kmeans10", colnames(tmpTab) , sep = "_")
pheatmap(tmpTab)

Concordance can also be measured with the adjusted Rand index, which ranges from about 0 for random partitions (it can be slightly negative) to 1 for perfect agreement. Here, the adjusted Rand index is 0.1472816.
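One way to compute it (a sketch, e.g. with pairwiseRand() from bluster, loaded above; mode = "index" returns the single adjusted Rand index):

# adjusted Rand index between the two partitions:
bluster::pairwiseRand(sce$cluster.out.10, sce$kmeans10, mode = "index")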

1.4 Expression of known marker genes

We first show clusters on a t-SNE plot:

cluToUse <- "louvain"
plotTSNE(sce, colour_by = cluToUse) +
  ggtitle("louvain clusters")

Having identified clusters, we now display the expression level of cell type marker genes to quickly match clusters with cell types (e.g. the B-cell marker gene MS4A1). For each marker we will plot its expression on a t-SNE, and show its distribution across clusters on a violin plot.

Example with B cell marker genes MS4A1 and CD79A:

rownames(sce) <- scater::uniquifyFeatureNames(
  rowData(sce)$ID,
  rowData(sce)$Symbol)
p1 <- plotTSNE(sce, by_exprs_values = "reconstructed", colour_by = "MS4A1") +
  ggtitle("B cells: MS4A1")
p2 <- plotTSNE(sce, by_exprs_values = "reconstructed", colour_by = "CD79A") +
  ggtitle("B cells: CD79A")
gridExtra::grid.arrange(p1, p2, ncol=2)

rm(p1, p2)
p1 <- plotExpression(sce,
               exprs_values = "reconstructed",
               x=cluToUse,
               colour_by=cluToUse,
               features= "MS4A1") +
  ggtitle("B cells")
p2 <- plotExpression(sce,
               exprs_values = "reconstructed",
               x=cluToUse,
               colour_by=cluToUse,
               features= "CD79A") +
  ggtitle("B cells")
gridExtra::grid.arrange(p1, p2, ncol=2)

rm(p1, p2)

We now show the distribution of expression levels for marker genes of other PBMC types.

markGenesFull <- list(
  "Naive CD4+ T cells"=c("IL7R", "CCR7"),
  "Memory CD4+ T cells"=c("IL7R", "S100A4"),
  "B cells"=c("MS4A1"),
  "CD8+ T cells"=c("CD8A"),
  "NK cells"=c("GNLY", "NKG7"),
  "CD14+ Monocytes"=c("CD14", "LYZ"),
  "FCGR3A+ Monocytes"=c("FCGR3A", "MS4A7"),
  "Dendritic Cells"=c("FCER1A", "CST3"),
  "Platelets"=c("PPBP")
)
markGenesAvail <- lapply(markGenesFull,
                         function(x){
                           x[x %in% rownames(sce)]
                         })
plotExpression2 <- function(sce, ctx, markGenesAvail) {
  if(length(markGenesAvail[[ctx]])>0) {
    plotExpression(sce, exprs_values = "reconstructed",
                   x=cluToUse, colour_by=cluToUse,
                   features=markGenesAvail[[ctx]]) + ggtitle(ctx)
  }
}

# Naive CD4+ T
ctx <- "Naive CD4+ T cells"
plotExpression2(sce, ctx, markGenesAvail)

# Memory CD4+
ctx <- "Memory CD4+ T cells"
plotExpression2(sce, ctx, markGenesAvail)

# B cells
ctx <- "B cells"
plotExpression2(sce, ctx, markGenesAvail)

# CD8+ T
ctx <- "CD8+ T cells"
plotExpression2(sce, ctx, markGenesAvail)

# NK
ctx <- "NK cells"
plotExpression2(sce, ctx, markGenesAvail)

# CD14+ Mono
ctx <- "CD14+ Monocytes"
plotExpression2(sce, ctx, markGenesAvail)

# FCGR3A+ Mono
ctx <- "FCGR3A+ Monocytes"
plotExpression2(sce, ctx, markGenesAvail)

# DC
ctx <- "Dendritic Cells"
plotExpression2(sce, ctx, markGenesAvail)

# Platelet
ctx <- "Platelets"
plotExpression2(sce, ctx, markGenesAvail)

1.5 Save data

Write SCE object to file.

saveRDS(sce, file="../caron_postDeconv_5hCellPerSpl_dsi_PBMMC_ETV6-RUNX1_clust.Rds")

1.6 Exercise

Imagine that sample ETV6-RUNX1_4 failed, leaving you with three ETV6-RUNX1 replicates … but assume that the corrected values remain the same.

Define Louvain and Leiden clusters and compare the outcome. Maybe check clustering quality using per-cluster modularity, too.
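A possible starting point (a sketch, assuming the objects above; it simply drops the failed sample before rebuilding the graph, and covers the Louvain part only):

# drop the failed sample, rebuild the SNN graph, then cluster:
sce.sub <- sce[, sce$SampleName != "ETV6-RUNX1_4"]
snn.gr.sub <- buildSNNGraph(sce.sub, use.dimred = "corrected")
sce.sub$louvain <- factor(igraph::cluster_louvain(snn.gr.sub)$membership)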

1.7 Session information

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
##  [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
##  [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
## [10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] broom_0.7.8                 patchwork_1.1.1            
##  [3] pheatmap_1.0.12             leiden_0.3.8               
##  [5] reticulate_1.20             igraph_1.2.6               
##  [7] cluster_2.1.2               dynamicTreeCut_1.63-1      
##  [9] bluster_1.2.1               scran_1.20.1               
## [11] scater_1.20.1               scuttle_1.2.0              
## [13] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
## [15] Biobase_2.52.0              GenomicRanges_1.44.0       
## [17] GenomeInfoDb_1.28.1         IRanges_2.26.0             
## [19] S4Vectors_0.30.0            BiocGenerics_0.38.0        
## [21] MatrixGenerics_1.4.0        matrixStats_0.59.0         
## [23] forcats_0.5.1               stringr_1.4.0              
## [25] dplyr_1.0.7                 purrr_0.3.4                
## [27] readr_1.4.0                 tidyr_1.1.3                
## [29] tibble_3.1.2                ggplot2_3.3.5              
## [31] tidyverse_1.3.1             knitr_1.33                 
## 
## loaded via a namespace (and not attached):
##  [1] ggbeeswarm_0.6.0          colorspace_2.0-2         
##  [3] ellipsis_0.3.2            mclust_5.4.7             
##  [5] XVector_0.32.0            BiocNeighbors_1.10.0     
##  [7] fs_1.5.0                  rstudioapi_0.13          
##  [9] farver_2.1.0              fansi_0.5.0              
## [11] lubridate_1.7.10          xml2_1.3.2               
## [13] sparseMatrixStats_1.4.0   jsonlite_1.7.2           
## [15] dbplyr_2.1.1              png_0.1-7                
## [17] compiler_4.1.0            httr_1.4.2               
## [19] dqrng_0.3.0               backports_1.2.1          
## [21] assertthat_0.2.1          Matrix_1.3-4             
## [23] limma_3.48.1              cli_3.0.1                
## [25] BiocSingular_1.8.1        htmltools_0.5.1.1        
## [27] tools_4.1.0               rsvd_1.0.5               
## [29] gtable_0.3.0              glue_1.4.2               
## [31] GenomeInfoDbData_1.2.6    rappdirs_0.3.3           
## [33] Rcpp_1.0.7                cellranger_1.1.0         
## [35] jquerylib_0.1.4           vctrs_0.3.8              
## [37] DelayedMatrixStats_1.14.0 xfun_0.24                
## [39] beachmat_2.8.0            rvest_1.0.0              
## [41] lifecycle_1.0.0           irlba_2.3.3              
## [43] statmod_1.4.36            edgeR_3.34.0             
## [45] zlibbioc_1.38.0           scales_1.1.1             
## [47] hms_1.1.0                 RColorBrewer_1.1-2       
## [49] yaml_2.2.1                gridExtra_2.3            
## [51] sass_0.4.0                stringi_1.7.3            
## [53] highr_0.9                 ScaledMatrix_1.0.0       
## [55] BiocParallel_1.26.1       rlang_0.4.11             
## [57] pkgconfig_2.0.3           bitops_1.0-7             
## [59] evaluate_0.14             lattice_0.20-44          
## [61] labeling_0.4.2            cowplot_1.1.1            
## [63] tidyselect_1.1.1          magrittr_2.0.1           
## [65] R6_2.5.0                  generics_0.1.0           
## [67] metapod_1.0.0             DelayedArray_0.18.0      
## [69] DBI_1.1.1                 pillar_1.6.1             
## [71] haven_2.4.1               withr_2.4.2              
## [73] RCurl_1.98-1.3            modelr_0.1.8             
## [75] crayon_1.4.1              utf8_1.2.1               
## [77] rmarkdown_2.9             viridis_0.6.1            
## [79] locfit_1.5-9.4            grid_4.1.0               
## [81] readxl_1.3.1              reprex_2.0.0             
## [83] digest_0.6.27             munsell_0.5.0            
## [85] beeswarm_0.4.0            viridisLite_0.4.0        
## [87] vipor_0.4.5               bslib_0.2.5.1