Clustering

September 2022

Single Cell RNAseq Analysis Workflow

Motivation

The data have been QC’d, normalized and batch corrected.

We can now ask biological questions.

unsupervised clustering: identification of groups of cells based on the similarity of their transcriptomes, without any prior knowledge of the labels, usually using the PCA output
de novo discovery and annotation of cell-types based on transcription profiles

Graph-based clustering

Nearest-Neighbour (NN) graph:

cells as nodes
their similarity as edges

In a NN graph two nodes (cells), say X and Y, are connected by an edge if:

the distance between them is amongst the k smallest distances from X to other cells: k-nearest-neighbour graph (‘KNN’)

the above holds and the distance between them is also amongst the k smallest distances from Y to other cells: shared-NN graph (‘SNN’).

Once edges have been defined, they can be weighted by various metrics.
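The edge rules above can be sketched on a toy data set. This is a minimal illustration in plain Python, not the implementation used in the workflow: the function names, the toy coordinates and the choice of k are all made up for the example, the stricter rule is read here as requiring each cell to be among the other's k nearest neighbours, and the Jaccard weighting is only one possible choice of metric.

```python
import math

def knn_sets(points, k):
    """For each point, return the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        # Euclidean distance from point i to every other point, smallest first
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append({j for _, j in dists[:k]})
    return neighbours

def knn_edges(neighbours):
    """X-Y is an edge if Y is among the k nearest neighbours of X, or X among those of Y."""
    return {tuple(sorted((i, j))) for i, nn in enumerate(neighbours) for j in nn}

def mutual_edges(neighbours):
    """Stricter rule: X-Y is an edge only if each is among the other's k nearest neighbours."""
    return {(i, j) for (i, j) in knn_edges(neighbours)
            if j in neighbours[i] and i in neighbours[j]}

def jaccard_weights(edges, neighbours):
    """Assumed weighting: Jaccard index of the two neighbour sets (each including the cell itself)."""
    weights = {}
    for i, j in edges:
        a, b = neighbours[i] | {i}, neighbours[j] | {j}
        weights[(i, j)] = len(a & b) / len(a | b)
    return weights

# Hypothetical 2-D 'PCA' coordinates: two tight groups of three cells and one straggler
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
         (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
         (2.0, 2.0)]
nn = knn_sets(cells, k=2)
knn = knn_edges(nn)          # the straggler is attached to its two nearest cells
strict = mutual_edges(nn)    # the straggler loses its edges: nobody lists it as a neighbour
weights = jaccard_weights(strict, nn)
```

With the stricter rule the straggler (cell 6) ends up with no edges, because none of the other cells count it among their two nearest neighbours.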

Graph-based clustering

Example with different numbers of neighbours:

Graph-based clustering

Pros

fast and memory efficient (no distance matrix for all pairs of cells)
no assumptions on the shape of the clusters or the distribution of cells within each cluster
no need to specify a number of clusters to identify

Cons

loss of information beyond neighboring cells, which can affect community detection in regions with many cells.

Modularity

Several methods to detect clusters (‘communities’) in networks rely on the ‘modularity’ metric.

Modularity measures how separated clusters are from each other.

Modularity is the (scaled) difference between the observed total weight of the edges within clusters and the total weight expected if the edges were randomly distributed between all nodes.

For the whole graph, the closer to 1 the better.
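As a worked example, the sketch below computes the usual (Newman) modularity of a partition, assuming an undirected graph without self-loops and with each edge listed once. The function name and the toy graph (two triangles joined by a single edge) are illustrative only.

```python
def modularity(edges, membership):
    """Modularity of a partition of an undirected, weighted graph.

    edges: dict mapping (node_a, node_b) -> weight, each edge listed once
    membership: dict mapping node -> community label
    """
    total = sum(edges.values())   # total edge weight, m
    within = {}                   # observed edge weight inside each community
    degree = {}                   # summed weighted degree of each community
    for (a, b), w in edges.items():
        ca, cb = membership[a], membership[b]
        degree[ca] = degree.get(ca, 0.0) + w
        degree[cb] = degree.get(cb, 0.0) + w
        if ca == cb:
            within[ca] = within.get(ca, 0.0) + w
    # Per community: observed fraction of weight inside it,
    # minus the fraction expected if edges were placed at random
    return sum(within.get(c, 0.0) / total - (d / (2 * total)) ** 2
               for c, d in degree.items())

# Two triangles (0-1-2 and 3-4-5) joined by one edge (2-3), all weights 1
toy_edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1,
             (3, 4): 1, (3, 5): 1, (4, 5): 1,
             (2, 3): 1}
two_clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_cluster = {node: "A" for node in range(6)}

q_two = modularity(toy_edges, two_clusters)   # 5/14, about 0.357
q_one = modularity(toy_edges, one_cluster)    # 0: no better than random
```

Splitting the graph into its two triangles scores 5/14, whereas putting every node in one cluster scores 0.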

Walktrap

The walktrap method relies on short random walks (a few steps) through the network.

These walks tend to be ‘trapped’ in highly-connected regions of the network.

Node similarity is measured based on these walks.

Nodes are first each assigned their own community.
Pairwise distances between communities are computed and the two closest communities are merged.
These steps are repeated until all communities have been merged, which produces a dendrogram.
This amounts to hierarchical clustering applied to the distance matrix.
The best partition is the cut of the dendrogram with the highest modularity.
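To illustrate how short random walks yield a node similarity, the sketch below computes a walk-based distance between nodes, assuming the distance used in the walktrap method as described by Pons and Latapy: two nodes are close if walks of a few steps starting from each of them end up in the same places. Only the distance is shown, not the merging steps; the function names, the toy graph and the choice of three steps are assumptions for the example.

```python
import math

def walk_probabilities(adjacency, steps):
    """Return (probs, degree): probs[i][j] is the chance that a random walk
    of the given number of steps starting at node i ends at node j."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    # One step: move to a neighbour with probability proportional to the edge weight
    step = [[adjacency[i][j] / degree[i] for j in range(n)] for i in range(n)]
    probs = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        probs = [[sum(probs[i][k] * step[k][j] for k in range(n)) for j in range(n)]
                 for i in range(n)]
    return probs, degree

def walk_distance(probs, degree, i, j):
    """Degree-scaled Euclidean distance between the walk distributions of nodes i and j."""
    return math.sqrt(sum((probs[i][k] - probs[j][k]) ** 2 / degree[k]
                         for k in range(len(degree))))

# Two triangles (0-1-2 and 3-4-5) joined by one edge (2-3)
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
probs, degree = walk_probabilities(A, steps=3)
d_same = walk_distance(probs, degree, 0, 1)    # two nodes of the same triangle
d_other = walk_distance(probs, degree, 0, 4)   # nodes in different triangles
```

Walks starting in one triangle mostly stay in it, so nodes 0 and 1 are much closer to each other than nodes 0 and 4.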

Walktrap

Network example:

Louvain

Hierarchical agglomerative method

Nodes are also first assigned their own community.

Two-step iterations:

nodes are re-assigned one at a time to the community for which they increase modularity the most,
a new, ‘aggregate’ network is built where nodes are the communities formed in the previous step.

This is repeated until modularity stops increasing.
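The two steps can be sketched as follows. This is a deliberately naive illustration, not the published implementation: modularity is recomputed from scratch for every candidate move, which is only practical on a toy graph, and all function names and the example network are assumptions.

```python
def modularity(adj, community):
    """adj: dict node -> {neighbour: weight}, symmetric.
    A self-loop entry adj[v][v] is taken to hold twice the internal weight."""
    m2 = sum(sum(nbrs.values()) for nbrs in adj.values())   # twice the total weight
    q = 0.0
    for c in set(community.values()):
        nodes = [v for v in adj if community[v] == c]
        inside = sum(w for v in nodes for u, w in adj[v].items() if community[u] == c)
        tot = sum(sum(adj[v].values()) for v in nodes)
        q += inside / m2 - (tot / m2) ** 2
    return q

def local_moving(adj):
    """Step 1: move nodes one at a time to the neighbouring community
    that increases modularity the most, until no move helps."""
    community = {v: v for v in adj}          # every node starts in its own community
    improved = True
    while improved:
        improved = False
        for v in adj:
            current = community[v]
            best_c, best_q = current, modularity(adj, community)
            for c in {community[u] for u in adj[v]} - {current}:
                community[v] = c             # try the move
                q = modularity(adj, community)
                if q > best_q + 1e-12:
                    best_c, best_q = c, q
            community[v] = best_c            # keep the best move (or stay put)
            if best_c != current:
                improved = True
    return community

def aggregate(adj, community):
    """Step 2: build the network whose nodes are the communities from step 1."""
    agg = {c: {} for c in set(community.values())}
    for v, nbrs in adj.items():
        for u, w in nbrs.items():
            cv, cu = community[v], community[u]
            agg[cv][cu] = agg[cv].get(cu, 0.0) + w
    return agg

# Two triangles (0-1-2 and 3-4-5) joined by one edge (2-3), all weights 1
adj = {0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {0: 1.0, 1: 1.0, 3: 1.0},
       3: {2: 1.0, 4: 1.0, 5: 1.0}, 4: {3: 1.0, 5: 1.0}, 5: {3: 1.0, 4: 1.0}}
community = local_moving(adj)
agg = aggregate(adj, community)
```

On this toy network the first step recovers the two triangles, and the aggregate network has two nodes joined by a single edge of weight 1. A full run would apply `local_moving` again to `agg` and stop once modularity no longer increases.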

(Blondel et al, Fast unfolding of communities in large networks)

(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)

Leiden

Issue with the Louvain method: some communities may be disconnected:

(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)
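A disconnected community can be detected with a simple graph traversal restricted to the community's members. The sketch below is a generic breadth-first search, not code from the cited paper; the function name and the toy path graph are made up for the example.

```python
from collections import deque

def is_connected(adj, members):
    """True if every member can be reached from any other member
    using only edges between members of the community."""
    members = set(members)
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u in members and u not in seen:
                seen.add(u)
                queue.append(u)
    return seen == members

# A path 0-1-2-3-4: node 2 is the only bridge between {0, 1} and {3, 4}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

whole = is_connected(path, {0, 1, 2, 3, 4})   # True
broken = is_connected(path, {0, 1, 3, 4})     # False: the bridge node is missing
```

If a bridging node such as node 2 is moved to another community, the nodes it leaves behind remain labelled as one community even though no path between them stays inside it.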

Leiden

The Leiden method improves on the Louvain method by guaranteeing that at each iteration clusters are connected and well-separated.

The method includes an extra step in the iterations:

after nodes are moved (step 1),
the resulting partition is refined (step 2)
and only then is the new aggregate network built, from the refined partition (step 3).

(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)

Separatedness - silhouette width

Congruence of clusters may be assessed by computing the silhouette width for each cell.

For each cell, calculate the average distance to all other cells in its own cluster and the average distance to all cells in the nearest other cluster. The cell’s silhouette width is the difference between these two values (the latter minus the former) divided by the larger of the two.

Cells with a large silhouette width are strongly related to the other cells in their cluster; cells with a negative silhouette width are more closely related to cells in another cluster.

Good cluster separation is indicated by clusters whose cells have large silhouette values.
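The calculation can be sketched in a few lines. The code below uses the standard silhouette definition, in which the 'other' distance is the average distance to the nearest other cluster (with only two clusters, as in the toy data, this is simply the other cluster). The function name, the coordinates and the deliberately mislabelled cell are assumptions for the example.

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width of each cell, from Euclidean distances."""
    widths = []
    for i, p in enumerate(points):
        # Distances from cell i to every other cell, grouped by cluster
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            widths.append(0.0)     # convention: singleton cluster or single cluster
            continue
        a = sum(own) / len(own)                                   # own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())     # nearest other cluster
        largest = max(a, b)
        widths.append((b - a) / largest if largest > 0 else 0.0)
    return widths

# Two well-separated groups; the last cell sits in group A but is labelled B
cells = [(0, 0), (0, 1), (1, 0),
         (10, 10), (10, 11), (11, 10),
         (0.5, 0.5)]
labels = ["A", "A", "A", "B", "B", "B", "B"]
widths = silhouette_widths(cells, labels)
```

The correctly labelled cells get widths close to 1, while the mislabelled cell gets a width close to -1, flagging it as closer to cluster A than to its assigned cluster.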

Cluster-wise modularity to assess cluster quality

Clusters that are well separated mostly comprise intra-cluster edges: they have a high modularity score on the diagonal of the cluster-by-cluster matrix and low scores off the diagonal.

Two poorly separated clusters will share many edges, and the pair will have a high off-diagonal score.
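One way to obtain such a cluster-by-cluster score is sketched below. It assumes the score for a pair of clusters is the ratio of the observed edge weight between them to the weight expected if edges were randomly distributed; the original figure may have used a different scaling. The function name and the toy graph are illustrative.

```python
def pairwise_modularity(edges, membership):
    """Observed / expected edge weight for every pair of clusters.

    edges: dict mapping (node_a, node_b) -> weight, each edge listed once
    membership: dict mapping node -> cluster label
    """
    total = sum(edges.values())          # total edge weight, m
    degree, observed = {}, {}
    for (a, b), w in edges.items():
        ca, cb = membership[a], membership[b]
        degree[ca] = degree.get(ca, 0.0) + w
        degree[cb] = degree.get(cb, 0.0) + w
        pair = tuple(sorted((ca, cb)))
        observed[pair] = observed.get(pair, 0.0) + w
    ratios = {}
    clusters = sorted(degree)
    for i, ca in enumerate(clusters):
        for cb in clusters[i:]:
            if ca == cb:
                # expected weight of edges with both ends in the same cluster
                expected = degree[ca] ** 2 / (4 * total)
            else:
                # expected weight of edges running between the two clusters
                expected = degree[ca] * degree[cb] / (2 * total)
            ratios[(ca, cb)] = observed.get((ca, cb), 0.0) / expected
    return ratios

# Two triangles (0-1-2 and 3-4-5) joined by one edge (2-3), all weights 1
toy_edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1,
             (3, 4): 1, (3, 5): 1, (4, 5): 1,
             (2, 3): 1}
clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
ratios = pairwise_modularity(toy_edges, clusters)
```

For these two well-separated clusters the diagonal entries are above 1 (more intra-cluster weight than expected at random) and the off-diagonal entry is well below 1.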