Introduction to single-cell RNA-seq analysis

November 2021

Outline

Motivation
Initial methods
Graph-based methods
- walktrap
- louvain
- leiden

Single Cell RNAseq Analysis Workflow

Motivation

The data has been QCed and normalized, confounders removed, noise limited, dimensionality reduced.

We can now ask biological questions.

de novo discovery and annotation of cell-types based on transcription profiles
unsupervised clustering:
- identification of groups of cells
- based on the similarities of the transcriptomes
- without any prior knowledge of the labels
- usually using the PCA output

Motivation

We will introduce three widely used clustering methods:

hierarchical
k-means
graph-based

The first two were developed first and are faster for small data sets.

The third is more recent and better suited for scRNA-seq, especially large data sets.

All three identify non-overlapping clusters.

Hierarchical clustering

Hierarchical clustering builds:

a hierarchy of clusters
yielding a dendrogram (i.e. tree)
- that groups together cells with similar expression patterns
- across the chosen genes.

There are two types of strategies:

Agglomerative (bottom-up):
- each observation (cell) starts in its own cluster,
- pairs of clusters are merged as one moves up the hierarchy.
Divisive (top-down):
- all observations (cells) start in one cluster,
- splits are performed recursively as one moves down the hierarchy.
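
The agglomerative strategy can be sketched in a few lines of Python. This is a minimal illustration, not an optimised implementation: the function name `agglomerative`, the toy coordinates and the choice of average linkage are assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Python 3.8+


def agglomerative(points, n_clusters):
    """Bottom-up clustering with average linkage (linkage choice is an assumption)."""

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    # Each observation (cell) starts in its own cluster (a list of point indices).
    clusters = [[i] for i in range(len(points))]
    merges = []  # order of the merges, i.e. the information held by the dendrogram
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
        merges.append((a, b))
    return clusters, merges


# Toy 'cells' in a 2-D reduced space (hypothetical values).
cells = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
clusters, merges = agglomerative(cells, n_clusters=2)
```

Stopping at `n_clusters` amounts to cutting the dendrogram at one level; running down to a single cluster would record the full hierarchy in `merges`.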

Hierarchical clustering

The raw data:

The hierarchical clustering dendrogram:

Hierarchical clustering

Example: the Caron data set:

Pros:

deterministic method
returns partitions at all levels along the dendrogram

Cons:

computationally expensive in time and memory
- both increase at least in proportion
- to the square of the number of data points

k-means clustering

Goal: partition cells into k different clusters.

In an iterative manner,

cluster centers are defined
each cell is assigned to its nearest cluster

Aim:

minimise within-cluster variation
maximise between-cluster variation

Pros:

fast

Cons:

assumes a pre-determined number of clusters
sensitive to outliers
tends to define equally-sized clusters

k-means clustering

Set of steps to repeat:

randomly select k data points to serve as initial cluster centers,
for each data point, 1) compute the distance to each centroid, 2) assign it to the closest cluster,
calculate the mean of each cluster (the ‘mean’ in ‘k-means’) to define its new centroid,
for each point, compute the distance to these new centroids and re-assign it to the closest,
repeat until the clusters no longer change (i.e. the distance between centroids and data points is minimal) or the maximum number of iterations is reached,
compute the total variation within clusters.
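
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `kmeans`, the `seed` and `max_iter` parameters and the toy coordinates are all chosen for the example.

```python
import random
from math import dist  # Python 3.8+


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Randomly select k data points to serve as initial cluster centers.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its closest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Stop when the clusters no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Total within-cluster variation: sum of squared distances to the centroids.
    wss = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, wss


# Two well-separated toy groups (hypothetical values).
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids, wss = kmeans(cells, k=2)
```

Because the initial centers are random, different seeds can give different partitions on less clear-cut data; running several starts and keeping the lowest `wss` is a common precaution.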

Separatedness

Congruence of clusters may be assessed by computing the silhouette for each cell.

The larger the value, the closer the cell is to cells in its own cluster than to cells in other clusters.

Cells that are closer to cells in another cluster than to cells in their own cluster have a negative value.

Good cluster separation is indicated by clusters whose cells have large silhouette values.
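
As an illustration, the silhouette of a cell can be computed as (b - a) / max(a, b), where a is its mean distance to the other cells of its own cluster and b is its mean distance to the cells of the closest other cluster. The sketch below assumes Euclidean distances; the function name `silhouette_widths` and the toy data are hypothetical.

```python
from math import dist  # Python 3.8+


def silhouette_widths(points, labels):
    widths = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from cell i to every other cell, grouped by cluster label.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(lab, []).append(dist(p, q))
        if own not in by_cluster or len(by_cluster) < 2:
            # Singleton cluster, or only one cluster overall: set to 0 by convention.
            widths.append(0.0)
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


# The last cell sits next to cluster 0 but is labelled as cluster 1 (hypothetical data).
cells = [(0, 0), (0, 1), (10, 10), (10, 11), (0.5, 0.5)]
labels = [0, 0, 1, 1, 1]
widths = silhouette_widths(cells, labels)
```

Here the mislabelled last cell gets a negative width, while the first cell, which sits well inside its own group, gets a positive one.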

Graph-based clustering

Nearest-Neighbour (NN) graph:

cells as nodes
their similarity as edges

Aim: identify ‘communities’ of cells within the network

In a NN graph two nodes (cells), say X and Y, are connected by an edge:

if the distance between them is amongst:

the k smallest distances from X to other cells (‘KNN’),
and also the k smallest distances from Y to other cells for shared-NN (‘SNN’).

Clusters are identified using metrics related to the number of neighbours (‘connections’) to find groups of highly interconnected cells.
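
A minimal sketch of the graph construction, assuming Euclidean distances in the reduced space. The function name `knn_graph` and its `mutual` flag are hypothetical: `mutual=True` applies the stricter rule worded above, where the distance must also be among the k smallest from Y. Dedicated packages typically also weight edges, which is omitted here.

```python
from math import dist  # Python 3.8+


def knn_graph(points, k, mutual=False):
    n = len(points)
    # For each cell, the indices of its k nearest neighbours.
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        neighbours.append(set(others[:k]))
    edges = set()
    for x in range(n):
        for y in neighbours[x]:
            # Keep the edge if Y is among X's k nearest neighbours;
            # with mutual=True, X must also be among Y's k nearest neighbours.
            if not mutual or x in neighbours[y]:
                edges.add((min(x, y), max(x, y)))
    return edges


# Four toy cells along a line (hypothetical values).
cells = [(0, 0), (1, 0), (2.5, 0), (10, 0)]
knn_edges = knn_graph(cells, k=1)
mutual_edges = knn_graph(cells, k=1, mutual=True)
```

The all-pairs sort is only acceptable for a toy example; on real data sets an approximate nearest-neighbour search is normally used instead.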

Graph-based clustering

Example with different numbers of neighbours:

Graph-based clustering

Pros

fast and memory efficient (no distance matrix for all pairs of cells)
no assumptions on the shape of the clusters or the distribution of cells within each cluster
no need to specify a number of clusters to identify

Cons

loss of information beyond neighboring cells, which can affect community detection in regions with many cells.

Modularity

Several methods to detect clusters (‘communities’) in networks rely on the ‘modularity’ metric.

For a given partition of cells into clusters,

modularity measures how separated clusters are from each other,

based on the difference between the observed weight of edges within clusters and the weight expected if edges were placed at random.

For the whole graph, the closer to 1 the better.
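
As a worked example, modularity can be written as the sum over clusters of L_c / m - (d_c / 2m)^2, where m is the total edge weight, L_c the weight of the edges inside cluster c and d_c the summed degree of its nodes. The sketch below assumes an undirected graph given as (node, node, weight) triples; the function name and the toy graph are hypothetical.

```python
def modularity(edges, membership):
    """Modularity of a partition; edges are (u, v, weight), membership[node] is a cluster label."""
    m = sum(w for _, _, w in edges)  # total edge weight
    intra = {}   # weight of edges falling inside each cluster
    degree = {}  # summed (weighted) degree of the nodes of each cluster
    for u, v, w in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            intra[cu] = intra.get(cu, 0.0) + w
    # Observed fraction of intra-cluster weight minus the fraction expected at random.
    return sum(intra.get(c, 0.0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


# Toy graph: two triangles joined by a single edge (hypothetical example).
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]
q_two = modularity(edges, [0, 0, 0, 1, 1, 1])  # one cluster per triangle
q_one = modularity(edges, [0, 0, 0, 0, 0, 0])  # everything in one cluster
```

Splitting the toy graph into its two triangles gives a modularity of 5/14, whereas placing all nodes in a single cluster gives 0.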

Walktrap

The walktrap method relies on short random walks (a few steps) through the network.

These walks tend to be ‘trapped’ in highly-connected regions of the network.

Node similarity is measured based on these walks.

Nodes are first each assigned their own community.
Pairwise distances are computed and the two closest communities are grouped.
These steps are repeated a given number of times to produce a dendrogram,
- i.e. hierarchical clustering is applied to the random-walk distance matrix.
The best partition is the cut of the dendrogram with the highest modularity.
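
To illustrate how node similarity can be derived from short random walks, the sketch below computes, for every node, the probability of reaching each other node in t steps, and then compares these probability profiles between pairs of nodes, down-weighting high-degree destinations. It is a simplified illustration only: the merging step is omitted, and the function name, `t=3` and the toy graph are assumptions.

```python
def walk_distances(edges, n_nodes, t=3):
    """Pairwise node distances based on t-step random walks (no isolated nodes allowed)."""
    adj = [[0.0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    degree = [sum(row) for row in adj]
    # One-step transition probabilities: move to a neighbour chosen uniformly at random.
    P = [[adj[i][j] / degree[i] for j in range(n_nodes)] for i in range(n_nodes)]
    # t-step transition probabilities, by repeated matrix multiplication.
    Pt = P
    for _ in range(t - 1):
        Pt = [[sum(Pt[i][k] * P[k][j] for k in range(n_nodes)) for j in range(n_nodes)]
              for i in range(n_nodes)]

    def r(i, j):
        # Nodes whose walks end up in the same places are close to each other.
        return sum((Pt[i][k] - Pt[j][k]) ** 2 / degree[k] for k in range(n_nodes)) ** 0.5

    return [[r(i, j) for j in range(n_nodes)] for i in range(n_nodes)]


# Toy graph: two triangles joined by a single edge (hypothetical example).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
D = walk_distances(edges, n_nodes=6)
```

In this toy graph, nodes 0 and 1 (same triangle) are closer to each other than nodes 0 and 4 (different triangles), which is what lets the subsequent merging recover the two communities.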

Walktrap

Network example:

Louvain

Hierarchical agglomerative method

Nodes are also first assigned their own community.

Two-step iterations:

nodes are re-assigned one at a time to the community for which they increase modularity the most,
a new, ‘aggregate’ network is built where nodes are the communities formed in the previous step.

This is repeated until modularity stops increasing.
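
The first of the two steps (moving nodes one at a time) can be sketched as below. This is a deliberately naive illustration: modularity is recomputed from scratch for every candidate move, the aggregation step is left out, and the function names and toy graph are assumptions.

```python
def modularity(edges, membership):
    # Same definition as used throughout: observed minus expected intra-cluster weight.
    m = sum(w for _, _, w in edges)
    intra, degree = {}, {}
    for u, v, w in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            intra[cu] = intra.get(cu, 0.0) + w
    return sum(intra.get(c, 0.0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


def local_moving(edges, n_nodes):
    membership = list(range(n_nodes))  # each node starts in its own community
    neighbours = {i: set() for i in range(n_nodes)}
    for u, v, _ in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    improved = True
    while improved:
        improved = False
        for node in range(n_nodes):
            best_c, best_q = membership[node], modularity(edges, membership)
            # Try the communities of the node's neighbours.
            for c in {membership[nb] for nb in neighbours[node]}:
                trial = membership.copy()
                trial[node] = c
                q = modularity(edges, trial)
                if q > best_q + 1e-12:  # only accept strict improvements
                    best_c, best_q = c, q
            if best_c != membership[node]:
                membership[node] = best_c
                improved = True
    return membership


# Toy graph: two triangles joined by a single edge (hypothetical example).
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]
communities = local_moving(edges, n_nodes=6)
```

On this toy graph the moving step alone already recovers the two triangles; on larger graphs the aggregation step is what allows whole communities to be merged in later iterations.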

(Blondel et al, Fast unfolding of communities in large networks)

(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)

Leiden

Issue with the Louvain method: some communities may be disconnected:

(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)

Leiden

The Leiden method improves on the Louvain method

by guaranteeing that at each iteration clusters are connected and well-separated.

The method includes an extra step in the iterations:

after nodes are moved (step 1),
the resulting partition is refined (step 2),
and only then is the new aggregate network built, from the refined partition (step 3).

(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)

Cluster-wise modularity to assess clusters quality

Clusters that are well separated mostly comprise intra-cluster edges: in the cluster-by-cluster matrix of modularity scores, they have a high score on the diagonal and low scores off the diagonal.

Two poorly separated clusters will share many edges, and that pair will have a high off-diagonal score.
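
One way to obtain such a cluster-by-cluster matrix is to take, for each pair of clusters, the ratio of the observed edge weight to the weight expected if edges were placed at random while preserving node degrees. The sketch below is an illustration under that assumption, not the implementation of any particular package; the function name and toy graph are hypothetical.

```python
def pairwise_modularity_ratio(edges, membership):
    """Observed / expected edge weight for every pair of clusters (edges are (u, v, weight))."""
    clusters = sorted(set(membership))
    index = {c: i for i, c in enumerate(clusters)}
    n = len(clusters)
    m = sum(w for _, _, w in edges)  # total edge weight
    degree = [0.0] * n
    observed = [[0.0] * n for _ in range(n)]
    for u, v, w in edges:
        a, b = index[membership[u]], index[membership[v]]
        degree[a] += w
        degree[b] += w
        observed[a][b] += w
        if a != b:
            observed[b][a] += w  # keep the matrix symmetric
    ratio = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            # Expected weight under random placement of edges with degrees preserved.
            if a == b:
                expected = degree[a] ** 2 / (4 * m)
            else:
                expected = degree[a] * degree[b] / (2 * m)
            ratio[a][b] = observed[a][b] / expected if expected > 0 else 0.0
    return ratio


# Toy graph: two triangles joined by a single edge (hypothetical example).
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]
ratio = pairwise_modularity_ratio(edges, [0, 0, 0, 1, 1, 1])
```

With one cluster per triangle, the diagonal entries are above 1 (more intra-cluster edges than expected) and the off-diagonal entries are below 1, the pattern expected for well-separated clusters.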

Recap

hierarchical and k-means methods are fast for small data sets
graph-based methods are better suited for large data sets and cluster detection