Motivation
Initial methods
Graph-based methods
- walktrap
- louvain
- leiden
CRUK bioinformatics summer school - July 2021
The data has been QCed and normalized, confounders removed, noise limited, dimensionality reduced.
We can now ask biological questions.
de novo discovery and annotation of cell-types based on transcription profiles
unsupervised clustering:
We will introduce three widely used clustering methods: hierarchical clustering, k-means clustering and graph-based clustering.
The first two are older and are fast for small data sets.
The third is more recent and better suited to scRNA-seq, especially large data sets.
All three identify non-overlapping clusters.
Hierarchical clustering builds:
There are two types of strategies:
The raw data:
The hierarchical clustering dendrogram:
Example: the Caron data set:
Pros:
Cons:
Goal: partition cells into k different clusters.
In an iterative manner,
Aim:
Pros:
Cons:
Steps:
=> assign new centroids and repeat steps above
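The iteration above can be sketched in plain Python. This is a minimal illustration of the standard k-means loop (Lloyd's algorithm), not the implementation used in the course; the function name `kmeans`, the toy 2-D points and the choice of starting centroids are assumptions made for the example.

```python
import math

def kmeans(points, init_centroids, max_iter=100):
    """Alternate between assigning points and updating centroids."""
    centroids = [tuple(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop when the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Toy data (hypothetical): two well-separated groups of three "cells".
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, init_centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the result depends on the starting centroids, which is why implementations usually try several random starts and keep the best one.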
Congruence of clusters may be assessed by computing the silhouette width for each cell.
The larger the value, the closer the cell is to cells in its own cluster than to cells in other clusters.
Cells that are closer to cells in another cluster have a negative value.
Good cluster separation is indicated by clusters whose cells have large silhouette values.
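As a rough sketch of how the silhouette is computed, the snippet below uses the usual definition s = (b - a) / max(a, b), where a is the mean distance from a cell to the other cells in its cluster and b is the smallest mean distance to the cells of any other cluster. The function name `silhouette_widths` and the toy points are assumptions for illustration, and the code expects at least two clusters.

```python
import math

def silhouette_widths(points, labels):
    """Return one silhouette width per point (needs at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Convention: a cell alone in its cluster gets a width of 0.
            widths.append(0.0)
            continue
        # a: mean distance to the other cells of the same cluster
        # (the distance from p to itself is 0, hence the len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the closest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        widths.append((b - a) / max(a, b))
    return widths

# Toy data (hypothetical): two well-separated groups of three "cells".
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_widths(points, [0, 0, 0, 1, 1, 1])
# Same points, but the third cell is assigned to the wrong cluster.
bad = silhouette_widths(points, [0, 0, 1, 1, 1, 1])
```

With the correct labels every width is close to 1; with the third cell mislabelled, its width becomes negative.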
Nearest-Neighbour (NN) graph:
Aim: identify ‘communities’ of cells within the network
In a NN graph two nodes (cells), say X and Y, are connected by an edge:
if the distance between them is amongst:
Clusters are identified using metrics related to the number of neighbours (‘connections’) to find groups of highly interconnected cells.
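To make the graph construction concrete, here is a minimal sketch of a k-nearest-neighbour graph. It assumes the simplest rule: X and Y are connected if Y is among the k cells closest to X, or X is among the k cells closest to Y. The function name `knn_graph`, the use of Euclidean distance and the toy points are assumptions for illustration.

```python
import math

def knn_graph(points, k):
    """Return the set of undirected edges (i, j), with i < j, of a kNN graph."""
    edges = set()
    for i, p in enumerate(points):
        # Rank all other points by distance to point i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        # Connect point i to its k nearest neighbours.
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

# Toy data (hypothetical): two well-separated groups of three "cells".
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges_k2 = knn_graph(points, k=2)
edges_k3 = knn_graph(points, k=3)
```

With k = 2 the graph splits into two disconnected triangles; with k = 3 each cell must also reach into the other group, so the two groups become linked. This is the effect of the number of neighbours on the resolution of the clustering.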
Example with different numbers of neighbours:
Pros
Cons
Several methods to detect clusters (‘communities’) in networks rely on the ‘modularity’ metric.
For a given partition of cells into clusters,
modularity measures how separated clusters are from each other,
based on the difference between the observed and expected weight of edges between nodes.
For the whole graph, the closer to 1 the better.
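For an unweighted graph, one common way to write modularity is Q = sum over clusters c of [ e_c / m - (d_c / 2m)^2 ], where m is the total number of edges, e_c the number of edges inside cluster c and d_c the sum of the degrees of its nodes. The sketch below computes this; the function name `modularity` and the toy graph are assumptions for illustration, and edge weights are ignored.

```python
def modularity(edges, labels):
    """Modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    intra = {}   # e_c: number of edges with both ends in cluster c
    degree = {}  # d_c: sum of node degrees in cluster c
    for u, v in edges:
        degree[labels[u]] = degree.get(labels[u], 0) + 1
        degree[labels[v]] = degree.get(labels[v], 0) + 1
        if labels[u] == labels[v]:
            intra[labels[u]] = intra.get(labels[u], 0) + 1
    # Observed fraction of intra-cluster edges minus the fraction expected
    # if edges were placed at random while keeping node degrees.
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q_split = modularity(edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
q_single = modularity(edges, {n: "a" for n in range(6)})
```

Splitting the graph into its two triangles gives Q = 5/14 (about 0.36), whereas putting every node in one cluster gives Q = 0.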
The walktrap method relies on short random walks (a few steps) through the network.
These walks tend to be ‘trapped’ in highly-connected regions of the network.
Node similarity is measured based on these walks.
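The idea of a walk-based similarity can be sketched as follows. The distance used here is the one described in the original walktrap paper (Pons and Latapy): two nodes are close if random walks of a few steps starting from each of them end up in the same places, with each destination weighted by the inverse of its degree. The function names and the toy graph are assumptions for illustration, and only the distance is shown, not the merging of communities.

```python
import math

def walk_distribution(adj, start, steps):
    """Probability of being at each node after `steps` random steps from `start`."""
    probs = {node: 0.0 for node in adj}
    probs[start] = 1.0
    for _ in range(steps):
        nxt = {node: 0.0 for node in adj}
        for node, p in probs.items():
            if p == 0.0:
                continue
            # The walker moves to a neighbour chosen uniformly at random.
            share = p / len(adj[node])
            for nb in adj[node]:
                nxt[nb] += share
        probs = nxt
    return probs

def walk_distance(adj, a, b, steps=3):
    """Degree-weighted distance between the walk distributions of a and b."""
    pa = walk_distribution(adj, a, steps)
    pb = walk_distribution(adj, b, steps)
    return math.sqrt(sum((pa[n] - pb[n]) ** 2 / len(adj[n]) for n in adj))

# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
same_side = walk_distance(adj, 0, 1)
across = walk_distance(adj, 0, 5)
```

Nodes 0 and 1 sit in the same triangle and get a small distance, while nodes 0 and 5 sit in different triangles and get a larger one, because walks starting from them stay mostly on their own side of the graph.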
Hierarchical agglomerative method
Each node is also first assigned to its own community.
Two-step iterations:
This is repeated until modularity stops increasing.
(Blondel et al, Fast unfolding of communities in large networks)
(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)
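A simplified sketch of the first of the two steps, the 'local moving' of nodes, is given below. Each node is moved to the neighbouring community that increases modularity the most, and passes over the nodes are repeated until no move helps. This is an illustration only: it recomputes modularity from scratch for every candidate move, ignores edge weights and omits the second step (aggregating communities into a coarser network). All function names and the toy graph are assumptions.

```python
def modularity(edges, labels):
    """Modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    intra, degree = {}, {}
    for u, v in edges:
        degree[labels[u]] = degree.get(labels[u], 0) + 1
        degree[labels[v]] = degree.get(labels[v], 0) + 1
        if labels[u] == labels[v]:
            intra[labels[u]] = intra.get(labels[u], 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())

def local_moving(edges, nodes):
    """Greedy local moving phase, starting from one community per node."""
    labels = {n: n for n in nodes}
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    best_q = modularity(edges, labels)
    improved = True
    while improved:
        improved = False
        for n in nodes:
            original = labels[n]
            best_label = original
            # Try each community that contains a neighbour of n.
            for cand in sorted({labels[nb] for nb in neighbours[n]}):
                if cand == original:
                    continue
                labels[n] = cand
                q = modularity(edges, labels)
                # Keep the move only if modularity strictly increases.
                if q > best_q + 1e-12:
                    best_q, best_label = q, cand
            labels[n] = best_label
            if best_label != original:
                improved = True
    return labels, best_q

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
labels, q = local_moving(edges, nodes=list(range(6)))
```

On this toy graph the procedure recovers the two triangles as communities. Real implementations compute the change in modularity of each move incrementally rather than recomputing the full score.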
Issue with the Louvain method: some communities may be disconnected:
(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)
The Leiden method improves on the Louvain method
by guaranteeing that at each iteration clusters are connected and well-separated.
The method includes an extra step in the iterations:
(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)
In a cluster-by-cluster matrix of modularity scores, clusters that are well separated mostly comprise intra-cluster edges, so they have a high score on the diagonal and low scores off the diagonal.
Two poorly separated clusters will share edges, so the pair will have a high off-diagonal score.
hierarchical and k-means methods are fast for small data sets
graph-based methods are better suited for large data sets and cluster detection