January 2023
The data has been QC’d, normalized, and batch corrected.
We can now start to understand the dataset by identifying cell types. This involves two steps:
unsupervised clustering: identification of groups of cells based on the similarity of their transcriptomes, without any prior knowledge of the labels, usually using the PCA output
annotation of cell-types based on transcription profiles
Pros
Cons
The steps involved:
Identify edges between nodes (cells) to generate a graph
Weight the edges with a similarity score
Identify clusters/communities in the weighted graph
Nearest-Neighbour (NN) graph:
In an NN graph, two nodes (cells), say A and B, are connected by an edge if:
or
Once edges have been defined, they can be weighted. By default the weights are calculated using the ‘rank’ method, which is based on the highest-ranked neighbour that the two cells share.
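As a rough illustration of the weighting step, the sketch below builds a small shared nearest-neighbour graph in plain Python. It assumes the ‘rank’ scheme as described in the documentation of scran's buildSNNGraph: the weight is k - r/2, where r is the smallest sum of ranks of any neighbour shared by the two cells, and each cell has rank 0 in its own neighbour set. The function names, the toy coordinates and the rule that two cells are joined when they share at least one neighbour are assumptions made for illustration, not taken from the course material.

```python
from math import dist


def knn_ranks(points, k):
    """For each point, map neighbour index -> rank (the point itself has rank 0)."""
    ranks = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))
        r = {i: 0}
        for rank, j in enumerate(others[:k], start=1):
            r[j] = rank
        ranks.append(r)
    return ranks


def snn_rank_graph(points, k):
    """Return {(a, b): weight} for every pair of points sharing a neighbour.

    Assumed weighting: k - r/2, where r is the smallest sum of ranks
    over the neighbours shared by a and b.
    """
    ranks = knn_ranks(points, k)
    edges = {}
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            shared = ranks[a].keys() & ranks[b].keys()
            if shared:
                r = min(ranks[a][s] + ranks[b][s] for s in shared)
                edges[(a, b)] = k - r / 2
    return edges


# Toy data: two well-separated groups of three "cells" in 2D (hypothetical values).
cells = [(0, 0), (0, 1), (2, 0), (10, 10), (10, 11), (12, 10)]
edges = snn_rank_graph(cells, k=2)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

With k = 2, cells 0 and 1 are each other's nearest neighbour, so r = 1 and the edge weight is 1.5. No edge links the two groups because their neighbour sets do not overlap.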
Example with different numbers of neighbours (k):
Here we will address three community detection algorithms: Walktrap, Louvain and Leiden.
Modularity
These methods rely on the ‘modularity’ metric to determine a good clustering.
For a given partition of cells into clusters, modularity measures how well separated the clusters are from each other. It is based on the difference between the observed and the expected (i.e. random) weight of edges within and between clusters. For the whole graph, the closer the modularity is to 1, the better the separation.
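To make the observed-versus-expected idea concrete, here is a minimal sketch that computes modularity for a weighted graph in plain Python. It assumes the standard Newman formulation, Q = sum over clusters of (within-cluster edge weight / m) - (total degree of the cluster / 2m)^2, where m is the total edge weight. The function name and the toy graph are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Modularity of a partition.

    edges: {(a, b): weight} for an undirected graph, each edge listed once.
    membership: {node: cluster label}.
    """
    m = sum(edges.values())          # total edge weight
    within = defaultdict(float)      # observed weight inside each cluster
    degree = defaultdict(float)      # summed weighted degree of each cluster
    for (a, b), w in edges.items():
        degree[membership[a]] += w
        degree[membership[b]] += w
        if membership[a] == membership[b]:
            within[membership[a]] += w
    # Observed fraction of within-cluster weight minus the fraction expected at random.
    return sum(within[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single edge (2, 3), all weights 1.
toy_edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1,
             (3, 4): 1, (3, 5): 1, (4, 5): 1,
             (2, 3): 1}
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {node: "a" for node in range(6)}

print(modularity(toy_edges, two_clusters))
print(modularity(toy_edges, one_cluster))
```

Splitting the toy graph into its two triangles gives a modularity of 5/14 (about 0.36), whereas putting every node in a single cluster gives 0, since the observed within-cluster weight then equals the expected weight.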
Walktrap
The walktrap method relies on short random walks (a few steps) through the network. These walks tend to be ‘trapped’ in highly-connected regions of the network. Node similarity is measured based on these walks.
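The following sketch illustrates how short random walks can yield a node similarity. It assumes the distance used in the Walktrap paper by Pons and Latapy: two nodes are close when the probability distributions of where a t-step walk ends up, starting from each of them, are similar, with each term scaled by node degree. The function names, the toy graph and the default walk length are illustrative assumptions, and every node is assumed to have at least one edge.

```python
from math import sqrt


def walk_probabilities(adj, t):
    """Return (P^t, degrees), where P is the random-walk transition matrix of adj."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    step = [[adj[i][j] / degree[i] for j in range(n)] for i in range(n)]
    probs = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        probs = [[sum(probs[i][k] * step[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return probs, degree


def walk_distance(adj, a, b, t=4):
    """Degree-scaled distance between the t-step walk distributions of a and b."""
    probs, degree = walk_probabilities(adj, t)
    return sqrt(sum((probs[a][k] - probs[b][k]) ** 2 / degree[k]
                    for k in range(len(adj))))


# Toy graph: two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2-3.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(walk_distance(adjacency, 0, 1))  # same triangle
print(walk_distance(adjacency, 0, 5))  # different triangles
```

Walks starting from nodes 0 and 1 stay mostly within the same triangle, so their distance is smaller than the distance between nodes 0 and 5, which sit on opposite sides of the bridge.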
Louvain
Each node is also first assigned to its own community.
Two-step iterations:
This is repeated until modularity stops increasing.
(Blondel et al, Fast unfolding of communities in large networks)
Leiden
(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)
There is an issue with the Louvain method: some communities may become disconnected.
The Leiden method improves on the Louvain method by guaranteeing that, at each iteration, clusters are connected and well-separated. The partitioning is refined (step 2) before the aggregate network is made.
Silhouette width is an alternative to modularity for determining how well clustered the cells are. For each cell it is calculated as:
((mean distance to cells in next closest cluster) - (mean distance to other cells in same cluster)) / the larger of those two means
Cells with a large positive width are close to cells in their cluster, while cells with a negative silhouette width are closer to cells of another cluster.
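The calculation above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes Euclidean distance between cells and assigns a width of 0 to a cell that is alone in its cluster; the function name and the toy coordinates are hypothetical.

```python
from math import dist


def silhouette_widths(points, labels):
    """Silhouette width of each point given its cluster label."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            widths.append(0.0)  # assumed convention for singletons / a single cluster
            continue
        # Mean distance to the other cells in the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # Mean distance to the cells of the next closest cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        largest = max(a, b)
        widths.append((b - a) / largest if largest > 0 else 0.0)
    return widths


# Toy data: the third cell sits next to cluster "A" but is labelled "B".
cells = [(0, 0), (0, 1), (1, 0), (5, 0), (5, 1), (6, 0)]
labels = ["A", "A", "B", "B", "B", "B"]
for cell, lab, width in zip(cells, labels, silhouette_widths(cells, labels)):
    print(cell, lab, round(width, 3))
```

In this toy example the cell at (1, 0) gets a negative width because it is closer to the cells of cluster "A" than to the other cells carrying its own label, while the cells at (0, 0) and (0, 1) get positive widths.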
Clustering, like a microscope, is a tool to explore the data.
We can zoom in and out by changing the resolution of the clustering parameters, and experiment with different clustering algorithms to obtain alternative perspectives on the data.
Asking for an unqualified “best” clustering is akin to asking for the best magnification on a microscope.
A more relevant question is “how well do the clusters approximate the cell types or states of interest?”. Do you want:
Explore the data, use your biological knowledge!
Image by Les Chatfield from Brighton, England - Fine rotative table Microscope 5, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=32225637