April 2022
The data has been QC’d, normalized and batch corrected.
We can now ask biological questions.
Unsupervised clustering: identification of groups of cells based on the similarity of their transcriptomes, without any prior knowledge of the labels, usually using the PCA output.
De novo discovery and annotation of cell types based on transcription profiles.
Nearest-Neighbour (NN) graph:
In a NN graph two nodes (cells), say X and Y, are connected by an edge if:
or
Once edges have been defined, they can be weighted by various metrics.
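As a rough illustration of graph building and edge weighting, the sketch below finds the k nearest neighbours of each cell and then weights pairs of cells by the overlap of their neighbour sets. The edge rule (at least one shared neighbour) and the Jaccard weight are assumptions chosen for the example, one option among several weighting schemes; the coordinates are made up and the function names `knn_graph` and `snn_edges` are hypothetical.

```python
from math import dist

def knn_graph(points, k):
    """Return {i: set of the k nearest neighbours of point i} (Euclidean)."""
    neighbours = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))
        neighbours[i] = set(others[:k])
    return neighbours

def snn_edges(neighbours):
    """Connect two cells if their neighbourhoods overlap.

    Assumption: the weight is the Jaccard index of the two neighbour sets,
    with each cell counted as a member of its own neighbourhood.
    """
    edges = {}
    nodes = sorted(neighbours)
    for a in nodes:
        for b in nodes:
            if a < b:
                na = neighbours[a] | {a}
                nb = neighbours[b] | {b}
                shared = na & nb
                if shared:
                    edges[(a, b)] = len(shared) / len(na | nb)
    return edges

# Toy "PCA coordinates": two groups of three cells (made-up values).
cells = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
         (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
knn = knn_graph(cells, k=2)
edges = snn_edges(knn)
```

With k=2 on this toy data, edges only appear within each group of three cells; larger values of k would start to connect the two groups.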
Example with different numbers of neighbours:
Pros
Cons
Several methods to detect clusters (‘communities’) in networks rely on the ‘modularity’ metric.
Modularity measures how separated clusters are from each other.
Modularity compares the observed weights of the edges within a cluster with the weights expected if the edges were randomly distributed between all nodes.
For the whole graph, the closer to 1 the better.
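The whole-graph score can be sketched as follows, assuming the usual Newman definition in which, for each cluster, the expected fraction of edge weight is subtracted from the observed fraction and the results are summed. The function name `modularity` and the toy graph (two triangles joined by one edge) are made up for the example.

```python
def modularity(edges, membership):
    """Whole-graph modularity of an undirected weighted graph.

    edges: {(u, v): weight}, each edge listed once.
    membership: {node: cluster label}.
    """
    total = sum(edges.values())   # total edge weight in the graph
    within = {}                   # observed edge weight inside each cluster
    strength = {}                 # summed node degrees per cluster
    for (u, v), w in edges.items():
        cu, cv = membership[u], membership[v]
        strength[cu] = strength.get(cu, 0.0) + w
        strength[cv] = strength.get(cv, 0.0) + w
        if cu == cv:
            within[cu] = within.get(cu, 0.0) + w
    # Observed fraction minus the fraction expected for random edges.
    return sum(within.get(c, 0.0) / total - (s / (2 * total)) ** 2
               for c, s in strength.items())

# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
toy_edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (2, 3): 1.0,
             (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0}
by_triangle = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
mixed = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A", 5: "B"}
q_good = modularity(toy_edges, by_triangle)
q_bad = modularity(toy_edges, mixed)
```

Grouping the cells by triangle gives a positive score (5/14 on this toy graph), whereas the mixed labelling gives a negative one.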
The walktrap method relies on short random walks (a few steps) through the network.
These walks tend to be ‘trapped’ in highly-connected regions of the network.
Node similarity is measured based on these walks.
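One way to turn random walks into a node similarity is sketched below. It assumes the distance used in the original walktrap paper (Pons and Latapy): two nodes are close if walks of a fixed length starting from each of them end up in the same places with similar probabilities. The function names and the toy graph are made up, and the subsequent merging of nodes into communities is not shown.

```python
from math import sqrt

def walk_probabilities(adjacency, steps):
    """Return (P^steps, degree), where P[i][j] is the probability of
    stepping from node i to node j in a single random-walk step."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    step = [[adjacency[i][j] / degree[i] for j in range(n)] for i in range(n)]
    probs = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        probs = [[sum(probs[i][k] * step[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return probs, degree

def walk_distance(probs, degree, a, b):
    """Degree-scaled distance between the walk distributions of a and b."""
    return sqrt(sum((probs[a][k] - probs[b][k]) ** 2 / degree[k]
                    for k in range(len(degree))))

# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
adjacency = [[0, 1, 1, 0, 0, 0],
             [1, 0, 1, 0, 0, 0],
             [1, 1, 0, 1, 0, 0],
             [0, 0, 1, 0, 1, 1],
             [0, 0, 0, 1, 0, 1],
             [0, 0, 0, 1, 1, 0]]
probs, degree = walk_probabilities(adjacency, steps=4)
same_triangle = walk_distance(probs, degree, 0, 1)
across_triangles = walk_distance(probs, degree, 0, 5)
```

On this toy graph, nodes 0 and 1 sit in the same triangle and are much closer than nodes 0 and 5, because four-step walks rarely cross the single bridging edge.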
Network example:
Hierarchical agglomerative method
Nodes are also first assigned their own community.
Two-step iterations:
This is repeated until modularity stops increasing.
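The first of the two steps can be sketched as below, assuming the description in Blondel et al.: each node is moved in turn to the neighbouring community that gives the largest gain in modularity, and passes are repeated until no move helps. The second step (collapsing each community into a single node and starting again) is not shown. For readability the sketch recomputes modularity from scratch for every candidate move, which a real implementation would avoid; function names and the toy graph are made up.

```python
def modularity(edges, membership):
    """Whole-graph modularity (observed minus expected edge fractions)."""
    total = sum(edges.values())
    within, strength = {}, {}
    for (u, v), w in edges.items():
        cu, cv = membership[u], membership[v]
        strength[cu] = strength.get(cu, 0.0) + w
        strength[cv] = strength.get(cv, 0.0) + w
        if cu == cv:
            within[cu] = within.get(cu, 0.0) + w
    return sum(within.get(c, 0.0) / total - (s / (2 * total)) ** 2
               for c, s in strength.items())

def local_moving(edges, nodes):
    """Greedy local moving: start with one community per node and keep
    moving nodes to neighbouring communities while modularity increases."""
    membership = {n: n for n in nodes}
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    best = modularity(edges, membership)
    improved = True
    while improved:
        improved = False
        for node in nodes:
            current = membership[node]
            best_label = current
            # Candidate communities: those of the node's neighbours.
            for label in {membership[x] for x in neighbours[node]}:
                membership[node] = label
                q = modularity(edges, membership)
                if q > best + 1e-12:
                    best, best_label = q, label
            membership[node] = best_label
            if best_label != current:
                improved = True
    return membership, best

# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
toy_edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (2, 3): 1.0,
             (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0}
communities, score = local_moving(toy_edges, list(range(6)))
```

On this toy graph the local moving step alone recovers the two triangles as communities.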
(Blondel et al, Fast unfolding of communities in large networks)
(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)
Issue with the Louvain method: some communities may be disconnected:
(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)
The Leiden method improves on the Louvain method by guaranteeing that at each iteration clusters are connected and well-separated.
The method includes an extra step in the iterations:
(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)
Congruence of clusters may be assessed by computing the silhouette width for each cell.
For each cell in a cluster, calculate the average distance to all other cells in the cluster and the average distance to all cells not in the cluster. The cell's silhouette width is the difference between these (the second minus the first) divided by the maximum of the two values.
Cells with a large silhouette width are strongly related to the other cells in their cluster; cells with a negative silhouette width are more closely related to cells in other clusters.
Good cluster separation is indicated by clusters whose cells have large silhouette values.
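The calculation described above can be sketched as follows. It follows the wording used here (average distance to all cells outside the cluster); many implementations instead use the average distance to the nearest other cluster, and the two agree when there are only two clusters. The sketch assumes Euclidean distances, at least two clusters and at least two cells per cluster; the coordinates and the function name `silhouette_widths` are made up.

```python
from math import dist

def silhouette_widths(points, labels):
    """Silhouette width of each cell: (b - a) / max(a, b), where a is the
    mean distance to the other cells of its own cluster and b is the mean
    distance to all cells outside that cluster."""
    widths = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        other = [dist(p, q) for j, q in enumerate(points)
                 if labels[j] != labels[i]]
        a = sum(own) / len(own)
        b = sum(other) / len(other)
        widths.append((b - a) / max(a, b))
    return widths

# Toy coordinates: two well-separated groups of three cells.
cells = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
         (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = silhouette_widths(cells, [0, 0, 0, 1, 1, 1])
# Same cells, but the third cell is assigned to the wrong cluster.
bad = silhouette_widths(cells, [0, 0, 1, 1, 1, 1])
```

With the correct labels every cell has a width close to 1; the deliberately mislabelled cell gets a negative width.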
When modularity scores are computed for every pair of clusters and arranged in a matrix, clusters that are well separated mostly comprise intra-cluster edges and show a high score on the diagonal and low scores off the diagonal.
Two poorly separated clusters will share many edges, and the pair will have a high off-diagonal score.
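A minimal sketch of such a cluster-by-cluster matrix is given below. It assumes each entry is the ratio of the observed edge weight between two clusters to the weight expected under a null model where the weight between two nodes is proportional to the product of their degrees; this is an illustrative choice, not necessarily the exact calculation used by any particular package. The function name `pairwise_modularity_ratio` and the toy graph are made up.

```python
def pairwise_modularity_ratio(edges, membership):
    """Return {(a, b): observed / expected edge weight} for cluster pairs.

    edges: {(u, v): weight}, each undirected edge listed once.
    membership: {node: cluster label}.
    """
    labels = sorted(set(membership.values()))
    total = sum(edges.values())
    strength = {c: 0.0 for c in labels}   # summed node degrees per cluster
    observed = {(a, b): 0.0 for a in labels for b in labels}
    for (u, v), w in edges.items():
        cu, cv = membership[u], membership[v]
        strength[cu] += w
        strength[cv] += w
        observed[(cu, cv)] += w
        if cu != cv:
            observed[(cv, cu)] += w       # keep the matrix symmetric
    ratio = {}
    for a in labels:
        for b in labels:
            expected = strength[a] * strength[b] / (2 * total)
            if a == b:
                expected /= 2             # each within-cluster pair counted once
            ratio[(a, b)] = observed[(a, b)] / expected
    return ratio

# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
toy_edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (2, 3): 1.0,
             (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0}
by_triangle = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
ratios = pairwise_modularity_ratio(toy_edges, by_triangle)
```

For this toy graph the diagonal entries are above 1 (more within-cluster edges than expected) and the off-diagonal entry is well below 1, the pattern expected for well-separated clusters.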