April 2022
The data have been QC’d, normalized and batch corrected.
We can now ask biological questions.
Unsupervised clustering: identification of groups of cells based on the similarity of their transcriptomes, without any prior knowledge of the labels, usually using the PCA output.
de novo discovery and annotation of cell-types based on transcription profiles
Nearest-Neighbour (NN) graph:
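In a NN graph two nodes (cells), say X and Y, are connected by an edge if:
Y is among the k nearest neighbours of X (k-nearest-neighbour graph, KNN)
or
X and Y share at least one of their nearest neighbours (shared-nearest-neighbour graph, SNN).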
Once edges have been defined, they can be weighted by various metrics.
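The graph construction can be sketched in a few lines of plain Python. This is a toy illustration, not the implementation used by any particular package: the function names `knn` and `snn_edges` are hypothetical, each cell is assumed to count as its own neighbour when neighbour sets are compared, and edges are weighted by the number of shared neighbours, which is only one of several possible weighting schemes.

```python
import math

def knn(points, k):
    """For each point, return the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance to point i.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append({j for _, j in dists[:k]})
    return neighbours

def snn_edges(neighbours):
    """Connect two cells if their neighbour sets overlap.

    Assumption: each cell is included in its own neighbour set, and the
    edge weight is simply the number of shared neighbours.
    """
    edges = {}
    n = len(neighbours)
    for i in range(n):
        for j in range(i + 1, n):
            shared = len((neighbours[i] | {i}) & (neighbours[j] | {j}))
            if shared > 0:
                edges[(i, j)] = shared
    return edges

# Toy data: two well-separated groups of three "cells" in a 2-D PCA space.
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
nn = knn(cells, k=2)
edges = snn_edges(nn)
print(edges)
```

With k = 2 the two groups end up as two disconnected triangles. Increasing k adds edges between the groups and makes the graph denser.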
Example with different numbers of neighbours:
Pros
Cons
Several methods to detect clusters (‘communities’) in networks rely on the ‘modularity’ metric.
Modularity measures how separated clusters are from each other.
Modularity compares the observed weight of the edges within a cluster with the weight expected if the edges were randomly distributed between all nodes; the score is the (scaled) difference between the two.
For the whole graph, the closer to 1 the better.
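A minimal sketch of the whole-graph modularity, assuming the standard Newman definition (for each cluster, the fraction of edge weight inside it minus the fraction expected by chance, summed over clusters). The function name `modularity` and the toy graph are illustrative only.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Modularity of a partition of an undirected weighted graph.

    edges: {(i, j): weight}, no self-loops; membership: {node: cluster label}.
    """
    total = sum(edges.values())          # total edge weight m
    within = defaultdict(float)          # edge weight inside each cluster
    degree = defaultdict(float)          # summed node degrees per cluster
    for (i, j), w in edges.items():
        degree[membership[i]] += w
        degree[membership[j]] += w
        if membership[i] == membership[j]:
            within[membership[i]] += w
    # Observed fraction of within-cluster weight minus the expected fraction.
    return sum(within[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)

# Toy graph: two triangles joined by a single edge (2, 3), all weights 1.
edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (3, 4): 1, (3, 5): 1, (4, 5): 1, (2, 3): 1}
good = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
mixed = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A", 5: "B"}
single = {n: "A" for n in range(6)}

q_good = modularity(edges, good)      # 5/14, about 0.357
q_mixed = modularity(edges, mixed)    # negative: worse than random
q_single = modularity(edges, single)  # 0.0: one cluster holding everything
print(q_good, q_mixed, q_single)
```

The partition that follows the two triangles scores highest, while a partition that cuts across them scores below zero.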
The walktrap method relies on short random walks (a few steps) through the network.
These walks tend to be ‘trapped’ in highly-connected regions of the network.
Node similarity is measured based on these walks.
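The idea can be illustrated with a small sketch, assuming the distance proposed in the original walktrap paper (Pons and Latapy): two nodes are close if random walks of a fixed length starting from them reach the other nodes with similar probabilities. Only the distance is computed here, not the merging of nodes into communities, and the helper names are hypothetical.

```python
import math

def transition_matrix(adj):
    """Row-normalise the adjacency matrix: P[i][j] = prob. of stepping i -> j."""
    return [[w / sum(row) for w in row] for row in adj]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def walk_probabilities(adj, steps):
    """Probabilities of being at each node after a walk of `steps` steps."""
    p = transition_matrix(adj)
    result = p
    for _ in range(steps - 1):
        result = mat_mul(result, p)
    return result

def walk_distance(adj, steps, i, j):
    """Degree-scaled distance between the walk distributions of nodes i and j."""
    probs = walk_probabilities(adj, steps)
    degrees = [sum(row) for row in adj]
    return math.sqrt(sum((probs[i][k] - probs[j][k]) ** 2 / degrees[k]
                         for k in range(len(adj))))

# Toy graph: two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2-3.
edge_list = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
adj = [[0.0] * n for _ in range(n)]
for i, j in edge_list:
    adj[i][j] = adj[j][i] = 1.0

same = walk_distance(adj, 3, 0, 1)   # nodes in the same triangle
other = walk_distance(adj, 3, 0, 5)  # nodes in different triangles
print(same, other)
```

Walks starting in the same triangle stay there for most of their three steps, so nodes 0 and 1 are much closer to each other than nodes 0 and 5.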
Network example:
Hierarchical agglomerative method
Nodes are also first assigned their own community.
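Two-step iterations: (1) nodes are re-assigned one at a time to the neighbouring community for which modularity increases the most; (2) a new, aggregate network is built in which each node is a community formed in the previous step.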
This is repeated until modularity stops increasing.
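The node-moving part of this procedure can be sketched as below. This is a deliberately simplified illustration: modularity is recomputed from scratch for every candidate move (real implementations compute the gain incrementally), the aggregation of communities into a smaller network is not shown, and the function names are hypothetical.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Standard modularity of a partition; edges is {(i, j): weight}."""
    total = sum(edges.values())
    within, degree = defaultdict(float), defaultdict(float)
    for (i, j), w in edges.items():
        degree[membership[i]] += w
        degree[membership[j]] += w
        if membership[i] == membership[j]:
            within[membership[i]] += w
    return sum(within[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)

def local_moving(edges):
    """Greedily move nodes between neighbouring communities to raise modularity."""
    nodes = sorted({n for edge in edges for n in edge})
    neighbours = defaultdict(set)
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    membership = {n: n for n in nodes}   # every node starts in its own community
    best_q = modularity(edges, membership)
    improved = True
    while improved:                      # stop when a full pass changes nothing
        improved = False
        for node in nodes:
            original = membership[node]
            best_label = original
            candidates = {membership[m] for m in neighbours[node]} - {original}
            for label in sorted(candidates):
                membership[node] = label
                q = modularity(edges, membership)
                if q > best_q + 1e-12:   # keep only strict improvements
                    best_q, best_label = q, label
            membership[node] = best_label
            if best_label != original:
                improved = True
    return membership, best_q

# Toy graph: two triangles joined by a single edge.
edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (3, 4): 1, (3, 5): 1, (4, 5): 1, (2, 3): 1}
membership, q = local_moving(edges)
print(membership, q)
```

On this toy graph the greedy moves recover the two triangles as communities, at which point no single move increases modularity any further.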
(Blondel et al, Fast unfolding of communities in large networks)
(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)
Issue with the Louvain method: some communities may be disconnected:
(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)
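Whether a community is internally connected can be checked with a breadth-first search restricted to the edges inside that community. The sketch below is illustrative only: `is_connected` is a hypothetical helper, and the toy path graph mimics the situation where a bridging node has been moved to another community.

```python
from collections import deque

def is_connected(community, edges):
    """True if the nodes of `community` form one connected piece
    when only the edges between community members are kept."""
    community = set(community)
    if not community:
        return True
    neighbours = {n: set() for n in community}
    for i, j in edges:
        if i in community and j in community:
            neighbours[i].add(j)
            neighbours[j].add(i)
    start = next(iter(community))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for m in neighbours[node] - seen:
            seen.add(m)
            queue.append(m)
    return seen == community

# Path graph 0 - 1 - 2 - 3 - 4: node 2 is the only bridge between the two ends.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
intact = is_connected({0, 1, 2, 3, 4}, edges)
broken = is_connected({0, 1, 3, 4}, edges)   # node 2 was moved elsewhere
print(intact, broken)
```

Once the bridging node leaves, the remaining members still carry the same community label but can no longer reach each other through the community.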
The Leiden method improves on the Louvain method by guaranteeing that at each iteration clusters are connected and well-separated.
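The method includes an extra step in the iterations: after nodes have been moved between communities, the resulting partition is refined, and only then is the aggregate network built.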
(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)
Congruence of clusters may be assessed by computing the silhouette width for each cell.
For each cell in the cluster, calculate the average distance to all other cells in the cluster and the average distance to all cells in the closest other cluster. The cell’s silhouette width is the difference between these two values (other cluster minus own cluster) divided by the larger of the two.
Cells with a large positive silhouette width are strongly related to the cells in their own cluster; cells with a negative silhouette width are more closely related to another cluster.
Good cluster separation is indicated by clusters whose cells have large silhouette values.
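A minimal sketch of the silhouette width, assuming Euclidean distances and the usual convention of comparing each cell's own cluster with the closest other cluster. The function name and toy coordinates are illustrative; on real data an approximation is often used, because computing all pairwise distances is expensive.

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width of every point, given one cluster label per point."""
    widths = []
    for i, p in enumerate(points):
        # Collect distances from point i to all other points, per cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if own is None or not by_cluster:
            widths.append(0.0)           # singleton cluster or single cluster
            continue
        a = sum(own) / len(own)          # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # closest other
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths

# Toy data: two tight groups of three cells in a 2-D PCA space.
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
good = silhouette_widths(cells, ["A", "A", "A", "B", "B", "B"])
bad = silhouette_widths(cells, ["A", "A", "B", "B", "B", "B"])
print(good)
print(bad)
```

With the correct labels every cell has a width close to 1. When the third cell is wrongly assigned to cluster B its width becomes strongly negative, flagging it as closer to cluster A.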
In the matrix of pairwise modularity scores, well-separated clusters comprise mostly intra-cluster edges, so they show high scores on the diagonal and low scores off the diagonal.
Two poorly separated clusters share many edges, so the off-diagonal entry for that pair is high.
Our goal is to identify genes that are driving the separation between clusters
Different effect size scores that quantify:
Differences in the mean expression level
Differences in the rank of expression
Differences in the proportion of cells expressing the gene
Each score is calculated pairwise, for every possible pair of clusters.
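For a single gene and a single pair of clusters, the three kinds of score can be sketched as follows. These are textbook versions written for illustration, not the exact formulas of any package: Cohen's d is used for the difference in means (with the classical pooled standard deviation), the AUC for the difference in ranks, and a plain difference for the proportion of expressing cells. The expression values are made up.

```python
import statistics

def cohens_d(x, y):
    """Difference in means scaled by the pooled standard deviation.
    Assumes the pooled variance is not zero."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / pooled_var ** 0.5

def auc(x, y):
    """Probability that a random cell from x has higher expression than a
    random cell from y (ties count as one half)."""
    wins = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    return wins / (len(x) * len(y))

def detection_difference(x, y):
    """Difference in the proportion of cells with non-zero expression."""
    return sum(v > 0 for v in x) / len(x) - sum(v > 0 for v in y) / len(y)

# Hypothetical log-expression values of one gene in two clusters.
cluster_a = [2.0, 3.0, 4.0, 3.0]
cluster_b = [0.0, 1.0, 0.0, 1.0]

d = cohens_d(cluster_a, cluster_b)
a = auc(cluster_a, cluster_b)
p = detection_difference(cluster_a, cluster_b)
print(d, a, p)
```

An AUC of 1 means every cell in the first cluster expresses the gene more highly than every cell in the second, 0.5 means no difference in ranks, and 0 means the reverse.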
For each cluster we will generate the effect size scores between it and every other cluster. To simplify the analysis, a number of summary statistics are generated for each set of scores: the mean, minimum, median and maximum score, and the minimum rank of the gene across the pairwise comparisons.
Understand what we are trying to compare with each score (difference in mean expression, difference in the probability of being expressed, probability of being highly/lowly expressed).
Strictly speaking, testing for genes differentially expressed between clusters is statistically flawed, since the clusters were themselves defined from the same gene expression data. Follow-up validation of these analyses is therefore crucial.
Do not use batch-integrated expression data to calculate marker gene scores. Instead, include batch in the statistical model (the scoreMarkers() function has the block argument to achieve this).
The normalization strategy has a big influence on the differences in expression observed between cells and between clusters.
A lot of what you get might be noise: take two random sets of cells, run DE, and you will probably find a few significant genes with most of the commonly used tests.
It’s important to assess and validate the results. Think of the results as hypotheses that need independent verification (e.g. microscopy, qPCR)