Clustering and Marker Genes

April 2022

Single Cell RNAseq Analysis Workflow

Motivation

The data has been QC’d and normalized, and batch corrected.

We can now ask biological questions.

unsupervised clustering: identification of groups of cells based on the similarities of the transcriptomes without any prior knowledge of the labels usually using the PCA output
de novo discovery and annotation of cell-types based on transcription profiles

Single Cell RNAseq Analysis Workflow

Graph-based clustering

Nearest-Neighbour (NN) graph:

cells as nodes
their similarity as edges

In a NN graph two nodes (cells), say X and Y, are connected by an edge if:

the distance between them is amongst the k smallest distances from X to other cells, ‘KNN’

the above plus the distance between them is amongst the k smallest distances from X to other cells shared-NN (’SNN).

Once edges have been defined, they can be weighted by various metrics.

Graph-based clustering

Example with different numbers of neighbours:

Graph-based clustering

Pros

fast and memory efficient (no distance matrix for all pairs of cells)
no assumptions on the shape of the clusters or the distribution of cells within each cluster
no need to specify a number of clusters to identify

Cons

loss of information beyond neighboring cells, which can affect community detection in regions with many cells.

Modularity

Several methods to detect clusters (‘communities’) in networks rely on the ‘modularity’ metric.

Modularity measures how separated clusters are from each other.

Modularity is a ratio between the observed weights of the edges within a cluster versus the expected weights if the edges were randomly distributed between all nodes.

For the whole graph, the closer to 1 the better.

Walktrap

The walktrap method relies on short random walks (a few steps) through the network.

These walks tend to be ‘trapped’ in highly-connected regions of the network.

Node similarity is measured based on these walks.

Nodes are first each assigned their own community.
Pairwise distances are computed and the two closest communities are grouped.
These steps are repeated a given number of times to produce a dendrogram.
- Hierarchical clustering is applied to the distance matrix.
The best partition is that with the highest modularity.

Walktrap

Network example:

Louvain

Hierarchical agglomerative method

Nodes are also first assigned their own community.

Two-step iterations:

nodes are re-assigned one at a time to the community for which they increase modularity the most,
a new, ‘aggregate’ network is built where nodes are the communities formed in the previous step.

This is repeated until modularity stops increasing.

(Blondel et al, Fast unfolding of communities in large networks)

(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)

Leiden

Issue with the Louvain method: some communities may be disconnected:

(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)

Leiden

The Leiden method improves on the Louvain method

by garanteeing that at each iteration clusters are connected and well-separated.

The method includes an extra step in the iterations:

after nodes are moved (step 1),
the resulting partition is refined (step2)
and only then the new aggregate network made, and refined (step 3).

(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)

Separatedness - silhouette width

Congruence of clusters may be assessed by computing the silhouette width for each cell.

For each cell in the cluster calculate the the average distance to all other cells in the cluster and the average distance to all cells not in the cluster. The cells silhouette width is the difference between these divided by the maximum of the two values.

Cells with a large silhouette are strongly related to cells in the cluster, cells with a negative silhouette width are more closely related to other clusters.

Good cluster separation is indicated by clusters whose cells have large silhouette values.

Separatedness - silhouette width

Cluster-wise modularity to assess clusters quality

Clusters that are well separated mostly comprise intra-cluster edges and harbour a high modularity score on the diagonal and low scores off that diagonal.

Two poorly separated clusters will share edges and the pair will have a high score.

Single Cell RNAseq Analysis Workflow

Identifying Cluster Marker Genes

Our goal is to identify genes that are driving the separation between clusters

Different effect size scores that quantify:

Differences in the mean expression level
Differences in the rank of expression
Differences in the proportion of cells expressing the gene

Each is calculated pairwise between each possible combination of clusters

Differences in the mean expression level

“Cohen’s d”

This is the log fold change of mean gene expression that has been standardized by scaling by the average of the standard deviation across the groups.
This can be interpreted in a similar way to log fold change in that a positive value indicates upregulation in the cluster of interest.

Differences in the rank of expression

“Area Under the Curve”

This quantifies the ability to distinguish between two gene expression distributions.
It can be interpreted as the likelihood that any random cell in the cluster of interest will have a higher expression of the gene than any random cell in the other cluster.
It ranges from 0 to 1, where 1 can be interpreted as upregulation, 0 downregulation, and 0.5 as no difference.

Differences in the proportion of cells expressing the gene

log-fold-change detected

This is the log fold change in the proportion of cells in which the gene is detected in the cluster of interest versus the proportion of cells in which the gene is detected in the other cluster.
Positive values indicate that the gene is detected in a more cells in the cluster of interest than the other cluster.
This takes no account of the magnitude of the gene expression.

Summary statistics for pairwise comparisons

For each cluster we will generate the effect size scores between it and every other cluster. In order to simplify our analysis, a number of summary statistics will be generated for each set of scores:

mean.X - this is the mean of the score across all pairwise comparisons. It gives the relative expression of the gene versus the average of the other clusters.
min.X - this is the minimum score obtained across all pairwise comparisons. This is the most stringent summary statistic for detecting upregulated genes, if the score is high, then the gene is upregulated in the cluster of interest relative to all other clusters.
median.X - this is the median of the score across all pairwise comparisons. It is more robust to outliers than the mean.
max.X - this is the maximum score obtained across all pairwise comparisons. This is the least stringent summary statistic for detecting upregulated genes as a high score only indicates that the gene is upregulated in the cluster of interest relative to at least one other clusters.
rank.X - This is the minimum ranking (“min-rank”) of that gene by that score across all clusters.

So, what’s really important?

Understand what are we trying to compare with the different scores (difference in mean expression, difference in probability of being expressed, probability of being highly/lowly expressed)
Strictly speaking, identifying genes differentially expressed between clusters is statistically flawed, since the clusters were themselves defined based on the gene expression data itself. Validation is crucial as a follow-up from these analyses.
Do not use batch-integrated expression data for calculating marker gene scores, instead, include batch in the statistical model (the scoreMarkers() function has the block argument to achieve this)
Normalization strategy has a big influence on the results in differences in expression between cell and between clusters.
A lot of what you get might be noise. Take two random set of cells and run DE and you probably with have a few significant genes with most of the commonly used tests.
It’s important to assess and validate the results. Think of the results as hypotheses that need independent verification (e.g. microscopy, qPCR)