April 2024
The data has been QC’d, normalized, and batch corrected.
We can now start to understand the dataset by identifying cell types. This involves two steps:
Unsupervised clustering: identification of groups of cells based on the similarity of their transcriptomes, without any prior knowledge of the labels, usually using the PCA output.
Annotation of cell types based on their transcription profiles.
Pros
Cons
The steps involved:
Nearest-Neighbour (NN) graph:
In a NN graph two nodes (cells), say A and B, are connected by an edge if:
or
Once edges have been defined, they can be weighted. By default the weights are calculated using the ‘rank’ method which relates to the highest ranking of their shared neighbours.
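A minimal base-R sketch of this weighting may help. It assumes the ‘rank’ weight between two cells is k - r/2, where r is the smallest sum of ranks of any neighbour they share (each cell has rank 0 in its own neighbour list); that is our reading of the scran documentation, not a copy of its implementation. The toy matrix pcs and the helper snn_weight() are hypothetical names used for illustration only.

```r
set.seed(1)
# Toy data: 20 cells x 5 principal components (made-up values)
pcs <- matrix(rnorm(100), nrow = 20)
k <- 5
d <- as.matrix(dist(pcs))

# Row i: cell i itself (rank 0) followed by its k nearest neighbours (ranks 1..k)
nn <- t(apply(d, 1, function(x) order(x)[seq_len(k + 1)]))

# Assumed 'rank' weight: k - r/2, with r the smallest rank sum of a shared neighbour
snn_weight <- function(a, b) {
  shared <- intersect(nn[a, ], nn[b, ])
  if (length(shared) == 0) return(0)   # no shared neighbour: no edge
  r <- min(match(shared, nn[a, ]) + match(shared, nn[b, ]) - 2)
  k - r / 2
}

# Weight matrix for all pairs of cells
cells <- seq_len(nrow(pcs))
W <- outer(cells, cells, Vectorize(snn_weight))
```

Under this definition the weights range from 0 (no shared neighbour, or only a distant one) to k (a cell compared with itself). Increasing k gives each cell more chances to share a neighbour, so the graph becomes denser.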
Example with different numbers of neighbours (k):
What makes a community?
A community is a cohesive subgroup within a network with the following characteristics:
Here we will address three community detection algorithms: Walktrap, Louvain and Leiden.
These methods rely on the modularity metric to determine a good clustering.
For a given partition of cells into clusters, modularity measures how separated clusters are from each other. This is based on the difference between the observed and expected (i.e. random) weight of edges within and between clusters. For the whole graph, the closer to 1 the better.
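To make the observed-versus-expected idea concrete, here is a small base-R sketch of the textbook (Newman) modularity on a toy graph of two triangles joined by a single edge. The function name modularity_score() is hypothetical, and the exact variant used by a given clustering package may differ.

```r
# Toy graph: triangle 1-2-3, triangle 4-5-6, and a bridge between nodes 3 and 4
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
A <- matrix(0, 6, 6)
A[edges] <- 1
A[edges[, 2:1]] <- 1          # make the adjacency matrix symmetric

modularity_score <- function(A, membership) {
  m2 <- sum(A)                           # 2m: each undirected edge is counted twice
  deg <- rowSums(A)
  expected <- outer(deg, deg) / m2       # expected edge weight under random wiring
  same <- outer(membership, membership, "==")
  sum((A - expected)[same]) / m2         # observed minus expected, within clusters only
}

modularity_score(A, c(1, 1, 1, 2, 2, 2))   # 5/14, about 0.357
modularity_score(A, rep(1, 6))             # 0: a single cluster is no better than random
```

Splitting the graph into its two triangles scores higher than lumping all nodes together, which is the behaviour the community detection methods try to exploit.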
The walktrap method relies on short random walks (a few steps) through the network. These walks tend to be ‘trapped’ in highly-connected regions of the network. Node similarity is measured based on these walks.
(Pons and Latapy, Computing communities in large networks using random walks)
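The ‘trapping’ behaviour can be illustrated with a few lines of base R. This is only a sketch of the random-walk idea on a toy graph (two triangles joined by one edge), not the full Walktrap algorithm, which goes on to derive a distance between nodes from these probabilities and merge communities hierarchically.

```r
# Toy graph: triangle 1-2-3, triangle 4-5-6, and a bridge between nodes 3 and 4
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
A <- matrix(0, 6, 6)
A[edges] <- 1
A[edges[, 2:1]] <- 1

# Transition matrix: from each node, move to a neighbour chosen uniformly at random
P <- A / rowSums(A)

# Probability of being at each node after 3 steps
P3 <- P %*% P %*% P

# For a walker starting at node 1:
in_own_triangle   <- sum(P3[1, 1:3])   # about 0.81
in_other_triangle <- sum(P3[1, 4:6])   # about 0.19
```

After three steps a walker that started in the first triangle is still much more likely to be found there, so nodes in the same dense region end up with similar walk profiles.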
Nodes are also first assigned their own community.
Two-step iterations:
This is repeated until modularity stops increasing.
(Blondel et al, Fast unfolding of communities in large networks)
(Traag et al, From Louvain to Leiden: guaranteeing well-connected communities)
There is an issue with the Louvain method - some communities may become disconnected.
The Leiden method improves on the Louvain method by guaranteeing that, at each iteration, clusters are connected and well-separated. The partitioning is refined (step 2) before the aggregate network is made.
Silhouette width provides a measure of how well clustered each cell is. It compares the mean distance between each cell and cells in the same cluster (cohesion) to the mean distance to other cells in the next closest cluster (separation).
\[ \frac{separation - cohesion}{\max(cohesion, separation)} \]
Cells with a large positive width are close to cells in their cluster, while cells with a negative silhouette width are closer to cells of another cluster.
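As a rough illustration, the silhouette width can be computed by hand in base R. The sketch below uses the standard definition (separation minus cohesion, divided by the larger of the two) on made-up one-dimensional data; silhouette_width() is a hypothetical helper, and it assumes every cluster has at least two cells.

```r
silhouette_width <- function(d, cluster) {
  d <- as.matrix(d)
  sapply(seq_along(cluster), function(i) {
    own <- cluster == cluster[i]
    own[i] <- FALSE                        # exclude the cell itself
    cohesion <- mean(d[i, own])            # mean distance to cells in its own cluster
    others <- setdiff(unique(cluster), cluster[i])
    # mean distance to each other cluster; keep the closest one
    separation <- min(sapply(others, function(cl) mean(d[i, cluster == cl])))
    (separation - cohesion) / max(cohesion, separation)
  })
}

# Five cells on a line; the third sits next to cluster 2 but is labelled cluster 1
x <- c(0, 1, 9, 10, 11)
cluster <- c(1, 1, 1, 2, 2)
sil <- silhouette_width(dist(x), cluster)
```

Here the third cell gets a negative width because it is closer to the cells of cluster 2 than to those of its assigned cluster, while the other cells get positive widths.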
Clustering, like a microscope, is a tool to explore the data.
We can zoom in and out by changing the resolution of the clustering parameters, and experiment with different clustering algorithms to obtain alternative perspectives on the data.
Asking for an unqualified “best” clustering is akin to asking for the best magnification on a microscope.
A more relevant question is “how well do the clusters approximate the cell types or states of interest?”. Do you want:
Explore the data, use your biological knowledge!
Image by Les Chatfield from Brighton, England - Fine rotative table Microscope 5, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=32225637
Our goal is to identify genes that are differentially expressed between clusters.
Calculate effect sizes that capture differences in:
These are calculated in pairwise cluster comparisons.
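A hand-rolled sketch of one such pairwise comparison, for a single gene, might look like this in base R. The expression values are made up, and the formulas are textbook versions (Cohen's d with a pooled standard deviation, AUC as the probability that a cell from one cluster has higher expression than a cell from the other, and a plain difference in the proportion of cells with non-zero expression); the exact definitions used by scran may differ, for example it reports the detection difference as a log-fold change.

```r
# Hypothetical log-expression values of one gene in two clusters
a <- c(0, 2, 3, 4)
b <- c(0, 0, 1, 2)

# Cohen's d: difference in means, standardised by the pooled standard deviation
cohen_d <- (mean(a) - mean(b)) / sqrt((var(a) + var(b)) / 2)

# AUC: probability that a random cell from a exceeds a random cell from b
# (ties count as half)
auc <- mean(outer(a, b, ">") + 0.5 * outer(a, b, "=="))   # 0.78125

# Difference in the proportion of cells in which the gene is detected
detected_diff <- mean(a > 0) - mean(b > 0)                # 0.25
```

An AUC of 0.5 means the two clusters are indistinguishable for that gene, while values near 1 or 0 indicate up- or down-regulation respectively.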
scran::scoreMarkers() function
For each cluster, the function computes the effect size scores between it and every other cluster.
scoreMarkers(
  sce,
  groups = sce$louvain15,   # clusters to compare
  block = sce$SampleGroup   # covariates in statistical model
)
Outputs a list of DataFrames with summary statistics for the metrics we just covered (columns named with the suffixes cohen, AUC and detected).
scran::scoreMarkers(): summary statistics
Understand what we are trying to compare with the different scores:
Strictly speaking, identifying genes differentially expressed between clusters is statistically flawed, since the clusters were themselves defined from the same gene expression data. Validation is crucial as a follow-up to these analyses.
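The circularity can be seen in a small simulation. The base-R sketch below clusters pure noise with k-means (used here only as a convenient stand-in for graph-based clustering) and then tests each simulated gene between the resulting clusters; all values are simulated and nothing here comes from a real dataset.

```r
set.seed(10)
# 200 cells x 5 genes of pure noise: there is no real structure to find
expr <- matrix(rnorm(200 * 5), nrow = 200)

# Cluster the cells using the same data we will test afterwards
cl <- kmeans(expr, centers = 2, nstart = 10)$cluster

# Test each gene between the two clusters
pvals <- apply(expr, 2, function(g) t.test(g[cl == 1], g[cl == 2])$p.value)
min(pvals)
```

Even though the data contain no true groups, at least one gene comes out as highly significant, because the clusters were chosen to maximise exactly the differences being tested.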
Do not use batch-integrated expression data for calculating marker gene scores; instead, include batch in the statistical model (the scoreMarkers() function has the block argument to achieve this).
The normalization strategy has a big influence on the estimated differences in expression between cells and between clusters.
A lot of what you get might be noise. Take two random sets of cells and run DE, and you will probably have a few significant genes with most of the commonly used tests.
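This is easy to check with simulated data. The sketch below, in base R, assigns cells to two groups at random and runs a t-test per gene on pure noise; the matrix sizes and the unadjusted 0.05 threshold are arbitrary choices for illustration.

```r
set.seed(123)
n_genes <- 2000
n_cells <- 100

# Pure noise: no gene is truly different between any groups of cells
expr <- matrix(rnorm(n_genes * n_cells), nrow = n_genes)

# Randomly split the cells into two equally sized groups
group <- sample(rep(c("A", "B"), each = n_cells / 2))

# One t-test per gene
pvals <- apply(expr, 1, function(g) {
  t.test(g[group == "A"], g[group == "B"])$p.value
})

# Number of 'significant' genes at an unadjusted threshold of 0.05
n_sig <- sum(pvals < 0.05)
```

Roughly 5% of the genes are expected to pass the threshold by chance alone, which is why multiple-testing correction and independent validation matter.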
It’s important to assess and validate the results. Think of the results as hypotheses that need independent verification (e.g. microscopy, qPCR).