Marker Gene Identification

Nov 2021

Single Cell RNAseq Analysis Workflow

Identifying Cluster Marker Genes

Our goal is to identify genes that are differentially expressed between clusters:

these genes may or may not be exclusively expressed in a single cluster
different methods test for:
- differences in the mean expression level
- differences in the rank of expression
- differences in the proportion of cells expressing the gene
the results are compiled into a summary table

Differential expression

Differential expression is comparative. Common comparisons include:
pairwise cluster comparisons
- e.g. cluster 1 vs cluster 2, cluster 1 vs cluster 3, cluster 2 vs cluster 3, etc.
for a given cluster find ‘marker genes’ that are:
- DE compared to at least one other cluster
- DE compared to each of the other clusters
- DE compared to “most” of the other clusters
- DE and up-regulated (easier to interpret)
cell-type comparisons (if cell type is known) - with and without clustering
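
The "at least one", "each" and "most" criteria above differ only in how the pairwise p-values for a cluster are combined. Below is a minimal base-R sketch for a single simulated gene. All values are hypothetical, and the two combining rules shown (Simes' method and the maximum p-value) are common choices that, to our understanding, correspond to `pval.type = "any"` and `pval.type = "all"` in `findMarkers`; treat it as an illustration rather than the package's implementation.

```r
set.seed(1)

# Hypothetical log-expression of one gene in four clusters;
# c1 is higher than c2 and c3 but similar to c4
expr <- list(
  c1 = rnorm(40, mean = 3),
  c2 = rnorm(40, mean = 1),
  c3 = rnorm(40, mean = 1),
  c4 = rnorm(40, mean = 3)
)

# Pairwise Welch t-tests of cluster c1 against each other cluster
others <- setdiff(names(expr), "c1")
pairwise_p <- sapply(others, function(o) t.test(expr$c1, expr[[o]])$p.value)

# "DE compared to at least one other cluster": Simes' combined p-value
n <- length(pairwise_p)
p_any <- min(sort(pairwise_p) * n / seq_len(n))

# "DE compared to each of the other clusters": the largest pairwise p-value
p_all <- max(pairwise_p)

round(c(any = p_any, all = p_all), 4)
```

In this toy setting the gene would be expected to pass the "at least one" criterion easily but not the "each" criterion, because c1 and c4 were simulated with the same mean.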

`findMarkers`

library(scran)

markers <- findMarkers(
  sce,
  groups = sce$louvain,       # clusters to compare
  block = sce$SampleGroup,    # blocking factor (known covariate, e.g. batch or sample group)
  test.type = "t",            # t-test (default)
  direction = "any",          # test for either higher or lower expression (default)
  lfc = 0,                    # null hypothesis log-fold-change = 0 (default)
  pval.type = "any"           # ranking of p-values based on any comparison (default)
)
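
To see what the call returns, here is a small sketch on simulated data. The matrix, cluster labels and gene names are made up for illustration, and we assume the scran package is installed; `findMarkers` is given a plain matrix of log-expression values here rather than a `SingleCellExperiment`.

```r
library(scran)

set.seed(100)

# Hypothetical toy data: 200 genes x 90 cells of log-expression values
n_genes <- 200
n_cells <- 90
logexpr <- matrix(
  rnorm(n_genes * n_cells, mean = 1),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), NULL)
)
clusters <- rep(c("A", "B", "C"), each = 30)

# Make the first 10 genes clearly higher in cluster A
logexpr[1:10, clusters == "A"] <- logexpr[1:10, clusters == "A"] + 3

# Only look for up-regulated genes, which are easier to interpret
markers <- findMarkers(logexpr, groups = clusters,
                       test.type = "t", direction = "up")

names(markers)          # one table per cluster
head(markers[["A"]])    # genes ranked for cluster A
```

Each element of the result is a table for one cluster, with one row per gene, a combined p-value and FDR, and one log-fold-change column per comparison against another cluster.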

Gene-wise null hypothesis

t-test: “Is the mean expression of a gene in cluster 1 and cluster 2 the same?”
Wilcoxon rank-sum test: “Is it equally likely that a randomly selected cell from cluster 1 has higher or lower expression of a gene than a randomly selected cell from cluster 2?”
Binomial test: “Is the probability of a gene being expressed the same in cluster 1 and cluster 2?”
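
The three questions above can be tried out on a single simulated gene with base R. This is only a sketch: the expression values are hypothetical, and the binomial test is phrased in one possible way (conditioning on the total number of expressing cells), which is not necessarily how `findMarkers` implements it.

```r
set.seed(42)

# Hypothetical expression of one gene: cluster 1 expresses it more often
# and at a higher level than cluster 2; zeros mimic non-expressing cells
n1 <- 60
n2 <- 80
expr1 <- rbinom(n1, 1, 0.8) * rexp(n1, rate = 0.5)
expr2 <- rbinom(n2, 1, 0.4) * rexp(n2, rate = 1)

# t-test: is the mean expression the same in both clusters?
p_t <- t.test(expr1, expr2)$p.value

# Wilcoxon rank-sum test: is a random cell from cluster 1 equally likely
# to be higher or lower than a random cell from cluster 2?
# exact = FALSE because the many zeros create ties
p_w <- wilcox.test(expr1, expr2, exact = FALSE)$p.value

# Binomial test: is the proportion of expressing cells the same?
# Under the null, the share of expressing cells that come from cluster 1
# should match cluster 1's share of all cells.
k1 <- sum(expr1 > 0)
k2 <- sum(expr2 > 0)
p_b <- binom.test(k1, k1 + k2, p = n1 / (n1 + n2))$p.value

c(t = p_t, wilcoxon = p_w, binomial = p_b)
```

The three p-values need not agree, since each test answers a different question about the same gene.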

Statistical challenges

To an extent, all these models poorly capture the underlying features of the data.

high noise levels (technical and biological factors)
small library sizes
small amounts of available mRNAs result in amplification biases and dropout events
3’ bias, partial coverage, uneven depth of transcripts
stochastic nature of transcription
multimodality in gene expression (presence of multiple possible cell states within a cell population)

Performance of different tests

However:

t-test and Wilcoxon rank-sum test work well in practice, given at least a few dozen cells to compare
Bulk RNA-seq analysis methods do not generally perform worse than those specifically developed for scRNA-seq
Filtering out lowly expressed genes is quite important for good performance of bulk methods (edgeR, DESeq2)

(source: Soneson & Robinson 2018)
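
As an illustration of the filtering step, the sketch below removes genes detected in fewer than 10% of cells from a simulated count matrix. The data and the 10% cut-off are assumptions made for this example, not a recommendation from the source.

```r
set.seed(7)

# Hypothetical counts: 500 genes x 100 cells
# rows 1-250 are lowly expressed, rows 251-500 are well expressed
counts <- matrix(
  rpois(500 * 100, lambda = rep(c(0.02, 2), each = 250)),
  nrow = 500
)

# Proportion of cells in which each gene is detected
prop_detected <- rowMeans(counts > 0)

# Keep genes detected in at least 10% of cells (arbitrary threshold)
keep <- prop_detected >= 0.1
filtered <- counts[keep, , drop = FALSE]

dim(counts)
dim(filtered)
```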

So, what’s really important?

It’s important to understand what we are trying to compare with the different tests (difference in mean expression, difference in probability of being expressed, probability of being highly/lowly expressed)
It’s important to understand the underlying data
It’s important to assess and validate the results
- Strictly speaking, identifying genes differentially expressed between clusters is statistically flawed, since the clusters were themselves defined from the same gene expression data. Validation is crucial as a follow-up from these analyses.
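
The circularity can be illustrated with a small base-R simulation: cluster pure noise, then test for differences between the resulting clusters. Everything here is hypothetical (simulated data, k-means as a stand-in for the clustering step, a per-gene Welch t-test), so it is a sketch of the problem rather than a reproduction of the workflow.

```r
set.seed(123)

# Pure noise: 50 genes x 200 cells, with no real structure
n_genes <- 50
n_cells <- 200
noise <- matrix(rnorm(n_genes * n_cells), nrow = n_genes)

# Cluster the cells (cells must be rows for kmeans)
cl <- kmeans(t(noise), centers = 2, nstart = 5)$cluster

# Test every gene between the two data-derived clusters
p_clustered <- apply(noise, 1, function(g) {
  t.test(g[cl == 1], g[cl == 2])$p.value
})

# Same test with shuffled labels, i.e. groups not derived from the data
shuffled <- sample(cl)
p_shuffled <- apply(noise, 1, function(g) {
  t.test(g[shuffled == 1], g[shuffled == 2])$p.value
})

# Number of genes with p < 0.05 in each case
c(clustered = sum(p_clustered < 0.05), shuffled = sum(p_shuffled < 0.05))
```

With data-derived clusters one would typically expect many more small p-values than with shuffled labels, even though there is no real biology in the data.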

Things to think about: during analysis

Do not use batch-integrated expression data for differential analysis
- Instead, include batch in the statistical model (the findMarkers() function has the block argument to achieve this)
Depending on the method you choose, use counts, normalized counts or log-normalized counts.
Normalization strategy has a big influence on the results in differential expression.
- e.g. comparing a cell type with few expressed genes vs a cell type with many expressed genes.
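
To illustrate the idea of putting batch in the model rather than testing on batch-integrated values, here is a base-R sketch for a single simulated gene. It uses an ordinary linear model with batch as a covariate, which is a simple stand-in and not what the `block` argument does internally; all effect sizes are made up.

```r
set.seed(11)

# Hypothetical design: 2 clusters x 2 batches, 120 cells in total
n <- 120
cluster <- factor(rep(c("c1", "c2"), each = n / 2))
batch <- factor(rep(c("b1", "b2"), times = n / 2))

# Simulated log-expression: true cluster effect 0.8, batch effect 1.5
expr <- 1 + 0.8 * (cluster == "c2") + 1.5 * (batch == "b2") +
  rnorm(n, sd = 0.5)

# Cluster effect estimated while accounting for batch
fit <- lm(expr ~ cluster + batch)
coef(summary(fit))["clusterc2", c("Estimate", "Pr(>|t|)")]
```

The cluster coefficient is estimated from the uncorrected expression values, with the batch difference absorbed by its own term in the model.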

Things to think about: after analysis

A lot of what you get might be noise. Take two random sets of cells, run DE, and you will probably have a few significant genes with most of the commonly used tests.
Think of the results as hypotheses that need independent verification (e.g. microscopy, qPCR)