Nov 2021
Our goal is to identify genes that are differentially expressed between clusters,
whether or not they are exclusively expressed in a single cluster,
using different methods that test for:
differences in the mean expression level
differences in the rank of expression
differences in the proportion of cells expressing the gene
compile a summary table
Differential expression is comparative. Common comparisons include:
pairwise cluster comparisons
for a given cluster find ‘marker genes’ that are:
cell-type comparisons (if cell type is known) - with and without clustering
findMarkers
findMarkers(
  sce,
  groups = sce$louvain,      # clusters to compare
  block = sce$SampleGroup,   # covariates in statistical model
  test.type = "t",           # t-test (default)
  direction = "any",         # test for either higher or lower expression (default)
  lfc = 0,                   # null hypothesis log-fold-change = 0 (default)
  pval.type = "any"          # ranking of p-values based on any comparison (default)
)
t-test: “Is the mean expression of a gene in cluster 1 and cluster 2 the same?”
Wilcoxon rank-sum test: “Is it equally likely that a randomly selected cell from cluster 1 has higher or lower expression of a gene than a randomly selected cell from cluster 2?”
Binomial test: “Is the probability of a gene being expressed the same in cluster 1 and cluster 2?”
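As a rough illustration of the three questions above, the sketch below runs base R tests on simulated values for one hypothetical gene. The simulated data, the variable names, and the choice of `binom.test()` conditioned on the total number of expressing cells are assumptions made for this example; it is not meant to reproduce what `findMarkers()` does internally.

```r
set.seed(42)

# Simulated log-expression of one hypothetical gene in two clusters of 50 cells.
# Negative draws are set to zero to mimic cells in which the gene is not detected.
cluster1 <- pmax(rnorm(50, mean = 1, sd = 1), 0)
cluster2 <- pmax(rnorm(50, mean = 2, sd = 1), 0)
n1 <- length(cluster1)
n2 <- length(cluster2)

# t-test (Welch): is the mean expression the same in both clusters?
t_res <- t.test(cluster1, cluster2)

# Wilcoxon rank-sum test: is a random cell from cluster 1 equally likely to have
# higher or lower expression than a random cell from cluster 2?
w_res <- wilcox.test(cluster1, cluster2, exact = FALSE)
# W / (n1 * n2) estimates P(cluster 1 cell > cluster 2 cell), ties counted as half.
auc <- unname(w_res$statistic) / (n1 * n2)

# Binomial test: is the probability of detecting the gene the same in both clusters?
# Given the total number of expressing cells, the number that falls in cluster 1
# is approximately binomial with probability n1 / (n1 + n2) under the null.
k1 <- sum(cluster1 > 0)
k2 <- sum(cluster2 > 0)
b_res <- binom.test(k1, k1 + k2, p = n1 / (n1 + n2))

print(c(t = t_res$p.value, wilcoxon = w_res$p.value, binomial = b_res$p.value))
print(auc)
```

The three p-values answer different questions, so a gene can be significant under one test and not under another.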
To some extent, all of these tests capture the underlying features of scRNA-seq data poorly. Those features include:
high noise levels (technical and biological factors)
small library sizes
small amounts of available mRNAs result in amplification biases and dropout events
3’ bias, partial coverage, uneven depth of transcripts
stochastic nature of transcription
multimodality in gene expression (presence of multiple possible cell states within a cell population)
However:
t-test and Wilcoxon rank-sum test work well in practice, given at least a few dozen cells to compare
Bulk RNA-seq analysis methods do not generally perform worse than those specifically developed for scRNA-seq
Filtering out lowly expressed genes is quite important for good performance of bulk methods (edgeR, DESeq2)
(source: Soneson & Robinson 2018)
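The filtering step mentioned above can be sketched in base R as follows. The count matrix is simulated, and the 10% detection threshold is purely an illustrative assumption, not a value taken from the source or a recommendation.

```r
set.seed(1)

# Hypothetical count matrix: 200 genes (rows) x 60 cells (columns).
# Per-gene mean expression varies widely, so some genes are rarely detected.
gene_means <- rexp(200, rate = 2)
counts <- matrix(
  rpois(200 * 60, lambda = rep(gene_means, times = 60)),
  nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("cell", 1:60))
)

# Illustrative threshold (an assumption): keep genes detected (count > 0)
# in at least 10% of cells.
min_fraction <- 0.10
detected_fraction <- rowMeans(counts > 0)
keep <- detected_fraction >= min_fraction
filtered <- counts[keep, , drop = FALSE]

cat("Genes before filtering:", nrow(counts), "\n")
cat("Genes after filtering: ", nrow(filtered), "\n")
```

The filtered matrix would then be passed to the chosen bulk method; a suitable threshold depends on the dataset and the size of the smallest group being compared.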
It’s important to understand what we are trying to compare with the different tests (difference in mean expression, difference in probability of being expressed, probability of being highly/lowly expressed)
It’s important to understand the underlying data
It’s important to assess and validate the results
Do not use batch-integrated expression data for differential analysis. Account for batch in the statistical model instead (the findMarkers() function has the block argument to achieve this).
Depending on the method you choose, use counts, normalised counts or log-normalised counts.
The normalisation strategy has a big influence on differential expression results.
A lot of what you get might be noise. Take two random sets of cells, run DE, and you will probably have a few significant genes with most of the commonly used tests.
Think of the results as hypotheses that need independent verification (e.g. microscopy, qPCR)
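The point about noise can be illustrated with a small simulation in base R. This is a sketch under simplifying assumptions: expression values are independent Gaussian noise with no true differences, cells are split into two random groups, and the gene and cell numbers are arbitrary. Real scRNA-seq data are messier than this.

```r
set.seed(123)

n_genes <- 2000
n_cells <- 100

# Simulated log-expression with no true differences: every cell is drawn
# from the same distribution for every gene.
expr <- matrix(rnorm(n_genes * n_cells), nrow = n_genes)

# Split the cells into two random groups of equal size.
group <- sample(rep(c("A", "B"), each = n_cells / 2))

# One Welch t-test per gene between the two random groups.
pvals <- apply(expr, 1, function(x) {
  t.test(x[group == "A"], x[group == "B"])$p.value
})

# Number of genes called significant before and after multiple-testing correction.
n_raw <- sum(pvals < 0.05)
n_bh  <- sum(p.adjust(pvals, method = "BH") < 0.05)

cat("Genes with unadjusted p < 0.05: ", n_raw, "\n")
cat("Genes with BH-adjusted p < 0.05:", n_bh, "\n")
```

Under these assumptions, roughly 5% of genes are expected to fall below an unadjusted threshold of 0.05 by chance alone, which is one reason multiple-testing correction and independent validation both matter.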