26.07.2021
Goals:
identify genes that are differentially expressed between clusters,
exclusively or not,
using different methods that test:
the mean expression level,
the whole distribution,
or the proportion of cells expressing the gene
compile a summary table.
Challenges:
Over-interpretation of the results
Combining different types of marker identification
Recommendations:
Think of the results as hypotheses that need verification.
Identify all markers conserved between conditions for each cluster
Identify markers that are differentially expressed between specific clusters
Differential expression is comparative. Common comparisons include:
pairwise cluster comparisons,
for a given cluster finding ‘marker genes’ that are:
cell-type comparisons (if cell type is known) (with and without clustering)
findMarkers
high noise levels (technical and biological factors)
low library sizes
low amounts of available mRNA result in amplification biases and dropout events
3’ bias, partial coverage, uneven depth
stochastic nature of transcription
multimodality in gene expression (presence of multiple possible cell states within a cell population)
The better the model fits the data, the more accurate the statistics
When we cannot fit a model to our data, we resort to non-parametric tests (e.g. Wilcoxon rank-sum test, Kruskal-Wallis test, Kolmogorov-Smirnov test)
Non-parametric tests generally convert observed expression values to ranks
They test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
May fail in the presence of a large number of tied values, as is the case with dropouts (zeros) in scRNA-seq
If the conditions for a parametric test hold, it will typically be more powerful than a non-parametric test
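To make the point about ranks and ties concrete, here is a minimal sketch of the rank conversion, assuming the usual mid-rank convention for tied values. The function name and the expression values are hypothetical and only serve as illustration.

```python
def midranks(values):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is identical.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Average of the 1-based positions start+1 .. end+1.
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


# Hypothetical log-expression values for one gene; four of seven cells are dropouts.
expression = [0.0, 0.0, 1.2, 0.0, 3.4, 0.0, 2.1]
ranks = midranks(expression)
print(ranks)
```

All four zeros receive the same rank, so more than half of the cells carry no ordering information at all. This is the situation in which rank-based tests lose resolution.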
Wilcoxon rank-sum test: “It is equally likely that a randomly selected cell from cluster 1 will have higher or lower expression of the gene than a randomly selected cell from cluster 2”
Binomial test: “Probability of being expressed is the same in cluster 1 and cluster 2”
t-test: “Mean expression of the gene is the same in cluster 1 and cluster 2”
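The three null hypotheses above each look at a different summary of the same data. The sketch below computes the quantity each test is built on for one gene in two clusters, using only the Python standard library. The function names and the toy values are assumptions made for illustration, and no p-values are computed.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch t statistic: difference in means scaled by its standard error."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


def prob_higher(x, y):
    """Estimate P(cell from x > cell from y), counting ties as one half.

    Under the Wilcoxon rank-sum null hypothesis this value is 0.5.
    """
    wins = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    return wins / (len(x) * len(y))


def detection_rate(x):
    """Proportion of cells in which the gene is detected (non-zero)."""
    return sum(v > 0 for v in x) / len(x)


# Hypothetical log-expression values of one gene in two clusters.
cluster1 = [0.0, 0.0, 2.0, 3.0, 4.0, 3.0]
cluster2 = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0]

t_stat = welch_t(cluster1, cluster2)
auc = prob_higher(cluster1, cluster2)
rate1, rate2 = detection_rate(cluster1), detection_rate(cluster2)
print(f"t = {t_stat:.3f}, P(higher) = {auc:.3f}, detected = {rate1:.2f} vs {rate2:.2f}")
```

A gene can look different between clusters on one of these summaries and not on another, which is why it matters to decide up front which question is being asked.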
Some highlights:
t-test and Wilcoxon work well, given at least a few dozen cells to compare
Bulk RNA-seq analysis methods do not generally perform worse than those specifically developed for scRNA-seq
Filtering out lowly expressed genes is quite important for good performance of bulk methods (edgeR, DESeq2)
It’s important to understand what we are trying to compare, e.g. mean expression or the probability of being expressed
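A minimal sketch of such a filter is shown here, assuming a gene counts as "expressed" when it is detected in a minimum number of cells. The function name and both thresholds are hypothetical defaults, not values recommended by the source, and should be tuned to the dataset.

```python
def filter_genes(counts, min_cells=3, min_count=1):
    """Keep genes with at least `min_count` reads in at least `min_cells` cells."""
    return {
        gene: row
        for gene, row in counts.items()
        if sum(c >= min_count for c in row) >= min_cells
    }


# Hypothetical raw counts: one list of per-cell counts for each gene.
counts = {
    "GeneA": [0, 0, 5, 2, 1, 0],
    "GeneB": [0, 0, 0, 1, 0, 0],
    "GeneC": [3, 4, 0, 6, 2, 1],
}

kept = filter_genes(counts)
print(list(kept))
```

Here GeneB is dropped because it is detected in a single cell only.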
It’s important to understand the data
It’s important to assess and validate the results
Always go back to RNA assay (or similar) for doing differential expression.
Depending on the method you choose, use counts, normalised counts or log-normalised counts.
The normalization strategy has a big influence on differential expression results; size factors may help.
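As an illustration of what size factors do, the sketch below uses the simplest variant: library-size factors scaled to a mean of 1, followed by a log2 transform with a pseudocount. This is an assumption chosen for brevity. Other size-factor estimators exist and may suit a given dataset better, and the function names are hypothetical.

```python
import math


def library_size_factors(counts_per_cell):
    """One factor per cell: its total count divided by the mean total count."""
    totals = [sum(cell) for cell in counts_per_cell]
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]


def log_normalize(counts_per_cell, size_factors, pseudocount=1.0):
    """Divide each cell's counts by its size factor, then take log2(x + pseudocount)."""
    return [
        [math.log2(c / sf + pseudocount) for c in cell]
        for cell, sf in zip(counts_per_cell, size_factors)
    ]


# Hypothetical counts: one list of per-gene counts for each cell.
# Cells 1 and 2 have the same composition but cell 2 was sequenced twice as deeply.
cells = [[10, 0, 30], [20, 0, 60], [5, 5, 10]]

factors = library_size_factors(cells)
lognorm = log_normalize(cells, factors)
print(factors)
print(lognorm)
```

After normalization the first two cells have the same values, so the twofold difference in sequencing depth no longer looks like differential expression.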
Do not forget to account for batch effects, e.g. use the block argument in findMarkers to model the batch effect.
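The idea behind blocking can be sketched as follows: make the comparison within each batch, then combine the per-batch results, so that differences between batches never enter the comparison. This is a simplified illustration under stated assumptions, not a reimplementation of findMarkers. The function name, the weighting by cell number and the toy data are all hypothetical.

```python
from statistics import mean


def blocked_mean_difference(expr, cluster, batch, a, b):
    """Average the within-batch difference in mean expression (a minus b).

    Each batch is weighted by the number of cells it contributes to the comparison.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for level in sorted(set(batch)):
        xa = [e for e, c, bt in zip(expr, cluster, batch) if bt == level and c == a]
        xb = [e for e, c, bt in zip(expr, cluster, batch) if bt == level and c == b]
        if not xa or not xb:
            continue  # this batch cannot inform the comparison
        weight = len(xa) + len(xb)
        weighted_sum += weight * (mean(xa) - mean(xb))
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no batch contains cells from both clusters")
    return weighted_sum / total_weight


# Hypothetical gene: expression depends on the batch only, not on the cluster.
# Batch b1 is mostly cluster A, batch b2 is mostly cluster B.
expr = [3.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 1.0]
cluster = ["A", "A", "A", "B", "A", "B", "B", "B"]
batch = ["b1"] * 4 + ["b2"] * 4

pooled = mean(e for e, c in zip(expr, cluster) if c == "A") - mean(
    e for e, c in zip(expr, cluster) if c == "B"
)
blocked = blocked_mean_difference(expr, cluster, batch, "A", "B")
print(pooled, blocked)
```

Ignoring the batch suggests a difference between the clusters, while the blocked comparison shows none, because the apparent effect comes entirely from the unequal cluster composition of the two batches.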