Introduction to single-cell RNA-seq analysis - Differential expression and abundance between conditions

CRUK bioinformatics summer school - July 2021

Outline

Clusters and/or cell types have been identified; we now want to compare sample groups:

Replicates are samples, not cells:

Pseudo-bulk:
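
A minimal sketch of pseudo-bulk aggregation, assuming the usual definition: counts are summed across all cells that share the same label (cluster or cell type) and the same sample, giving one count profile per label-sample pair. The function name `pseudo_bulk` and the toy data are made up for illustration; this is not the course's own code.

```python
from collections import defaultdict

def pseudo_bulk(counts, samples, labels):
    """Sum per-cell gene counts within each (label, sample) pair.

    counts  : one list of gene counts per cell
    samples : sample of origin for each cell
    labels  : cluster or cell type assigned to each cell
    """
    n_genes = len(counts[0])
    summed = defaultdict(lambda: [0] * n_genes)
    n_cells = defaultdict(int)
    for cell_counts, sample, label in zip(counts, samples, labels):
        key = (label, sample)
        n_cells[key] += 1
        profile = summed[key]
        for g, c in enumerate(cell_counts):
            profile[g] += c
    return dict(summed), dict(n_cells)

# Toy data (hypothetical): 5 cells, 3 genes, 2 samples, 2 labels.
counts = [[1, 0, 3], [2, 1, 0], [0, 4, 1], [5, 0, 0], [1, 1, 1]]
samples = ["s1", "s1", "s2", "s2", "s1"]
labels = ["A", "A", "A", "B", "B"]
profiles, n_cells = pseudo_bulk(counts, samples, labels)
```

Each resulting profile is then treated as one replicate, so the number of cells contributing to it is kept alongside for later filtering.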

An example. t-SNE plots showing clusters and sample groups (left) and samples (right):

Workflow:

Method:

quasi-likelihood (QL) methods from the edgeR package
negative binomial generalized linear model (NB GLM)
- to handle overdispersed count data
- in experiments with limited replication
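
To make "overdispersed" concrete, the sketch below evaluates the negative binomial variance function, mean + dispersion x mean^2, which reduces to the Poisson variance when the dispersion is zero. The optional `ql_dispersion` multiplier reflects the quasi-likelihood form of the variance as we understand it; the function name and the numbers are illustrative assumptions only.

```python
def nb_variance(mean, dispersion, ql_dispersion=1.0):
    """Variance of a count with the given mean under an NB model.

    dispersion    : NB dispersion (0 gives the Poisson variance)
    ql_dispersion : quasi-likelihood multiplier (1.0 gives a plain NB model)
    """
    return ql_dispersion * (mean + dispersion * mean ** 2)

# Illustrative values: same mean, with and without overdispersion.
poisson_var = nb_variance(100, 0.0)
nb_var = nb_variance(100, 0.25)
```

With a mean of 100, a dispersion of 0.25 inflates the variance well beyond the Poisson value, which is why a Poisson model would overstate significance on such data.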

Steps:

Remove pseudo-bulk samples with very low library sizes, e.g. those built from < 20 cells
- better normalisation
Remove genes that are lowly expressed
- reduces computational work,
- improves the accuracy of mean-variance trend modelling
- decreases the severity of the multiple testing correction
- filter: keep genes above a log-CPM threshold in a minimum number of samples, set by the size of the smallest sample group
Correct for composition biases
- by computing normalisation factors with the trimmed mean of M-values (TMM) method
Test whether the log-fold change between sample groups is significantly different from zero
- estimate the negative binomial (NB) dispersions
- estimate the quasi-likelihood dispersions, which model the uncertainty and variability of the per-gene variance
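
The two filtering steps above can be sketched in plain Python. This is a simplified stand-in, not edgeR's own filtering: it thresholds on CPM rather than log-CPM (equivalent for a fixed cut-off), and the function names, the `min_cpm=1.0` default and the toy counts are assumptions. TMM normalisation and dispersion estimation are not reproduced here.

```python
from collections import Counter

def keep_samples(n_cells, min_cells=20):
    """Indices of pseudo-bulk samples built from at least min_cells cells."""
    return [i for i, n in enumerate(n_cells) if n >= min_cells]

def keep_genes(counts, groups, min_cpm=1.0):
    """Indices of genes whose CPM reaches min_cpm in enough samples.

    counts : genes x samples table of pseudo-bulk counts
    groups : condition of each sample; the smallest group sets the
             minimum number of samples a gene must be expressed in
    """
    lib_sizes = [sum(col) for col in zip(*counts)]
    min_samples = min(Counter(groups).values())
    kept = []
    for g, row in enumerate(counts):
        n_pass = sum(1e6 * c / n >= min_cpm for c, n in zip(row, lib_sizes))
        if n_pass >= min_samples:
            kept.append(g)
    return kept

# Toy data (hypothetical): 3 genes x 4 samples, two conditions.
n_cells = [150, 12, 80, 45]
counts = [
    [500, 400, 600, 450],
    [0, 0, 1, 0],
    [499_500, 599_600, 399_399, 549_550],
]
groups = ["ctrl", "ctrl", "trt", "trt"]
good_samples = keep_samples(n_cells)
good_genes = keep_genes(counts, groups)
```

In this toy table the second gene is detected in only one sample, fewer than the two samples in the smallest group, so it is dropped.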

Aim: test for significant changes in per-cluster cell abundance across conditions

Example: which cell types are depleted or enriched upon treatment?

The methods used were originally developed for flow cytometry data.

Steps:

Count cells assigned to each label, i.e. cluster or cell type
Same workflow as for differential expression above
- except counts are of cells per label, not of reads per gene
Share information across labels
- to improve our estimates of the biological variability in cell abundance between replicates.
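
The first step, counting cells per label and sample, can be sketched as below. The result is a labels x samples table that plays the role of the genes x samples count table in the differential expression workflow. The function name `abundance_table` and the toy assignments are illustrative assumptions.

```python
from collections import Counter

def abundance_table(samples, labels):
    """Count cells per label (rows) and per sample (columns)."""
    counts = Counter(zip(labels, samples))
    label_names = sorted(set(labels))
    sample_names = sorted(set(samples))
    table = {
        lab: [counts[(lab, s)] for s in sample_names] for lab in label_names
    }
    return sample_names, table

# Toy data (hypothetical): 6 cells from 2 samples, assigned to 2 labels.
samples = ["s1", "s1", "s2", "s2", "s1", "s2"]
labels = ["A", "A", "A", "B", "B", "B"]
sample_names, table = abundance_table(samples, labels)
```

Label-sample pairs with no cells get a count of zero, so every label has a value for every sample.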