- Motivation
- Biases
  - Depth bias
  - Composition bias
  - Mean-variance correlation
- Normalisation strategies
- Feature Selection
April 2026
We derive biological insights downstream by comparing cells against each other.
Differences in the total UMI count of each cell make direct comparison of expression profiles between cells inappropriate.
Why does the total number of transcript molecules (UMI counts) detected differ between cells?
Normalization aims to reduce technical differences whilst preserving biological differences, thus allowing meaningful comparison of expression profiles between cells.
Depth bias: differences in sequencing depth (total reads or UMIs captured) between cells
Simple library size normalization accounts for the depth bias
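As a minimal sketch of library size normalization, the example below uses a small made-up count matrix (genes in rows, cells in columns). The function names `library_size_factors` and `normalize`, and the choice to scale size factors so that they average to 1, are illustrative assumptions rather than any particular package's API.

```python
# Toy UMI count matrix: rows are genes, columns are cells (made-up values).
# The three cells have identical composition but different depth.
counts = [
    [10, 20, 5],    # gene A
    [30, 60, 15],   # gene B
    [60, 120, 30],  # gene C
]


def library_size_factors(counts):
    """Size factor per cell: its library size divided by the mean library size."""
    n_cells = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_cells)]
    mean_size = sum(lib_sizes) / n_cells
    return [size / mean_size for size in lib_sizes]


def normalize(counts, size_factors):
    """Divide each cell's counts by that cell's size factor."""
    return [[c / sf for c, sf in zip(row, size_factors)] for row in counts]


size_factors = library_size_factors(counts)
normalized = normalize(counts, size_factors)
```

Because the toy cells differ only in depth, each gene ends up with the same normalized value in every cell once the depth bias is divided out.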
Mean and variance of raw counts for genes are correlated
More highly expressed genes tend to look more variable because larger numbers result in higher variance
A gene expressed at a low level tends to have a low variance across cells:
var(c(2, 4, 2, 4, 2, 4, 2, 4)) = 1.14
A gene with the same proportional differences between cells, but expressed at a higher level will have higher variance:
var(c( 20, 40, 20, 40, 20, 40, 20, 40)) = 114.29
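The two variance calculations above can be reproduced with Python's standard library, where `statistics.variance` uses the same n - 1 denominator as R's `var()`. The coefficient of variation (standard deviation divided by mean) is added here as an illustration of "same proportional differences"; it is not part of the original slides.

```python
from statistics import mean, stdev, variance

low = [2, 4, 2, 4, 2, 4, 2, 4]           # lowly expressed gene
high = [20, 40, 20, 40, 20, 40, 20, 40]  # same pattern at 10x the level

# Sample variance (n - 1 denominator), matching R's var().
var_low = variance(low)
var_high = variance(high)

# Coefficient of variation: spread relative to the mean.
cv_low = stdev(low) / mean(low)
cv_high = stdev(high) / mean(high)

print(round(var_low, 2), round(var_high, 2))
```

The variance grows 100-fold while the coefficient of variation is unchanged, which is why raw variance alone is a poor guide to biological variability.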
Normalization has two steps
e.g. DESeq’s size factor
Simple log transformation of the counts is not ideal for scRNA-seq data, as it fails to fully account for the mean-variance relationship in the data and can lead to overemphasis of lowly expressed genes.
The primary goal of sctransform is to achieve improved variance stabilization.
Steps:
Model the “expected counts” for each gene using a regularized negative binomial regression model
\[ \log(\text{expected count}) = \log(\text{sequencing depth}) \times \beta + \text{other factors} \]
Calculate Pearson residuals from the negative binomial regression model - these residuals become the normalized values for downstream analysis.
"This procedure omits the need for heuristic steps including pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. We named this method sctransform."
\[\text{Pearson residual} = \frac{\text{Observed value} - \text{Expected value}}{\sqrt{\text{Expected variance}}}\]
If a gene has exactly the expression level we’d expect based on the model → residual = 0
If a gene is expressed higher than expected → positive residual
If a gene is expressed lower than expected → negative residual
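The three cases above can be checked with a small sketch. It assumes the standard negative binomial variance, mu + mu^2 / theta, as the "expected variance"; the function name `nb_pearson_residual` and the values of `mu` and `theta` are illustrative, not fitted to any data and not the sctransform implementation.

```python
import math


def nb_pearson_residual(observed, expected, theta):
    """Pearson residual under a negative binomial model.

    expected: model-predicted mean count (mu)
    theta:    NB dispersion parameter, so variance = mu + mu**2 / theta
    """
    expected_variance = expected + expected ** 2 / theta
    return (observed - expected) / math.sqrt(expected_variance)


mu, theta = 4.0, 10.0  # illustrative values only

as_expected = nb_pearson_residual(4, mu, theta)  # observed equals expected
higher = nb_pearson_residual(9, mu, theta)       # observed above expected
lower = nb_pearson_residual(1, mu, theta)        # observed below expected

print(as_expected, round(higher, 2), round(lower, 2))
```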
These residuals have several desirable properties for downstream analysis:
Standardized scale: All genes end up on the same scale regardless of their expression level
Variance stabilization: Unlike raw counts, these residuals have roughly the same variance across all expression levels.
Interpretation: Positive residuals mean higher-than-expected expression; negative means lower-than-expected. A residual of +2 means “this gene is expressed about 2 standard deviations higher than expected”.
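To illustrate the variance stabilization property, the sketch below reuses the two toy genes from the mean-variance example. It makes two simplifying assumptions that differ from a real sctransform fit: each gene's own mean is used as its expected count, and a single made-up dispersion `theta` is shared by both genes.

```python
import math
from statistics import mean, variance


def nb_residuals(counts, theta):
    """Pearson residuals using the gene's mean as the expected count (a simplification)."""
    mu = mean(counts)
    sd = math.sqrt(mu + mu ** 2 / theta)  # negative binomial standard deviation
    return [(c - mu) / sd for c in counts]


low = [2, 4, 2, 4, 2, 4, 2, 4]
high = [20, 40, 20, 40, 20, 40, 20, 40]
theta = 10.0  # assumed dispersion, not estimated from data

# How much more variable the highly expressed gene looks, before and after.
raw_ratio = variance(high) / variance(low)
residual_ratio = variance(nb_residuals(high, theta)) / variance(nb_residuals(low, theta))

print(round(raw_ratio, 2), round(residual_ratio, 2))
```

On raw counts the highly expressed gene looks 100 times more variable; on the residual scale the gap shrinks to roughly 3-fold under these assumptions, which is the sense in which residuals put genes on a comparable scale.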
Really Useful Website:
https://biostatsquid.com/sctransform-simple-explanation/
By default these are used for downstream dimensionality reduction and clustering (although in most cases you can change this)
Select genes that capture biologically meaningful variation, while reducing the number of genes that contribute only technical noise
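One simple way to sketch this selection is to rank genes by the variance of their normalized values and keep the top few. The gene names, values, and the helper `top_variable_genes` below are hypothetical, and real tools apply more refined criteria than this plain ranking.

```python
from statistics import variance

# Hypothetical normalized values (e.g. Pearson residuals) per gene across four cells.
residuals = {
    "geneA": [0.1, -0.2, 0.0, 0.1],
    "geneB": [2.5, -1.8, 3.1, -2.2],
    "geneC": [-0.9, 1.1, -1.0, 0.8],
    "geneD": [0.0, 0.1, -0.1, 0.0],
}


def top_variable_genes(values_by_gene, n_top):
    """Return the n_top gene names with the highest variance across cells."""
    ranked = sorted(
        values_by_gene,
        key=lambda gene: variance(values_by_gene[gene]),
        reverse=True,
    )
    return ranked[:n_top]


selected = top_variable_genes(residuals, n_top=2)
print(selected)
```

Genes whose values barely move between cells (here `geneA` and `geneD`) are dropped, so downstream steps work on fewer, more informative features.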