February 2026

Outline

  • Motivation
  • Biases
    • Depth bias
    • Composition bias
    • Mean-variance correlation
  • Normalisation strategies
  • Feature Selection

Workflow

Raw UMI counts distribution

Why do UMI counts differ among the cells?

  • We derive biological insights downstream by comparing cells against each other.

  • But the UMI count differences make it harder to compare cells.

  • Why does the total number of transcript molecules (UMI counts) detected differ between cells?

    • Biological:
      • Cell subtype differences - size and transcriptional activity, variation in gene expression
    • Technical: scRNA-seq data is inherently noisy
      • Low mRNA content per cell
      • Cell-to-cell differences in mRNA capture efficiency
      • Variable sequencing depth
      • PCR amplification efficiency

Normalization reduces technical differences so that the remaining differences between cells are biological rather than technical, allowing meaningful comparison of expression profiles between cells.

Depth bias

Depth bias: differences in total read counts between cells

Simple library size normalization accounts for the depth bias
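
As a rough illustration of library size normalization, the base R sketch below uses a made-up two-cell count matrix in which cell2 was sequenced twice as deeply as cell1. The matrix values and the variable names (`counts`, `size_factors`, `norm_counts`) are assumptions chosen for the example, not data from the course.

```r
# Toy count matrix: rows are genes, columns are cells (made-up values).
# cell2 has exactly twice the counts of cell1 for every gene.
counts <- matrix(
  c(10, 20, 30, 40,
    20, 40, 60, 80),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), c("cell1", "cell2"))
)

# Library size = total counts per cell.
lib_size <- colSums(counts)

# Size factors: library sizes rescaled so that their mean is 1.
size_factors <- lib_size / mean(lib_size)

# Divide each column (cell) by its size factor.
norm_counts <- sweep(counts, 2, size_factors, "/")

print(size_factors)
print(norm_counts)
```

After scaling, the two cells have identical profiles, because the only difference between them was depth.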

Composition bias

  • A few genes contribute most of the read counts
  • In this example, the total read counts are the same across the cells
  • Gene 1 contributes 80% of reads in cell2, leaving other genes with fewer read counts.

  • Library size normalization cannot correct composition bias.
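
A small base R sketch of the situation described above, using invented counts: both cells have the same total, but gene1 takes 80% of the reads in cell2. The numbers and names are assumptions made for illustration only.

```r
# Toy counts (made-up): both cells have 100 reads in total.
counts <- cbind(
  cell1 = c(gene1 = 20, gene2 = 20, gene3 = 20, gene4 = 20, gene5 = 20),
  cell2 = c(gene1 = 80, gene2 = 5,  gene3 = 5,  gene4 = 5,  gene5 = 5)
)

# Library sizes are identical, so the library size factors are both 1.
lib_size <- colSums(counts)
size_factors <- lib_size / mean(lib_size)

# Library size normalization therefore leaves the counts unchanged.
norm_counts <- sweep(counts, 2, size_factors, "/")

# Fraction of cell2's reads taken by gene1.
gene1_share <- counts["gene1", "cell2"] / lib_size["cell2"]

# Apparent fold change (cell2 / cell1) for each gene after normalization.
fold_change <- norm_counts[, "cell2"] / norm_counts[, "cell1"]

print(gene1_share)
print(fold_change)
```

Genes 2 to 5 appear four-fold lower in cell2 even though only gene1 was changed to build the example; the size factors of 1 do nothing to remove that.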

Mean-variance correlation

Mean and variance of raw counts for genes are correlated

More highly expressed genes tend to look more variable because the variance grows with the size of the counts.

A gene expressed at a low level tends to have a low variance across cells:

var(c(2,4,2,4,2,4,2,4)) = 1.14

A gene with the same proportional differences between cells, but expressed at a higher level will have higher variance:

var(c(20,40,20,40,20,40,20,40)) = 114.29

Mean-variance correlation

If we take the logs of the expression values, the variances are the same for both genes:

var(log(c(2,4,2,4,2,4,2,4))) = 0.14

var(log(c(20,40,20,40,20,40,20,40))) = 0.14

This “variance-stabilising transformation” helps to remove the correlation between mean and variance.
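
The two examples generalise: multiplying every value by a constant k multiplies the variance by k squared, while on the log scale the constant becomes an additive shift that does not affect the variance. A short base R check, where `x` and `k` are simply the values used in the slides above:

```r
x <- c(2, 4, 2, 4, 2, 4, 2, 4)  # low-expression gene from the example
k <- 10                         # scaling that gives the high-expression gene

# Raw scale: variance is multiplied by k^2.
raw_ratio <- var(k * x) / var(x)

# Log scale: log(k * x) = log(k) + log(x), so the variance is unchanged.
log_diff <- var(log(k * x)) - var(log(x))

print(raw_ratio)
print(log_diff)
```

Real count data contain zeros, for which `log()` is undefined, which is one reason a pseudocount or a different transformation is used in practice.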

General principle behind normalisation

Normalization has two steps

  1. Scaling
    • Calculate size factors or normalization factors that represent the relative depth bias in each cell
    • Scale the counts for each gene in each cell by dividing the raw counts by the cell-specific size factor
  2. Transformation: Transform the data after scaling
    • Per million (e.g. CPM)
    • log2 (e.g. Deconvolution)
    • Pearson residuals (e.g. sctransform)
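
The two steps can be sketched in base R as follows. This is a minimal illustration, assuming library-size-based size factors and a log2 transformation with a pseudocount of 1; the pseudocount, the toy counts and the variable names are assumptions, and real methods estimate the size factors more carefully.

```r
# Toy counts (made-up): rows are genes, columns are cells.
counts <- matrix(
  c(0, 3, 7,
    0, 6, 14),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), c("cell1", "cell2"))
)

# Step 1: scaling. Size factors from library sizes, centred on 1.
lib_size <- colSums(counts)
size_factors <- lib_size / mean(lib_size)
scaled <- sweep(counts, 2, size_factors, "/")

# Step 2: transformation. log2 with a pseudocount of 1 so zeros stay defined.
log_norm <- log2(scaled + 1)

print(scaled)
print(log_norm)
```

With a pseudocount of 1, a zero count maps to zero after the transformation, which keeps the matrix sparse.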

Bulk RNA-seq methods are not suitable for scRNA-seq data

CPM: convert raw counts to counts-per-million (CPM)

  • for each cell
  • by dividing counts by the library size, then multiplying by 1,000,000.
  • does not address compositional bias caused by highly expressed genes that are also differentially expressed between cells.
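
A minimal base R sketch of the CPM calculation on an invented matrix (the counts and the name `cpm` are assumptions for illustration):

```r
# Toy counts (made-up): cell2 was sequenced 20 times more deeply than cell1.
counts <- matrix(
  c(5, 15, 30, 50,
    100, 300, 600, 1000),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), c("cell1", "cell2"))
)

# Divide each cell's counts by its library size, then multiply by one million.
lib_size <- colSums(counts)
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

print(cpm)
print(colSums(cpm))
```

Every cell sums to one million after the conversion, so a gene that takes a large share of one cell's reads necessarily pushes the CPM of that cell's other genes down.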

DESeq’s size factor

  • For each gene, compute geometric mean across cells
  • For each cell
    • compute for each gene the ratio of its expression to its geometric mean,
    • derive the cell’s size factor as the median ratio across genes.
  • Not suitable for sparse scRNA-seq data: a gene with a zero count in any cell has a geometric mean of zero, so only genes detected in every cell can be used, and in sparse data few or none are.
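
The median-of-ratios steps above can be sketched in base R. This is a simplified illustration, not DESeq's own code: the function name `median_ratio_size_factors` and both toy matrices are hypothetical, and genes whose geometric mean is zero are simply dropped.

```r
# Hypothetical helper implementing the median-of-ratios idea.
median_ratio_size_factors <- function(counts) {
  # Geometric mean per gene; any zero gives log(0) = -Inf, so the mean is 0.
  geo_mean <- exp(rowMeans(log(counts)))
  usable <- geo_mean > 0
  if (!any(usable)) {
    # No gene is detected in every cell: size factors cannot be computed.
    return(rep(NA_real_, ncol(counts)))
  }
  # Ratio of each count to its gene's geometric mean, then median per cell.
  ratios <- counts[usable, , drop = FALSE] / geo_mean[usable]
  apply(ratios, 2, median)
}

# Dense toy data: cell2 has twice the counts of cell1.
dense <- cbind(cell1 = c(10, 20, 40), cell2 = c(20, 40, 80))
sf_dense <- median_ratio_size_factors(dense)

# Sparse toy data: every gene has a zero in at least one cell.
sparse <- cbind(cell1 = c(0, 20, 0), cell2 = c(20, 0, 80))
sf_sparse <- median_ratio_size_factors(sparse)

print(sf_dense)
print(sf_sparse)
```

On the dense matrix the size factors differ by a factor of two, as expected; on the sparse matrix no gene survives the filter and the size factors are undefined.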

SCTransform

“This procedure omits the need for heuristic steps including pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. We named this method sctransform.”

Steps:

  1. Estimate size factors using a regularized negative binomial regression model

  2. Scale the counts by dividing the raw counts by the estimated size factors

  3. Apply a variance stabilizing transformation to the scaled counts by calculating Pearson residuals from the negative binomial regression model

SCTransform

Pearson residual = (Observed value – Expected value) / √(Expected variance)

If a gene has exactly the expression level we’d expect based on the model → residual = 0

If a gene is expressed higher than expected → positive residual

If a gene is expressed lower than expected → negative residual

These residuals become our normalised values and they’re useful for several reasons:

  • Standardized scale: All genes end up on the same scale regardless of their expression level

  • Variance stabilization: Unlike raw counts, these residuals have roughly the same variance across all expression levels. The transformation ensures that genes with different expression levels can be compared fairly – the variance no longer depends on the mean expression level.

  • Interpretation: Positive residuals mean higher-than-expected expression; negative means lower-than-expected. A residual of +2 means “this gene is expressed about 2 standard deviations higher than expected”.
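
To make the formula concrete, here is a base R sketch of a Pearson residual under a negative binomial model, where the expected variance is mu + mu^2 / theta. In sctransform the expected value and the dispersion come from the fitted regression model; here they are supplied by hand, and the function name `pearson_residual` and the values `expected = 4`, `theta = 4` are assumptions for illustration.

```r
# Hypothetical helper: Pearson residual under a negative binomial model.
pearson_residual <- function(observed, expected, theta) {
  # Negative binomial variance: mu + mu^2 / theta.
  expected_var <- expected + expected^2 / theta
  (observed - expected) / sqrt(expected_var)
}

expected <- 4  # assumed model expectation for this gene in this cell
theta <- 4     # assumed dispersion parameter

res_equal <- pearson_residual(4, expected, theta)   # observed == expected
res_high  <- pearson_residual(10, expected, theta)  # higher than expected
res_low   <- pearson_residual(0, expected, theta)   # lower than expected

print(c(equal = res_equal, high = res_high, low = res_low))
```

The three cases match the three situations listed above: a zero residual, a positive residual and a negative residual.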

SCTransform

Inside SCTransform there is a ‘selection’ of the most variable genes.

By default these are used for downstream dimensionality reduction and clustering (although in most cases you can change this).

  • Which genes should we use for downstream analysis?

Select genes that capture biologically meaningful variation, while reducing the number of genes that only contribute technical noise.
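
One simple way to sketch this kind of selection is to rank genes by the variance of their normalised values across cells and keep the top few. The base R example below assumes a small hand-made matrix of residuals and an arbitrary cut-off of two genes; `resid_mat`, `n_top` and the values are hypothetical, and real workflows typically keep a few thousand genes.

```r
# Toy matrix of normalised values (e.g. residuals): rows are genes, columns are cells.
resid_mat <- rbind(
  geneA = c(-0.1, 0.1, -0.1, 0.1),
  geneB = c(-2, 2, -2, 2),
  geneC = c(-1, 1, -1, 1),
  geneD = c(0, 0, 0, 0)
)

# Variance of each gene across cells.
gene_var <- apply(resid_mat, 1, var)

# Keep the n_top genes with the highest variance.
n_top <- 2
top_genes <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_top)]

print(gene_var)
print(top_genes)
```

Because the values are already variance-stabilised, a high variance here is less likely to be a side effect of high mean expression.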

Recap

  • We get different total counts for each cell due to technical factors (depth bias)
  • A simplistic library size normalisation (e.g. CPM) removes a large part of this bias
  • However, composition bias causes spurious differences between cells
  • Early methods developed for bulk RNA-seq are not appropriate for sparse scRNA-seq data.
  • The SCTransform method uses a regularized negative binomial regression model to estimate size factors, and applies a variance stabilizing transformation to the scaled counts.
  • This accounts for both sequencing depth (library size) and gene-specific effects simultaneously.