June 2022

Outline

  • Motivation

  • Initial methods

  • Deconvolution

  • sctransform

Workflow

Motivation

Systematic differences in sequencing coverage between libraries occur because of:

  • low input material,
  • differences in cDNA capture,
  • differences in PCR amplification.

Normalisation aims to remove these technical differences so that the remaining differences between cells are biological, allowing meaningful comparison of expression profiles between cells.

Normalisation and batch correction have different aims:

Normalisation addresses technical differences between cells only, while batch correction considers both technical and biological differences between batches.

Sources: chapters on normalisation in the OSCA book, the Hemberg group materials and sctransform.

Initial methods

  • In scaling normalization, the “normalization factor” is an estimate of a cell’s library size relative to that of the other cells.

  • Steps usually include:

    • computation of a cell-specific ‘scaling’ or ‘size’ factor
      • that represents the relative bias in that cell
    • division of all counts for the cell by that factor to remove that bias.
  • Assumption: any cell-specific bias affects all genes in the same way.
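
The two steps above can be sketched in a few lines. This is a minimal illustration only, assuming a genes x cells NumPy array of raw counts and using the library size as the bias estimate; the function names `library_size_factors` and `scale_counts` are hypothetical and not taken from any package.

```python
import numpy as np

def library_size_factors(counts):
    """Return one size factor per cell (column), scaled to have mean 1."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)          # total counts per cell
    return lib_sizes / lib_sizes.mean()     # relative to the average cell

def scale_counts(counts, size_factors):
    """Divide every count of a cell by that cell's size factor."""
    return np.asarray(counts, dtype=float) / size_factors  # broadcasts over columns

# Toy data: 3 genes (rows) x 3 cells (columns), same profile at different depths
counts = np.array([[10, 20,  5],
                   [ 0,  4,  1],
                   [30, 56, 14]])
sf = library_size_factors(counts)
normalised = scale_counts(counts, sf)
```

After scaling, all cells in the toy matrix have the same total, which is what removing a bias that affects all genes equally amounts to.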

Examples

CPM: convert raw counts to counts-per-million (CPM)

  • for each cell, divide counts by the library size, then multiply by 1,000,000.
  • does not address compositional bias caused by highly expressed genes that are also differentially expressed between cells.
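
A small sketch of CPM, assuming a genes x cells NumPy array in which every cell has at least one count; the function name `cpm` is chosen here for illustration. The toy data also shows the compositional bias: the first three genes have identical counts in both cells, but their CPM values differ because the fourth gene is strongly up in the second cell.

```python
import numpy as np

def cpm(counts):
    """Counts-per-million for a genes x cells matrix of raw counts."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0, keepdims=True)  # library size per cell
    return counts / lib_sizes * 1e6

# 4 genes x 2 cells; only the last gene differs between the cells
counts = np.array([[100, 100],
                   [100, 100],
                   [100, 100],
                   [100, 700]])
cpm_values = cpm(counts)
# Gene 1: 250,000 CPM in the first cell but 100,000 CPM in the second,
# although its raw count is the same in both.
```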

DESeq’s size factor

  • For each gene, compute geometric mean across cells.
  • For each cell
    • compute for each gene the ratio of its expression to its geometric mean,
    • derive the cell’s size factor as the median ratio across genes.
  • Not suitable for sparse scRNA-seq data: the geometric mean is only usable for genes with non-zero counts in every cell, and few or no genes meet that condition.
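
The median-of-ratios steps can be sketched as follows. This is an illustration under assumptions, not DESeq's own code: it takes a genes x cells NumPy array, keeps only genes with non-zero counts in every cell, and the function name `median_ratio_size_factors` is hypothetical.

```python
import numpy as np

def median_ratio_size_factors(counts):
    """Median-of-ratios size factors for a genes x cells count matrix."""
    counts = np.asarray(counts, dtype=float)
    usable = (counts > 0).all(axis=1)       # genes with no zero in any cell
    if not usable.any():
        raise ValueError("no gene has non-zero counts in every cell")
    log_counts = np.log(counts[usable])
    # log of the per-gene geometric mean across cells
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # per-cell median of the gene-wise log ratios, back on the count scale
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Dense toy data: the second cell is sequenced twice as deeply as the first
dense = np.array([[10, 20],
                  [100, 200],
                  [5, 10]])
sf_dense = median_ratio_size_factors(dense)

# Sparse toy data: every gene has a zero in some cell, so no gene is usable
sparse = np.array([[0, 3],
                   [2, 0],
                   [0, 1]])
```

On the sparse matrix the function raises an error, which is the failure mode described in the last bullet above.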

Deconvolution

Deconvolution strategy (Lun et al., 2016):

Steps:

  • compute cell-based scaling factors from pools of cells,
  • apply the scaling factors.
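
The idea of pooling can be illustrated with a heavily simplified sketch. It is not the published algorithm or the scran implementation: it assumes a genes x cells NumPy array, builds sliding-window pools over cells ordered by library size, estimates each pool's factor as a median ratio against an average reference profile, and recovers per-cell factors by least squares. The function name `pooled_size_factors` and the pool sizes are hypothetical choices.

```python
import numpy as np

def pooled_size_factors(counts, pool_sizes=(3, 4, 5)):
    """Simplified pool-based size factors for a genes x cells count matrix."""
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[1]
    if n_cells < max(pool_sizes):
        raise ValueError("need at least as many cells as the largest pool")
    reference = counts.mean(axis=1)            # average pseudo-cell
    keep = reference > 0                       # genes detected at least once
    order = np.argsort(counts.sum(axis=0))     # cells ordered by library size
    rows, pool_factors = [], []
    for size in pool_sizes:
        for start in range(n_cells):
            # sliding window over the ordered cells, wrapping around the end
            members = order[(start + np.arange(size)) % n_cells]
            pooled = counts[:, members].sum(axis=1)
            row = np.zeros(n_cells)
            row[members] = 1.0
            rows.append(row)
            # pool factor: median ratio of pooled profile to the reference
            pool_factors.append(np.median(pooled[keep] / reference[keep]))
    # each pool factor is modelled as the sum of its cells' factors
    design = np.vstack(rows)
    sf, *_ = np.linalg.lstsq(design, np.array(pool_factors), rcond=None)
    return sf / sf.mean()

# Toy data: one expression profile scaled by a different bias in each cell
gene_means = np.array([5.0, 50.0, 1.0, 20.0, 8.0])
true_bias = np.array([0.5, 1.0, 2.0, 1.5, 0.8, 1.2])
counts = np.outer(gene_means, true_bias)
sf = pooled_size_factors(counts)
normalised = counts / sf                       # apply the scaling factors
```

Pooling sums counts over several cells, so the pooled profiles contain far fewer zeros than single cells and the median ratio becomes usable. On real, noisy data a least-squares solution of this kind can return negative or unstable factors, which this sketch does not handle.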

sctransform

With scaling normalisation, a relationship remains between the mean and the variance of expression (heteroskedasticity).

This affects downstream dimensionality reduction, as the first few reduced dimensions are usually correlated with library size.

sctransform addresses the issue by:

  • regressing library size out of raw counts
  • providing residuals to use as normalized and variance-stabilized expression values

sctransform

Variables

  • model the expression of each gene as a negative binomial random variable with a mean that depends on other variables,

  • which model the differences in sequencing depth between cells

  • and are used as independent variables in a regression model.
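
In symbols, a model of this kind can be written as below. This sketch assumes the formulation given in the sctransform publication, with the total UMI count of a cell as the only covariate; the symbol names are chosen here for illustration.

```latex
\begin{aligned}
x_{gc} &\sim \mathrm{NB}(\mu_{gc}, \theta_g), \\
\log \mu_{gc} &= \beta_{0g} + \beta_{1g} \log_{10} m_c, \\
\operatorname{Var}(x_{gc}) &= \mu_{gc} + \frac{\mu_{gc}^{2}}{\theta_g}
\end{aligned}
```

Here $x_{gc}$ is the UMI count of gene $g$ in cell $c$, $m_c$ is the total UMI count of cell $c$ (its sequencing depth), $\beta_{0g}$ and $\beta_{1g}$ are the per-gene intercept and slope, and $\theta_g$ is the per-gene dispersion parameter.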

Regression

  • fit model parameters for each gene

  • combine data across genes, using the relationship between gene mean and parameter values, to regularise the fitted parameters

  • transform each observed UMI count into a Pearson residual

    • ~ number of standard deviations away from the expected mean
  • expect mean of 0 and stable variance across the range of expression
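
The residual step can be sketched as below. This is a simplified stand-in, not sctransform's per-gene regression with regularised parameters: it assumes a genes x cells NumPy array in which every gene and every cell has at least one count, takes the expected count as gene total times cell total divided by the grand total, and uses one fixed dispersion `theta` for all genes. The function name `pearson_residuals` and the value 100 are arbitrary illustrative choices.

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """Negative binomial Pearson residuals under a simplified mean model."""
    counts = np.asarray(counts, dtype=float)
    gene_totals = counts.sum(axis=1, keepdims=True)
    cell_totals = counts.sum(axis=0, keepdims=True)
    # expected count if a gene's share of each cell's depth were constant
    mu = gene_totals * cell_totals / counts.sum()
    # negative binomial standard deviation: sqrt(mu + mu^2 / theta)
    sigma = np.sqrt(mu + mu**2 / theta)
    return (counts - mu) / sigma

# Cells differing only in depth: every count equals its expectation
depth_only = np.array([[10, 20, 40],
                       [ 1,  2,  4],
                       [ 5, 10, 20]])
res_depth_only = pearson_residuals(depth_only)

# Same data with the second gene raised in the first cell
perturbed = depth_only.copy()
perturbed[1, 0] = 8
res_perturbed = pearson_residuals(perturbed)
```

When cells differ only in sequencing depth, all residuals are zero; the raised count in the second matrix shows up as a positive residual for that gene and cell.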

sctransform

Example of the transformation outcome for two genes:

  • UMI counts and Pearson residuals against library size
  • with expected UMI counts in pink

Recap

Early methods developed for bulk RNA-seq are not appropriate for sparse scRNA-seq data.

Simpler, faster scaling methods are adequate for clustering.

The deconvolution method draws information from pools of cells to derive cell-based scaling factors that account for composition bias.

The sctransform method uses sequencing depth and information across genes to stabilise expression variance across the expression range.