20/09/2022

Outline

  • Motivation

  • Initial methods

  • Deconvolution

  • sctransform

Workflow

Raw UMI counts distribution

Why do UMI counts differ among cells?

  • We derive biological insights downstream by comparing cells against each other.

  • But differences in UMI counts make it harder to compare cells.

  • Why do the total transcript molecules (UMI counts) detected differ between cells?

    • Biological:
      • Cell subtype differences, such as cell size and transcriptional activity
    • Technical: scRNA-seq data is inherently noisy
      • Low mRNA content per cell
      • Cell-to-cell differences in mRNA capture efficiency
      • Variable sequencing depth
      • PCR amplification efficiency

Normalization removes technical differences so that differences between cells are not technical but biological, allowing meaningful comparison of expression profiles between cells.

Effect of Normalization on UMI counts distribution

General principle behind normalisation

  • Normalization has two steps
    1. Scaling
    • First obtain size factors (normalization factors)
    • Usually one size factor per cell
    • Scale the counts by dividing the raw counts of each cell by its cell-specific size factor
    2. Transformation: transform the data after scaling
    • Per million
    • log2
    • Square root transformation
    • Pearson residuals (e.g. sctransform)
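
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not any package's implementation: the function names `library_size_factors` and `normalize` are hypothetical, the counts matrix is made up, and the size factor is assumed to be each cell's library size relative to the mean library size.

```python
import numpy as np

def library_size_factors(counts):
    """One size factor per cell: library size relative to the mean library size."""
    library_size = counts.sum(axis=0)          # total UMI counts per cell
    return library_size / library_size.mean()  # factors average to 1

def normalize(counts, size_factors, transform="log2"):
    """Step 1: scale each cell by its size factor. Step 2: transform."""
    scaled = counts / size_factors             # broadcasts across genes
    if transform == "log2":
        return np.log2(scaled + 1.0)           # pseudocount of 1 keeps zeros finite
    if transform == "sqrt":
        return np.sqrt(scaled)
    raise ValueError(f"unknown transform: {transform}")

# Made-up toy matrix: rows are genes, columns are cells.
# The three cells have the same composition but different depths.
counts = np.array([[4, 8, 2],
                   [10, 20, 5],
                   [6, 12, 3]], dtype=float)

size_factors = library_size_factors(counts)
log_norm = normalize(counts, size_factors, transform="log2")
sqrt_norm = normalize(counts, size_factors, transform="sqrt")
```

Because the toy cells differ only in depth, their normalized profiles come out identical after scaling.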

Normalization toy example

  • CPM: Counts-per-million
  • One of the initial normalization methods for counts data
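
As a hedged toy example of CPM, the sketch below uses a made-up matrix of two cells in which the second cell was sequenced twice as deeply as the first. The numbers are illustrative only.

```python
import numpy as np

# Made-up counts: rows are genes, columns are cells (cell A, cell B).
counts = np.array([[10, 20],
                   [30, 60],
                   [60, 120]], dtype=float)

library_size = counts.sum(axis=0)      # total counts per cell
cpm = counts / library_size * 1e6      # counts per million for each cell
log_cpm = np.log2(cpm + 1.0)           # optional log2 transform with a pseudocount
```

After CPM scaling every cell sums to one million, so the two cells, which differ only in depth, end up with the same profile.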

Bulk RNAseq methods are not suitable for scRNAseq data

  • Some bulk RNA-seq normalization methods:
    • Library size normalization
    • DESeq
    • edgeR-TMM
  • Although bulk RNA-seq data also suffer from technical variability, the extreme sparsity of scRNA-seq data makes these bulk-based methods unsuitable.

Comparison of bulk methods

scRNAseq specific normalization methods

  • Many scRNA-seq-specific methods have been developed, including:
    • Deconvolution
    • sctransform
  • These methods overcome sparsity by pooling information: deconvolution across cells, sctransform across genes.

Deconvolution

Deconvolution strategy (Lun et al., 2016):

Steps:

  • compute scaling factors by pooling cells
  • apply scaling factors to get scaled data
  • log2 transform the data
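
The steps above can be illustrated with a heavily simplified sketch. It is not the scran implementation: the function name `pooled_size_factors`, the sliding-window pools on a ring of cells ordered by library size, the average pseudo-cell used as reference, and the toy data are all assumptions made for illustration. Pooled counts are less sparse than single cells, so a median-ratio factor is estimated per pool and a linear system is then solved to recover one factor per cell.

```python
import numpy as np

def pooled_size_factors(counts, pool_sizes=(2, 3)):
    """Simplified pooling-and-deconvolution sketch (genes x cells input)."""
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[1]
    reference = counts.mean(axis=1)            # average pseudo-cell
    keep = reference > 0                       # genes usable for ratios
    order = np.argsort(counts.sum(axis=0))     # ring of cells by library size

    rows, pool_factors = [], []
    for size in pool_sizes:
        for start in range(n_cells):
            members = order[(start + np.arange(size)) % n_cells]
            pooled = counts[:, members].sum(axis=1)
            # Pool-level factor: median ratio of pooled counts to the reference
            pool_factors.append(np.median(pooled[keep] / reference[keep]))
            indicator = np.zeros(n_cells)
            indicator[members] = 1.0           # which cells are in this pool
            rows.append(indicator)

    # Each pool factor is modelled as the sum of its cells' factors;
    # least squares deconvolves the pool factors into per-cell factors.
    design = np.vstack(rows)
    factors, *_ = np.linalg.lstsq(design, np.array(pool_factors), rcond=None)
    return factors / factors.mean()            # centre factors at 1

# Made-up data: identical composition, different depth per cell.
base = np.array([10.0, 20.0, 5.0, 40.0])
depth = np.array([0.5, 1.0, 1.5, 2.0, 0.8, 1.2])
counts = np.outer(base, depth)

size_factors = pooled_size_factors(counts)
log_norm = np.log2(counts / size_factors + 1.0)  # scale, then log2 transform
```

On this noise-free toy matrix the recovered factors match the simulated depths up to a constant. Real data would additionally need sparsity handling and sanity checks on the estimated factors, which this sketch omits.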

sctransform

  • All of the above normalization methods use a single factor per cell to normalize every gene in that cell
  • A single scaling factor does not effectively normalize both lowly and highly expressed genes
  • Any remaining correlation between total UMI counts and gene UMI counts indicates technical variability

sctransform

Algorithm

  • Expression of a gene is modeled by a negative binomial random variable with a mean that depends on library size
  • Library size is used as the independent variable in a regression model
  • The model is first fit for each gene separately; information is then combined across genes to regularise the parameter estimates
  • Transform UMI counts to Pearson residuals (the number of standard deviations away from the expected mean).
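
The Pearson residual step can be sketched as follows. This is not sctransform's regularised regression: as a stated simplification, the expected mean is taken to be proportional to the gene total and the cell's library size, and a single fixed overdispersion `theta` is assumed for all genes. The function name `pearson_residuals`, the value of `theta`, the clipping range, and the toy data are illustrative assumptions, and every gene and cell is assumed to have at least one count.

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """Pearson residuals under a negative binomial model (genes x cells)."""
    counts = np.asarray(counts, dtype=float)
    gene_total = counts.sum(axis=1, keepdims=True)
    cell_total = counts.sum(axis=0, keepdims=True)   # library size per cell
    # Expected count: mean depends on library size (simplified, no regression)
    mu = gene_total * cell_total / counts.sum()
    # Negative binomial variance with fixed overdispersion theta (assumed)
    variance = mu + mu ** 2 / theta
    residuals = (counts - mu) / np.sqrt(variance)
    # Clip extreme residuals; sqrt(number of cells) is one common choice
    clip = np.sqrt(counts.shape[1])
    return np.clip(residuals, -clip, clip)

# Made-up data: cells differ only in depth, so nothing is unexpected.
base = np.array([10.0, 20.0, 5.0, 40.0])
depth = np.array([1.0, 2.0, 3.0, 4.0])
counts = np.outer(base, depth)
residuals = pearson_residuals(counts)

# Add extra molecules of gene 0 to cell 0: its residual becomes positive.
perturbed = counts.copy()
perturbed[0, 0] += 50
residuals_perturbed = pearson_residuals(perturbed)
```

When counts are fully explained by depth the residuals are zero; a gene that is expressed above its expected level in a cell gets a positive residual.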

sctransform

Example of the transformation outcome for two genes:

  • UMI counts and Pearson residuals plotted against library size
  • Expected UMI counts shown in pink

Is normalization working?

  • What are the signs that the normalization method is working?
    • Using PCA: library size should not be a major contributor to the top few principal components.
    • Does clustering make biological sense?
    • Take a look at the plots of cell UMI counts versus gene UMI counts
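
The PCA check above can be sketched by correlating library size with the scores of the top principal components before and after normalization. The helper name `depth_pc_correlation` and the simulated Poisson counts are assumptions for illustration only; the simulated cells share one expression profile and differ only in depth.

```python
import numpy as np

def depth_pc_correlation(expr, library_size, n_pcs=3):
    """Absolute correlation of library size with the top PC scores."""
    x = np.asarray(expr, dtype=float).T        # cells x genes
    x = x - x.mean(axis=0)                     # centre each gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_pcs] * s[:n_pcs]          # PC scores per cell
    return np.array([
        abs(np.corrcoef(scores[:, k], library_size)[0, 1])
        for k in range(n_pcs)
    ])

# Simulated data (illustrative only): 50 genes, 200 cells, depth varies 4-fold.
rng = np.random.default_rng(0)
base = rng.uniform(20, 100, size=50)
depth = rng.uniform(0.5, 2.0, size=200)
counts = rng.poisson(np.outer(base, depth)).astype(float)

library_size = counts.sum(axis=0)
size_factors = library_size / library_size.mean()

raw_corr = depth_pc_correlation(np.log2(counts + 1.0), library_size)
norm_corr = depth_pc_correlation(np.log2(counts / size_factors + 1.0), library_size)
```

A first component that tracks library size closely in the raw data but not in the normalized data is one sign that the normalization is doing its job.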

Recap

  • Early methods developed for bulk RNA-seq are not appropriate for sparse scRNA-seq data.

  • The deconvolution method draws information from pools of cells to derive cell-based scaling factors that account for composition bias.

  • The sctransform method uses sequencing depth and information across genes to stabilise expression variance across the expression range.