05/12/2024

Outline

  • Motivation
  • Biases
    • Depth bias
    • Composition bias
    • Mean-variance correlation
  • Normalisation strategies
  • Deconvolution

Workflow

Raw UMI counts distribution

Why do UMI counts differ among the cells?

  • We derive biological insights downstream by comparing cells against each other.

  • But differences in UMI counts make it harder to compare cells.

  • Why does the total number of transcript molecules (UMI counts) detected differ between cells?

    • Biological:
      • Cell subtype differences - size and transcriptional activity, variation in gene expression
    • Technical: scRNA-seq data is inherently noisy
      • Low mRNA content per cell
      • Cell-to-cell differences in mRNA capture efficiency
      • Variable sequencing depth
      • PCR amplification efficiency

Normalization reduces technical differences so that the remaining differences between cells reflect biology rather than technical effects, allowing meaningful comparison of expression profiles between cells.

Depth bias

Depth bias: Read differences between cells

Simple library size normalization accounts for the depth bias
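As a minimal sketch of library size normalization in base R: the toy count matrix below is made up for illustration, and the variable names (`lib_size`, `size_factor`, `norm_counts`) are assumptions, not taken from the course material. The three cells have identical expression profiles sequenced at different depths.

```r
# Toy count matrix: rows are genes, columns are cells (values are made up).
counts <- matrix(
  c(10, 20, 40,
     5, 10, 20,
     5, 10, 20),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3))
)

# Library size: total counts per cell.
lib_size <- colSums(counts)

# Size factors: library sizes rescaled so that their mean is 1.
size_factor <- lib_size / mean(lib_size)

# Divide every cell (column) by its size factor.
norm_counts <- sweep(counts, 2, size_factor, "/")

print(round(norm_counts, 2))
```

After scaling, each gene has the same value in all three cells, because the only difference between the cells in this toy example was depth.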

Composition bias

  • A few genes contribute most of the read counts
  • In this example, the total read counts are the same across the cells
  • Gene 1 contributes 80% of reads in cell2, leaving other genes with fewer read counts.

  • Library size normalization cannot correct composition bias.

  • The deconvolution approach we will use accounts for both depth and composition biases.
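The situation described above can be reproduced with a small base R sketch. The counts are hypothetical, chosen so that both cells have the same total and gene 1 takes 80% of the reads in cell2; the assumption for illustration is that genes 2 to 5 are truly unchanged between the two cells.

```r
# Hypothetical counts: same total per cell, gene1 dominates cell2.
counts <- cbind(
  cell1 = c(gene1 = 20, gene2 = 20, gene3 = 20, gene4 = 20, gene5 = 20),
  cell2 = c(gene1 = 80, gene2 = 5,  gene3 = 5,  gene4 = 5,  gene5 = 5)
)

# Library size factors are identical, so scaling changes nothing.
lib_size <- colSums(counts)
size_factor <- lib_size / mean(lib_size)
norm_counts <- sweep(counts, 2, size_factor, "/")

# Apparent fold change, cell2 versus cell1, after library size normalization.
fold_change <- norm_counts[, "cell2"] / norm_counts[, "cell1"]
print(size_factor)
print(fold_change)
```

Genes 2 to 5 appear four-fold lower in cell2 even though, under the stated assumption, only gene 1 changed. Library size factors cannot see this because the totals are equal.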

Mean-variance correlation

Mean and variance of raw counts for genes are correlated

More highly expressed genes tend to look more variable because the variance of counts grows with their magnitude

A gene expressed at a low level tends to have a low variance across cells:

var(c(2,4,2,4,2,4,2,4)) = 1.14

A gene with the same proportional differences between cells, but expressed at a higher level will have higher variance:

var(c(20,40,20,40,20,40,20,40)) = 114.29

If we take the logs of the expression values, the variances are the same for both genes:

var(log(c(2,4,2,4,2,4,2,4))) = 0.14

var(log(c(20,40,20,40,20,40,20,40))) = 0.14

This “variance-stabilising transformation” helps to remove the correlation between mean and variance.
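The two toy genes above differ only by a constant factor of 10, which is why the raw variances differ by a factor of 100 while the log variances agree. A short base R check of this reasoning, using the same numbers (the names `low` and `high` are just labels for this sketch):

```r
# Same toy values as above: 'high' is 'low' multiplied by a constant.
low  <- c(2, 4, 2, 4, 2, 4, 2, 4)
high <- 10 * low

# Multiplying by k multiplies the variance by k^2 (here 10^2 = 100).
raw_ratio <- var(high) / var(low)

# log(k * x) = log(k) + log(x): adding a constant leaves the variance unchanged.
log_var_low  <- var(log(low))
log_var_high <- var(log(high))

print(raw_ratio)
print(round(c(log_var_low, log_var_high), 2))
```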

General principle behind normalisation

Normalization has two steps

  1. Scaling
    • Calculate size factors (normalization factors) that represent the relative depth bias in each cell
    • Scale the counts for each gene in each cell by dividing the raw counts by the cell-specific size factor
  2. Transformation: Transform the data after scaling
    • Per million (e.g. CPM)
    • log2 (e.g. Deconvolution)
    • Pearson residuals (e.g. sctransform)
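The two steps can be sketched in base R as follows. The counts and size factors are hypothetical, and the pseudocount of 1 added before the log2 transform is an assumption of this sketch (a common choice to avoid taking the log of zero), not something stated in the slides.

```r
# Hypothetical raw counts (genes x cells).
counts <- matrix(
  c(0, 3, 12,
    4, 8, 30,
    1, 0,  6),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3))
)

# Hypothetical size factors, one per cell, with mean 1.
size_factor <- c(cell1 = 0.5, cell2 = 1.0, cell3 = 1.5)

# Step 1, scaling: divide each cell's counts by its size factor.
scaled <- sweep(counts, 2, size_factor, "/")

# Step 2, transformation: log2 with a pseudocount of 1 (assumed here).
log_norm <- log2(scaled + 1)

print(round(log_norm, 2))
```

With this pseudocount, a raw count of zero stays at zero after transformation.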

Bulk RNA-seq methods are not suitable for scRNA-seq data

CPM: convert raw counts to counts-per-million (CPM)

  • computed separately for each cell
  • by dividing counts by the library size, then multiplying by 1,000,000
  • does not address composition bias caused by highly expressed genes that are also differentially expressed between cells.
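A minimal base R sketch of the CPM calculation described above. The helper name `counts_per_million` and the toy matrix are hypothetical; packages such as edgeR provide their own CPM functions, which this sketch does not call.

```r
# Hypothetical helper: convert a genes x cells count matrix to CPM.
counts_per_million <- function(counts) {
  lib_size <- colSums(counts)               # total counts per cell
  sweep(counts, 2, lib_size, "/") * 1e6     # proportion of library, times 1,000,000
}

# Made-up counts for three cells.
counts <- matrix(
  c(10, 20, 40,
     5, 10, 20,
     5, 70, 20),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3))
)

cpm_values <- counts_per_million(counts)
print(colSums(cpm_values))
```

Every cell sums to one million after the conversion, whatever its composition, which is why CPM alone cannot reveal composition bias.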

DESeq’s size factor

  • For each gene, compute geometric mean across cells
  • For each cell
    • compute for each gene the ratio of its expression to its geometric mean,
    • derive the cell’s size factor as the median ratio across genes.
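  • Not suitable for sparse scRNA-seq data: a gene's geometric mean is zero as soon as one cell has a zero count, so only genes detected in every cell can be used, and in sparse data few or none remain.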
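The median-of-ratios steps above can be sketched in base R. This is an illustrative re-implementation under simplifying assumptions, not the DESeq code itself; the function name `median_ratio_size_factors` and both toy matrices are hypothetical.

```r
# Hypothetical helper: median-of-ratios size factors for a genes x cells matrix.
median_ratio_size_factors <- function(counts) {
  log_counts <- log(counts)                  # log(0) is -Inf
  log_geo_mean <- rowMeans(log_counts)       # -Inf for any gene with a zero count
  usable <- is.finite(log_geo_mean)          # genes detected in every cell
  if (!any(usable)) {
    stop("every gene has a zero count in at least one cell")
  }
  # Per cell: median over usable genes of count / geometric mean.
  apply(log_counts[usable, , drop = FALSE], 2, function(cell) {
    exp(median(cell - log_geo_mean[usable]))
  })
}

# Dense toy data: three cells with the same profile at depths 1, 2 and 4.
dense <- outer(c(10, 20, 30, 40), c(1, 2, 4))
dense_sf <- median_ratio_size_factors(dense)
print(dense_sf)

# Sparse toy data: every gene has a zero in some cell, so no gene is usable.
sparse <- matrix(c(0, 3, 5,
                   2, 0, 4,
                   1, 6, 0), nrow = 3, byrow = TRUE)
sparse_result <- try(median_ratio_size_factors(sparse), silent = TRUE)
print(inherits(sparse_result, "try-error"))
```

On the dense matrix the size factors recover the relative depths; on the sparse matrix the calculation has no genes left to work with.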

Bulk RNA-seq normalization methods fail for scRNA-seq data

Deconvolution

Deconvolution strategy (Lun et al. 2016):

Steps:

  • compute scaling factors by pooling cells
  • apply scaling factors to get scaled data
  • log2 transform the data
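To make the pooling idea concrete, here is a heavily simplified base R sketch on simulated data. It is not the published implementation (in practice the scran package is typically used for this). The simulation settings, the pool sizes, the helper `make_pools` and the pseudocount of 1 are all assumptions for illustration, and refinements such as pre-clustering cells or ordering them by library size are omitted.

```r
set.seed(1)
n_genes <- 500
n_cells <- 20

# Simulated ground truth (hypothetical): per-cell scaling and per-gene mean.
true_sf <- seq(0.5, 2, length.out = n_cells)
gene_mean <- runif(n_genes, min = 10, max = 100)
counts <- matrix(
  rpois(n_genes * n_cells, lambda = outer(gene_mean, true_sf)),
  nrow = n_genes
)

# Reference pseudo-cell: average profile across all cells.
reference <- rowMeans(counts)

# Pools: sliding windows of several sizes around a ring of cells.
make_pools <- function(n_cells, sizes) {
  pools <- list()
  for (size in sizes) {
    for (start in seq_len(n_cells)) {
      members <- ((start - 1 + seq_len(size) - 1) %% n_cells) + 1
      pools[[length(pools) + 1]] <- members
    }
  }
  pools
}
pools <- make_pools(n_cells, sizes = c(3, 4, 5))

# Pool membership matrix: one row per pool, one column per cell.
A <- t(vapply(pools, function(p) as.numeric(seq_len(n_cells) %in% p),
              numeric(n_cells)))

# Pool-level factor: median ratio of summed pool counts to the reference.
b <- vapply(pools, function(p) {
  pooled <- rowSums(counts[, p, drop = FALSE])
  keep <- reference > 0
  median(pooled[keep] / reference[keep])
}, numeric(1))

# Deconvolve: least-squares solution of A %*% sf = b, centred to mean 1.
est_sf <- qr.solve(A, b)
est_sf <- est_sf / mean(est_sf)

# Apply the scaling factors and log2 transform (pseudocount of 1 assumed).
log_norm <- log2(sweep(counts, 2, est_sf, "/") + 1)

print(round(cor(est_sf, true_sf), 3))
```

Summing counts across a pool reduces the number of zeros, so the median ratio is stable at the pool level; the linear system then splits each pool factor back into per-cell factors.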

Recap

  • We get different total counts for each cell due to technical factors (depth bias)
  • A simplistic library size normalisation (e.g. CPM) removes a large part of this bias
  • However, composition bias causes spurious differences between cells
  • Early methods developed for bulk RNA-seq are not appropriate for sparse scRNA-seq data.
  • The deconvolution method draws information from pools of cells to derive cell-based scaling factors that account for composition bias in scRNA-seq data.

In the demonstration and exercises we will see the effect of deconvolution on the data.