- Motivation
- Biases
  - Depth bias
  - Composition bias
  - Mean-variance correlation
- Normalisation strategies
  - Deconvolution
25/01/2023
We derive biological insights downstream by comparing cells against each other.
But differences in UMI counts between cells make it harder to compare them.
Why does the total number of transcript molecules (UMI counts) detected differ between cells?
Normalization reduces technical differences so that the remaining differences between cells are biological rather than technical, allowing meaningful comparison of expression profiles.
Consider two genes, A and B, in two cell types, blue and green.
We normalize here by dividing UMI counts for each gene by the total UMI counts in a cell and multiplying by 100.
There is no differential expression; we have just sequenced twice as deeply in the second cell type.
Simple library size normalization accounts for the depth bias
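The two-gene case can be reproduced with a small base R sketch. The counts below are made up for illustration (the original figure's values are not reproduced here); the only property that matters is that the green cell has the same A:B proportions as the blue cell at twice the depth.

```r
# Illustrative UMI counts (hypothetical values): rows are genes, columns are cells.
counts <- matrix(
  c(10, 30,    # blue cell:  gene A, gene B
    20, 60),   # green cell: same proportions, sequenced twice as deeply
  nrow = 2,
  dimnames = list(c("A", "B"), c("blue", "green"))
)

# Library size = total UMI count per cell.
lib_size <- colSums(counts)

# Divide each cell's counts by its library size and multiply by 100.
norm <- sweep(counts, 2, lib_size, "/") * 100

print(lib_size)  # the green library is twice the size of the blue one
print(norm)      # after scaling, A and B are identical in both cells
```

After scaling, the two columns of `norm` match, which is what "no differential expression" should look like.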
Consider three genes, A, B and C, in two cell types.
Only one gene is differentially expressed (DE), but after library size normalization all three look differentially expressed.
The deconvolution approach we will use accounts for both depth and composition biases.
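A minimal base R sketch of this three-gene case, again with made-up counts: only gene C differs between the cells, yet dividing by the library size shifts A and B as well. The second scaling, by the total of the unchanged genes, is only possible here because we know by construction which genes are not DE; in real data that information is not available.

```r
# Illustrative UMI counts (hypothetical values): only gene C differs.
counts <- matrix(
  c(20, 20, 20,    # blue cell:  genes A, B, C
    20, 20, 80),   # green cell: A and B unchanged, C four times higher
  nrow = 3,
  dimnames = list(c("A", "B", "C"), c("blue", "green"))
)

# Library size normalization: C inflates the green library size,
# so A and B are pushed down in the green cell.
lib_norm <- sweep(counts, 2, colSums(counts), "/") * 100
print(round(lib_norm, 1))

# Scaling by the genes known (in this toy example only) to be unchanged
# leaves A and B equal and shows the difference in C alone.
stable_total <- colSums(counts[c("A", "B"), ])
stable_norm  <- sweep(counts, 2, stable_total, "/") * 100
print(stable_norm)
```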
Mean and variance of raw counts for genes are correlated
More highly expressed genes tend to look more variable because the same proportional differences between cells give larger absolute differences at higher counts
A gene expressed at a low level tends to have a low variance across cells:
var(c(2,4,2,4,2,4,2,4)) = 1.14
A gene with the same proportional differences between cells, but expressed at a higher level will have higher variance:
var(c(20,40,20,40,20,40,20,40)) = 114.29
If we take the logs of the expression values, the variances are the same for both genes:
var(log(c(2,4,2,4,2,4,2,4))) = 0.14
var(log(c(20,40,20,40,20,40,20,40))) = 0.14
This “variance-stabilising transformation” helps to remove the correlation between mean and variance
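The numbers above follow from two simple facts: multiplying every value by a constant k multiplies the variance by k squared, whereas on the log scale the constant becomes an additive shift (log(kx) = log(k) + log(x)), which leaves the variance unchanged. A short base R check, using the same vectors as above (the variable names are just for this sketch):

```r
low  <- c(2, 4, 2, 4, 2, 4, 2, 4)
high <- 10 * low   # same proportional differences, 10 times the expression level

# On the raw scale the variance grows with the square of the scaling factor.
var_low  <- var(low)
var_high <- var(high)
print(var_high / var_low)   # 100, i.e. 10^2

# On the log scale the factor of 10 is a constant shift, so the variances agree.
log_var_low  <- var(log(low))
log_var_high <- var(log(high))
print(c(log_var_low, log_var_high))
```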
Normalization has two steps
CPM: convert raw counts to counts-per-million (CPM)
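As a sketch of the CPM calculation in base R, with a small made-up count matrix (genes in rows, cells in columns): each count is divided by its cell's total and multiplied by one million.

```r
# Illustrative UMI counts (hypothetical values).
counts <- matrix(
  c( 5,  0, 15,    # cell1
    30, 10, 60),   # cell2
  nrow = 3,
  dimnames = list(c("A", "B", "C"), c("cell1", "cell2"))
)

# Counts-per-million: scale each cell (column) so that it sums to 1e6.
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

print(cpm)
print(colSums(cpm))  # every cell now sums to one million
```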
DESeq’s size factor
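DESeq's size factors are based on the median-of-ratios idea: each gene's count is divided by that gene's geometric mean across samples, and the size factor of a sample is the median of these ratios. The base R sketch below illustrates that idea on made-up counts; it is not DESeq's own code. In the toy data, cell2 is sequenced twice as deeply as cell1 and gene D is additionally ten times higher.

```r
# Illustrative counts (hypothetical values): cell2 has twice the depth of cell1,
# and gene D is also strongly up-regulated in cell2.
counts <- matrix(
  c(10, 20, 30,  40,    # cell1
    20, 40, 60, 400),   # cell2
  nrow = 4,
  dimnames = list(c("A", "B", "C", "D"), c("cell1", "cell2"))
)

# Pseudo-reference: geometric mean of each gene across cells.
log_geo_mean <- rowMeans(log(counts))

# Genes with a zero in any cell have a geometric mean of zero and are dropped.
keep <- is.finite(log_geo_mean)

# Ratio of each count to the pseudo-reference, then the median ratio per cell.
ratios       <- counts[keep, , drop = FALSE] / exp(log_geo_mean[keep])
size_factors <- apply(ratios, 2, median)

print(size_factors)
print(size_factors[["cell2"]] / size_factors[["cell1"]])  # depth ratio
print(sum(counts[, "cell2"]) / sum(counts[, "cell1"]))    # library size ratio
```

Because the median ignores the single DE gene, the size factors recover the twofold depth difference, while the library size ratio is inflated by gene D. Note that genes with zeros are discarded, which leaves few usable genes in sparse single-cell data.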
Deconvolution strategy (Lun et al., 2016):
Steps:
In the demonstration and exercises we will see the effect of deconvolution on the data.
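To give a feel for the pooling idea behind deconvolution before the demonstration, here is a heavily simplified base R sketch on simulated counts. Everything in it is an assumption made for illustration: the simulated data, the helper `make_pools`, the two pool sizes and the ring layout. It is not the implementation used in the course materials. Counts are summed over pools of cells, each pool gets a median-ratio size factor against an average reference, and the per-cell factors are then recovered by solving the resulting linear system.

```r
set.seed(1)  # make the toy simulation reproducible

n_genes <- 200
n_cells <- 20

# Simulated truth (assumptions of this sketch, not real data).
true_sf    <- runif(n_cells, 0.5, 2)    # per-cell scaling factor
gene_means <- runif(n_genes, 10, 100)   # per-gene expression level
counts <- matrix(
  rpois(n_genes * n_cells, lambda = outer(gene_means, true_sf)),
  nrow = n_genes
)

# Reference pseudo-cell: average profile across all cells.
reference <- rowMeans(counts)

# Hypothetical helper: pools are sliding windows over cells arranged in a ring.
# Returns a pools x cells indicator matrix.
make_pools <- function(n_cells, size) {
  t(sapply(seq_len(n_cells), function(start) {
    idx <- ((start - 1 + seq_len(size) - 1) %% n_cells) + 1
    as.numeric(seq_len(n_cells) %in% idx)
  }))
}
pool_design <- rbind(make_pools(n_cells, 3), make_pools(n_cells, 7))

# Sum counts within each pool, then take the median ratio to the reference.
pool_counts <- counts %*% t(pool_design)   # genes x pools
pool_sf     <- apply(pool_counts / reference, 2, median)

# Each pool factor is approximately the sum of its cells' factors, so solve
# pool_design %*% cell_sf = pool_sf by least squares.
est_sf <- qr.solve(pool_design, pool_sf)
est_sf <- est_sf / mean(est_sf)            # centre the factors at 1

print(cor(est_sf, true_sf))
```

Pooling is what makes the median ratio usable on sparse data: summed counts contain far fewer zeros than the counts of individual cells.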