Motivation
Initial methods
Deconvolution
sctransform
July 2021
Systematic differences in sequencing coverage between libraries occur because of:
Normalisation aims to remove these technical differences, so that the differences remaining between cells reflect biology and expression profiles can be compared meaningfully between cells.
Normalisation and batch correction have different aims:
Normalisation addresses technical differences only, while batch correction considers both technical and biological differences.
Sources: chapters on normalisation in the OSCA book, the Hemberg group materials and sctransform.
In scaling normalization, the “normalization factor” is an estimate of the library size relative to the other cells.
Steps usually include:
Assumption: any cell-specific bias affects all genes in the same way.
CPM: convert raw counts to counts-per-million (CPM)
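The CPM conversion can be sketched in a few lines. This is a minimal illustration, assuming a dense NumPy matrix with cells as rows and genes as columns (an orientation chosen here, not stated in the slides); the function name `counts_per_million` is hypothetical.

```python
import numpy as np

def counts_per_million(counts):
    """Scale each cell (row) so that its counts sum to one million."""
    counts = np.asarray(counts, dtype=float)
    library_size = counts.sum(axis=1, keepdims=True)  # total counts per cell
    # Cells with no counts are left as zeros instead of dividing by zero.
    return np.divide(counts * 1e6, library_size,
                     out=np.zeros_like(counts), where=library_size > 0)

# Toy matrix: 2 cells x 3 genes
counts = np.array([[10, 0, 90],
                   [1, 1, 2]])
cpm = counts_per_million(counts)
```

Every non-empty cell ends up with the same total, which is exactly the assumption that the library size captures all cell-specific bias.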
DESeq’s size factor
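A rough sketch of the median-of-ratios idea behind DESeq's size factor follows. It assumes cells as rows and genes as columns, and the function name `median_ratio_size_factors` is hypothetical; it is not the DESeq implementation. Each cell is compared with a pseudo-reference built from per-gene geometric means, and the median ratio is taken as the size factor.

```python
import numpy as np

def median_ratio_size_factors(counts):
    """Median-of-ratios size factors against a geometric-mean reference."""
    counts = np.asarray(counts, dtype=float)
    # Only genes detected in every cell have a finite log geometric mean.
    usable = (counts > 0).all(axis=0)
    if not usable.any():
        raise ValueError("no gene is detected in every cell")
    log_counts = np.log(counts[:, usable])
    log_reference = log_counts.mean(axis=0)  # log geometric mean per gene
    # Size factor: median ratio of a cell's counts to the reference.
    return np.exp(np.median(log_counts - log_reference, axis=1))

# Three cells sequenced at 1x, 2x and 4x the same underlying profile
profile = np.array([10, 20, 30, 40])
counts = np.outer([1, 2, 4], profile)
size_factors = median_ratio_size_factors(counts)
```

The `ValueError` branch shows where this approach breaks down: in sparse single-cell data there may be few or no genes with a non-zero count in every cell.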
Deconvolution strategy (Lun et al., 2016):
Steps:
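The pooling idea can be illustrated with a simplified sketch. It is not the published implementation: the function name `pooled_size_factors`, the window sizes and the plain least-squares solve are assumptions made for illustration. Cells are ordered by library size and arranged on a ring, overlapping pools of neighbouring cells are summed, each pool is normalised against an average pseudo-cell, and the pool-level factors are deconvolved into cell-level factors by solving a linear system.

```python
import numpy as np

def pooled_size_factors(counts, window_sizes=(3, 4)):
    """Estimate per-cell size factors from overlapping pools of cells."""
    counts = np.asarray(counts, dtype=float)  # cells x genes (assumed layout)
    n_cells = counts.shape[0]
    if max(window_sizes) > n_cells:
        raise ValueError("window size exceeds the number of cells")
    reference = counts.mean(axis=0)           # average pseudo-cell
    keep = reference > 0                      # ignore genes never detected
    order = np.argsort(counts.sum(axis=1))    # ring ordered by library size

    rows, pool_factors = [], []
    for size in window_sizes:
        for start in range(n_cells):
            members = order[(start + np.arange(size)) % n_cells]
            pooled = counts[members].sum(axis=0)
            # Pool-level factor: median ratio of pooled counts to the reference.
            pool_factors.append(np.median(pooled[keep] / reference[keep]))
            row = np.zeros(n_cells)
            row[members] = 1.0
            rows.append(row)

    # Each pool factor is modelled as the sum of its members' cell factors.
    factors, *_ = np.linalg.lstsq(np.array(rows), np.array(pool_factors),
                                  rcond=None)
    return factors / factors.mean()           # centre the factors at 1

true_factors = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
counts = np.outer(true_factors, [5, 10, 20, 40])  # noise-free toy data
estimated = pooled_size_factors(counts)
```

Summing across a pool reduces the number of zeros, so the median ratio is more stable for a pool than for a single cell. Using more than one window size helps keep the linear system well determined.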
With scaling normalisation, a correlation remains between the mean and the variance of expression (heteroskedasticity).
This affects downstream dimensionality reduction, as the first few dimensions are usually correlated with library size.
sctransform addresses the issue by:
Variables
model the expression of each gene as a negative binomial random variable whose mean depends on other variables
these variables model the differences in sequencing depth between cells
and are used as independent variables in a regression model
Regression
fit model parameters for each gene
combine data across genes, using the relationship between gene mean and parameter values, to regularise the per-gene parameter estimates
transform each observed UMI count into a Pearson residual
the residuals are expected to have a mean of 0 and a stable variance across the range of expression
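The residual step can be sketched as follows. This is a deliberately simplified stand-in, not sctransform itself: instead of fitting and regularising a regression per gene, it takes the expected count from a depth-only model and uses one fixed overdispersion `theta` for all genes. The function name `pearson_residuals`, the value `theta=100` and the cells-as-rows layout are assumptions made for illustration.

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """Pearson residuals under a negative binomial, depth-only model."""
    counts = np.asarray(counts, dtype=float)        # cells x genes
    depth = counts.sum(axis=1, keepdims=True)       # sequencing depth per cell
    gene_total = counts.sum(axis=0, keepdims=True)  # total counts per gene
    mu = depth * gene_total / counts.sum()          # expected count per entry
    sd = np.sqrt(mu + mu**2 / theta)                # NB standard deviation
    # Genes with no counts have sd == 0; their residuals are set to 0.
    return np.divide(counts - mu, sd, out=np.zeros_like(mu), where=sd > 0)

# Toy data in which depth explains all variation: residuals are all zero
counts = np.outer([1, 2, 3], [10, 20])
residuals = pearson_residuals(counts)
```

Dividing by the model standard deviation is what makes the variance comparable between lowly and highly expressed genes. In the toy data every count equals its expected value, so nothing is left once depth is accounted for.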
Example of the transformation outcome for two genes:
Early methods developed for bulk RNA-seq are not appropriate for sparse scRNA-seq data.
The deconvolution method draws information from pools of cells to derive cell-based scaling factors.
The sctransform method uses sequencing depth and information across genes to stabilise expression variance across the expression range.