May 2023
A few ways our data can be arranged (software-dependent too)
one large SCE object containing many samples
many single-sample SCE objects, QC’d in isolation
multiple large SCE objects with multiple samples
It is important to make sure things match up before combining objects
Different Bioconductor versions
Different analysts may have formatted things differently
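A minimal sketch of the kind of check meant here, assuming each sample is represented only by its list of gene identifiers and its list of metadata column names. The function names `common_genes` and `mismatched_columns` are hypothetical and are not part of any Bioconductor package; this is language-agnostic logic, not the course's own code.

```python
def common_genes(gene_lists):
    """Return genes present in every sample, in the order of the first sample."""
    shared = set(gene_lists[0]).intersection(*gene_lists[1:])
    return [g for g in gene_lists[0] if g in shared]


def mismatched_columns(metadata_columns):
    """Return metadata column names that are missing from at least one sample."""
    all_cols = set().union(*metadata_columns)
    shared = set(metadata_columns[0]).intersection(*metadata_columns[1:])
    return sorted(all_cols - shared)


# Hypothetical example: two samples annotated slightly differently.
genes = common_genes([["GeneA", "GeneB", "GeneC"], ["GeneC", "GeneA", "GeneD"]])
odd_columns = mismatched_columns([["Sample", "Batch"], ["Sample", "batch"]])
```

Subsetting every object to the shared genes and resolving the mismatched columns before merging avoids silent misalignment later on.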
cellranger aggr
A useful quick look
Gaussian/Linear Regression - removeBatchEffect (limma), ComBat (sva), rescaleBatches or regressBatches (batchelor)
Mutual Nearest Neighbours (MNN) correction - Haghverdi et al. 2018
mnnCorrect (batchelor)
fastMNN (batchelor)
And many more!
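To make the linear regression family concrete, here is a minimal sketch for a single gene, assuming log-expression values and one batch label per cell. It removes each batch's mean shift, which is equivalent to taking residuals from a linear model with batch as the only factor, and then adds the grand mean back. The function name `center_by_batch` is hypothetical; this is not the implementation used by limma, sva or batchelor.

```python
from collections import defaultdict


def center_by_batch(values, batches):
    """Remove per-batch mean shifts from one gene's log-expression values."""
    groups = defaultdict(list)
    for value, batch in zip(values, batches):
        groups[batch].append(value)
    # Mean expression of this gene within each batch.
    batch_mean = {b: sum(vs) / len(vs) for b, vs in groups.items()}
    # Add the grand mean back so values stay on the original scale.
    grand_mean = sum(values) / len(values)
    return [v - batch_mean[b] + grand_mean for v, b in zip(values, batches)]


corrected = center_by_batch([1.0, 3.0, 5.0, 7.0], ["a", "a", "b", "b"])
```

A correction of this kind treats any difference in batch means as technical, so it implicitly assumes the batches have the same cell population composition.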
Different methods have different strengths and weaknesses
Benchmark studies can be used as a reference to choose a suitable method
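The core idea behind the MNN methods can be sketched as follows: a cell in one batch and a cell in another form an MNN pair when each is among the other's k nearest neighbours. This is a brute-force illustration on toy coordinates, assuming plain Euclidean distance; the names `nearest` and `mnn_pairs` are hypothetical, and the real batchelor functions add further steps (for example cosine normalisation and computing correction vectors from the pairs) that are not shown.

```python
import math


def nearest(query, reference, k):
    """For each query cell, return the indices of its k nearest reference cells."""
    result = []
    for q in query:
        order = sorted(range(len(reference)),
                       key=lambda j: math.dist(q, reference[j]))
        result.append(set(order[:k]))
    return result


def mnn_pairs(batch1, batch2, k=1):
    """Return (i, j) pairs where cell i of batch1 and cell j of batch2
    are in each other's k nearest neighbours."""
    nn_1_in_2 = nearest(batch1, batch2, k)
    nn_2_in_1 = nearest(batch2, batch1, k)
    return [(i, j)
            for i, neighbours in enumerate(nn_1_in_2)
            for j in sorted(neighbours)
            if i in nn_2_in_1[j]]


# Toy data: the third cell of batch2 has no mutual partner in batch1.
pairs = mnn_pairs([(0, 0), (10, 10)], [(1, 0), (11, 10), (4, 5)])
```

Cells without a mutual partner, such as a population found in only one batch, do not contribute pairs, which is why the approach can tolerate differences in composition between batches.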
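Assumptions (from the paper):
There is at least one cell population that is present in both batches
The batch effect is almost orthogonal to the biological subspace
Batch-effect variation is much smaller than the biological-effect variation between different cell types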
We can look at the ‘mixing’ between batches and calculate the variance in the log-normalized cell abundances across batches for each cluster.
Clusters are ranked by variance for manual inspection.
If the variance is too high, it could indicate that the correction was insufficient.
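The mixing diagnostic described above can be sketched as follows, assuming one cluster label and one batch label per cell. Cell counts per cluster and batch are scaled by a per-batch size factor, log2-transformed with a pseudo-count, and the variance across batches is taken for each cluster. The function name `abundance_variance` and the pseudo-count of 10 are assumptions for illustration, not values taken from the source.

```python
import math
from collections import Counter


def abundance_variance(clusters, batches, pseudo_count=10.0):
    """Rank clusters by the variance of their log-normalised abundance across batches."""
    counts = Counter(zip(clusters, batches))
    batch_totals = Counter(batches)
    batch_names = sorted(batch_totals)
    mean_total = sum(batch_totals.values()) / len(batch_names)
    result = {}
    for cluster in sorted(set(clusters)):
        logs = []
        for batch in batch_names:
            # Size factor: batch cell total relative to the average batch.
            size_factor = batch_totals[batch] / mean_total
            logs.append(math.log2(counts[(cluster, batch)] / size_factor + pseudo_count))
        mean = sum(logs) / len(logs)
        # Sample variance across batches (requires at least two batches).
        result[cluster] = sum((x - mean) ** 2 for x in logs) / (len(logs) - 1)
    return sorted(result.items(), key=lambda item: item[1], reverse=True)


# Toy data: cluster "C" appears only in batch "y".
toy_clusters = ["A"] * 5 + ["B"] * 5 + ["A"] * 5 + ["B"] * 5 + ["C"] * 10
toy_batches = ["x"] * 10 + ["y"] * 20
ranked = abundance_variance(toy_clusters, toy_batches)
```

Clusters at the top of the ranking are the ones to inspect manually.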
If you use fastMNN in the absence of a batch effect, it may not work correctly
It is possible to remove genuine biological heterogeneity
fastMNN can be instructed to skip the batch correction if the batch effect is below a threshold (the min.batch.skip argument). You can use the effect sizes it calculates to choose this threshold.
In reality the absence of any batch effect would warrant further investigation.
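The thresholding logic can be sketched as below, assuming you already have the relative batch effect size reported for each merge step. The function name `merges_to_skip` and the example numbers are hypothetical; in practice the threshold is passed to fastMNN itself rather than applied by hand.

```python
def merges_to_skip(batch_effect_sizes, threshold):
    """Return the indices of merge steps whose relative batch effect
    is below the threshold, i.e. the steps where correction would be skipped."""
    return [i for i, size in enumerate(batch_effect_sizes) if size < threshold]


# Hypothetical effect sizes for three merge steps.
skipped = merges_to_skip([0.30, 0.02, 0.15], threshold=0.05)
```

With a threshold of zero nothing is skipped, so every merge step is corrected.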
The value of batch correction is that it enables you to see population heterogeneity within clusters/cell types across batches.
However, the corrected values should not be used for gene-based analysis, e.g. DE/marker detection.