26th July 2021

Workflow

Why do we need to think about data intergration?

  • Practicalities of our Experimental Design

  • Different 10X runs at different times OR just the same sample run twice

  • Obscure real biological changes

Data Integration Workflow

Formatting our data

A few ways our data can be arranged (software dependent too)

  • single sample SCE objects QCed in isolation

  • large SCE object containing many samples

  • multiple large SCE objects with multiple samples

Important we make sure things match up

  • Different bioconductor versions

  • Different analysts may have formatted things differently

Cellranger aggr

A useful quick look

Checking for batch effects

Batch Corrections

  • Gaussian/Linear Regression - removeBatchEffect (limma), comBat (sva), rescaleBatches or regressBatches (batchelor)

  • Mutual nearest neighbour correction - Haghverdi et al 2018 Nature Biotechnology

    • mnnCorrect (batchelor)

    • FastMNN (batchelor)

FastMNN

  1. Perform a multi-sample PCA on the (cosine-)normalized expression values to reduce dimensionality.
  2. Identify MNN pairs in the low-dimensional space between a reference batch and a target batch.
  3. Remove variation along the average batch vector in both reference and target batches.
  4. Correct the cells in the target batch towards the reference, using locally weighted correction vectors.
  5. Merge the corrected target batch with the reference, and repeat with the next target batch.

Haghverdi L, Lun ATL, Morgan MD, Marioni JC (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36(5):421

Checking our correction has worked

Checking our correction hasn’t over worked

  • If you use fastMNN in the absence of a batch effect, it may not work correctly

  • It is possible to remove genuine biological heterogeneity

  • fastMNN can be instructed to skip the batch correction if the batch effect is below a threshold. You can use the effect sizes it calculates to do this.

  • In reality the absence of any batch effect would warrant further investigation.

Using the corrected values

The value in batch correction is that it enables you to see population heterogeneity within clusters/celltypes across batches.

  • Also increases the number of cells you have

However the corrected values should not be used for gene based analysis eg. DE/marker detection.

  • fastMNN doesn’t preserve the magnitude or direction of per-gene expression and may have introduced artifical agreement between batches on the gene level.