1 Batch correction - GSM3872442 set

When we do single cell RNA-Seq experiments most of the time inevitably there will be technical batch effects. For example, logistically maybe we can’t put all of our samples in the same run and so have to split them up. Even within samples all sequenced together there can be batch effects from the previous steps. We need to check and correct for them.

1.1 Learning objectives

Understand different batch correction methods using one sample sequenced in two different runs. We will look at differences in :

  • Normalisation order

  • Correction with limma

  • Correction with mnnCorrect

  • Correction with fastMNN

  • Correction with Harmony

1.2 Data

GSM3872442 is a single PBMMC sample sequenced as a pool of two libraries: SRR9264351 and SRR9264352.

We will use this sample to illustrate batch correction methods.

Load object

sce <- readRDS("~/Downloads/caron_sce_nz_postQc.Rds")

Select the GSM3872442 cells:

sample1.nz.sce <- SingleCellExperiment(list(counts=counts(sce[, sce$Run %in% c("SRR9264351")])),
                                       colData=colData(sce[, sce$Run %in% c("SRR9264351")]))

sample2.nz.sce <- SingleCellExperiment(list(counts=counts(sce[, sce$Run %in% c("SRR9264352")])),
                                       colData=colData(sce[, sce$Run %in% c("SRR9264352")]))

saveRDS(sample1.nz.sce, "Robjects/BC_sample1.rds")
saveRDS(sample2.nz.sce, "Robjects/BC_sample2.rds")

The above two code chunks just illustrate how the starting demonstration files for this chapter were made, if you are working from our github repo you will not need this. if you are testing batch corrections on your own samples, for this first part, read in each sample as a separate sce object.

sample1.sce <- readRDS("Robjects/BC_sample1.rds")
sample2.sce <- readRDS("Robjects/BC_sample2.rds")

1.3 Normalise each separately and re-pool

sample1.qclusters <- quickCluster(sample1.sce, method="igraph")
sample1.sce <- computeSumFactors(sample1.sce, min.mean=0.1, cluster=sample1.qclusters)
sample1.sce <- logNormCounts(sample1.sce)

sample2.qclusters <- quickCluster(sample2.sce, method="igraph")
sample2.sce <- computeSumFactors(sample2.sce, min.mean=0.1, cluster=sample2.qclusters)
sample2.sce <- logNormCounts(sample2.sce)

Re-pool:

# recombine the normalized samples together
all.samp.exprs <- do.call(cbind,
                          list("SRR9264351"=exprs(sample1.sce),
                               "SRR9264352"=exprs(sample2.sce)))
colnames(all.samp.exprs) <- c(as.character(colData(sample1.sce)$Barcode),
                              as.character(colData(sample2.sce)$Barcode))

For the PCA we want to quickly select the genes that are most informative. We will use the top 2000 genes with the highest variance.

gene.variances <- apply(all.samp.exprs, 1, var)
names(gene.variances) <- rownames(all.samp.exprs)
highly.variable.genes <- names(gene.variances[order(gene.variances, decreasing=TRUE)])[1:2000]

Perform PCA:

# we need to use a fast approximate algorithm for PCA on large data sets
# this algorithm has a stochastic component,
# so we need to fix the seed number to get the same result each time
set.seed(42)
separate.hvg.pca <- irlba::prcomp_irlba(t(all.samp.exprs[highly.variable.genes, ]), n=5) # we only need a few components
separate.hvg.pcs <- as.data.frame(separate.hvg.pca$x) # extract the principal components
separate.hvg.pcs$Cell <- colnames(all.samp.exprs) # set the sample column as the cell IDs

# combine the PCs with the sample information into a single data frame for plotting
samples.info <- data.frame("Cell"=colnames(all.samp.exprs),
                           "Run"=c(rep("SRR9264351", ncol(sample1.sce)), 
                                   rep("SRR9264352", ncol(sample2.sce))))

# merge the two data frames together
separate.pca.merge <- merge(separate.hvg.pcs, samples.info, by='Cell')

You can merge and plot the PCA results.

ggplot(separate.pca.merge, aes(x=PC1, y=PC2, fill=Run)) +
  geom_point(shape=21, size=3) +
  theme_minimal()

It is easier to bind the two sce objects together into a single sce (remember to insure the colData has the same columns) and the plot whatever you need.

sce.sep <- cbind(sample1.sce, sample2.sce)

sce.sep <- runPCA(sce.sep)

plotPCA(sce.sep, colour_by="Run", size_by = "sum")

sce.sep <- runTSNE(sce.sep, dimred="PCA")
plotTSNE(sce.sep, colour_by="Run", size_by = "sum")

sce.sep <- runUMAP(sce.sep, dimred="PCA")
plotUMAP(sce.sep, colour_by="Run", size_by = "sum")

From the above plots you can see that even though the samples should be the same there is a batch effect from being run seperately. Now we can see it we can try to remove it.

1.4 Normalise batches together

sample3.sce <- SingleCellExperiment(list(counts=counts(sce[, sce$Run %in% c("SRR9264351", "SRR9264352")])),
                                       colData=colData(sce[, sce$Run %in% c("SRR9264351", "SRR9264352")]))

saveRDS(sample3.sce, "./Robjects/BC_sample3.rds")

The above chunk illustrates how the combined object was made (sample 3) but if you are using our repo you can just load it in and process it with the preceeding steps.

sample3.sce <- readRDS("./Robjects/BC_sample3.rds")

sample3.qclusters <- quickCluster(sample3.sce, method="igraph")
sample3.sce <- computeSumFactors(sample3.sce, min.mean=0.1, cluster=sample3.qclusters)
sample3.sce <- logNormCounts(sample3.sce)

pool.exprs <- exprs(sample3.sce)
colnames(pool.exprs) <- gsub(colData(sample3.sce)$Barcode, pattern="-", replacement=".")

Find the 2000 genes with the highest variance:

gene.variances <- apply(pool.exprs, 1, var)
names(gene.variances) <- rownames(pool.exprs)
highly.variable.genes <- names(gene.variances[order(gene.variances, decreasing=TRUE)])[1:2000]

Perform PCA:

# we need to use a fast approximate algorithm for PCA on large data sets
# this algorithm has a stochastic component, so we need to fix the seed number to get the same result each time
set.seed(42)
combined.hvg.pca <- irlba::prcomp_irlba(t(pool.exprs[highly.variable.genes, ]), n=5) # we only need a few components
combined.hvg.pcs <- as.data.frame(combined.hvg.pca$x) # extract the principal components
combined.hvg.pcs$Cell <- colnames(pool.exprs) # set the sample column as the cell IDs

# combine the PCs with the sample information into a single data frame for plotting
samples.info <- data.frame("Cell"=colnames(pool.exprs),
                           "Run"=colData(sample3.sce)$Run)

# merge the two data frames together
combined.pca.merge <- merge(combined.hvg.pcs, samples.info, by='Cell')

Plot PC1-PC2 plane, with cells colored by ‘Run’ (and sized according to library size):

sample3.sce <- runPCA(sample3.sce)
plotPCA(sample3.sce, colour_by="Run", size_by = "sum")

Plot the TSNE:

sample3.sce <- runTSNE(sample3.sce, dimred="PCA")
plotTSNE(sample3.sce, colour_by="Run", size_by = "sum")

Plot the UMAP:

sample3.sce <- runUMAP(sample3.sce, dimred="PCA")
plotUMAP(sample3.sce, colour_by="Run", size_by = "sum")

Notice there is a difference o the plots between running these step separately or together.

1.5 Batch correction

Make sure you sce object contains the information about which run and that information is a factor. This will be needed for future steps.

sample3.sce$Run <- factor(sample3.sce$Run)
sample3.sce$batch <- sample3.sce$Run

1.5.1 Gaussian (normal) linear models

We can use Limma functions to remove the batch effect using a linear model.

suppressMessages(require(limma))
lm_design_batch <- model.matrix(~0 + batch, data = colData(sample3.sce))
fit_lm_batch <- lmFit(logcounts(sample3.sce), lm_design_batch)
resids_lm_batch <- residuals(fit_lm_batch, logcounts(sample3.sce))
assay(sample3.sce, "lm_batch") <- resids_lm_batch

reducedDim(sample3.sce, "PCA_lm_batch") <- reducedDim(
  runPCA(sample3.sce, exprs_values = "lm_batch"), "PCA")

plotReducedDim(sample3.sce, dimred = "PCA_lm_batch",
        colour_by = "batch", 
        size_by = "sum",
        shape_by = "Sample.Name"
        ) +
  ggtitle("LM - regress out batch")

However, as you can see from the PCA the batch effect has not been entirely removed.

1.6 mnnCorrect

mnnCorrect uses a mutual nearest neighbour approach to batch correct. There is extensive documentation on how it works and there is a vignette here:

https://bioconductor.org/packages/release/bioc/vignettes/batchelor/inst/doc/correction.html

The original paper is here:

https://pubmed.ncbi.nlm.nih.gov/29608177/

Everything we need is included in the Batchelor library available in bioconductor. Batchelor is from the same authors as scater and scran and also features in the OSCA manual.

1.6.1 Check presence of batch effect

Batchelor commands to make the two batches and identify highly variable genes for faster dimensionality reduction. It has a slightly different workflow so here we demonstrate from the beginning again. We remake separate sce objects from our combined sample.

library(batchelor)

sce1 <- sample3.sce[, sample3.sce$Run == "SRR9264351"]
sce2 <- sample3.sce[, sample3.sce$Run == "SRR9264352"]
library(scran)
dec1 <- modelGeneVar(sce1)
dec2 <- modelGeneVar(sce2)
combined.dec <- combineVar(dec1, dec2)
chosen.hvgs <- combined.dec$bio > 0
summary(chosen.hvgs)
##    Mode   FALSE    TRUE 
## logical    7655    8974

As a diagnostic, we check that there actually is a batch effect across these datasets by checking that they cluster separately. Here, we combine the two SingleCellExperiment objects without any correction using the NoCorrectParam() flag, and we informally verify that cells from different batches are separated using a t-SNE plot.

There is a moderate batch effect.

library(scater)
combined <- correctExperiments(A=sce1, B=sce2, PARAM=NoCorrectParam())
## Warning in .eliminate_overlaps(colnames(colData(merged)), combine.coldata, :
## ignoring 'colData' fields with same name as 'batchCorrect' output
combined <- runPCA(combined, subset_row=chosen.hvgs)
combined <- runTSNE(combined, dimred="PCA")
combined <- runUMAP(combined, dimred="PCA")
plotPCA(combined, colour_by="batch")

plotTSNE(combined, colour_by="batch")

plotUMAP(combined, colour_by="batch")

reducedDim(sample3.sce, "PCA_noCor") <- reducedDim(combined, "PCA")
reducedDim(sample3.sce, "TSNE_noCor") <- reducedDim(combined, "TSNE")
reducedDim(sample3.sce, "UMAP_noCor") <- reducedDim(combined, "UMAP")

1.6.2 Correct batch effect with mnnCorrect

Gene expression values are used to identify cells with similar expression patterns in both batches.

We need the normalised counts:

batch1 <- logcounts(sce1)
batch2 <- logcounts(sce2)

Now we can run mnnCorrect, even with 2 samples it can take a while.

# returns a matrix with rownames only for the gene subset,
# at the top of the matrix
# preventing copy of that corrected matrix as an assay in the SCE object

# mmnCorrect returns the corrected gene expression matrix directly

y <- batchelor::mnnCorrect(
          batch1, batch2,  
      correct.all = TRUE,
          k = 20,
          sigma = 0.1,
          cos.norm.in = TRUE,
          svd.dim = 2
        )

Copy the corrected values to the SCE object:

assay(sample3.sce, "mnn") <- assay(y, "corrected")

Show reduced dimension plots and check for improved mixing of cells from the two sets:

sample3.sce <- runPCA(sample3.sce, exprs_values = "mnn")
plotPCA(sample3.sce, colour_by="batch")

reducedDim(sample3.sce, "PCA_mnn") <- reducedDim(sample3.sce, "PCA")
sample3.sce <- runTSNE(sample3.sce, dimred="PCA_mnn")
plotTSNE(sample3.sce, colour_by="batch")

reducedDim(sample3.sce, "TSNE_mnn") <- reducedDim(sample3.sce, "TSNE")
sample3.sce <- runUMAP(sample3.sce, dimred="PCA_mnn")
plotUMAP(sample3.sce, colour_by="batch")

reducedDim(sample3.sce, "UMAP_mnn") <- reducedDim(sample3.sce, "UMAP")

The plots show nice amounts of mixing.

1.7 fastMNN

This method is faster than mnnCorrect as it identifies nearest neighbours after dimensionality reduction and is also avaliable in the Batchelor package.

fx <- batchelor::fastMNN(
                      sample3.sce,
                      batch = sample3.sce$Run
            )

Copy the corrected values to the SCE object:

# fastMNN may drop some genes
# so we may not be able to keep the outcome in 'assay'
assay(sample3.sce, "fastmnn") <- assay(fx, "reconstructed")

Show reduced dimension plots and check for improved mixing of cells from the two sets:

fastmnn_pca <- runPCA(assay(sample3.sce, "fastmnn"), rank=2) # slow
reducedDim(sample3.sce, "PCA_fastmnn") <- fastmnn_pca$rotation

plotReducedDim(
  sample3.sce,
  dimred = "PCA_fastmnn",
  colour_by = "batch",
  size_by = "sum",
  shape_by = "Sample.Name"
) + ggtitle("PCA plot: fastMNN") 

sample3.sce <- runTSNE(sample3.sce, dimred="PCA_fastmnn")
plotTSNE(sample3.sce, colour_by="batch")

reducedDim(sample3.sce, "TSNE_fastmnn") <- reducedDim(sample3.sce, "TSNE")
sample3.sce <- runUMAP(sample3.sce, dimred="PCA_fastmnn")
plotUMAP(sample3.sce, colour_by="batch")

reducedDim(sample3.sce, "UMAP_fastmnn") <- reducedDim(sample3.sce, "UMAP")

1.8 Harmony

Harmony [Korsunsky2018fast] is a newer batch correction method, which is also designed to operate on PC space. The algorithm proceeds to iteratively cluster the cells, with the objective function formulated to promote cells from multiple datasets within each cluster. Once a clustering is obtained, the positions of the centroids of each dataset are obtained on a per-cluster basis and the coordinates are corrected. This procedure is iterated until convergence. Harmony comes with a theta parameter that controls the degree of batch correction (higher values lead to more dataset integration), and can account for multiple experimental and biological factors on input (see variant of the ‘Hemberg course’).

library(harmony)
## Loading required package: Rcpp
reducedDim(sample3.sce, "PCA_logcounts") <- reducedDim(
  runPCA(sample3.sce, exprs_values = "logcounts")
)

#Seeing how the end result of Harmony is an altered dimensional reduction space created on the basis of PCA, we plot the obtained manifold here and exclude it from the rest of the follow-ups in the section.

pca <- as.matrix(reducedDim(sample3.sce, "PCA_logcounts"))
harmony_emb <- HarmonyMatrix(pca,
                 sample3.sce$batch,
                 theta=2,
                 do_pca=FALSE)
## Harmony 1/10
## Harmony 2/10
## Harmony 3/10
## Harmony 4/10
## Harmony 5/10
## Harmony 6/10
## Harmony 7/10
## Harmony 8/10
## Harmony 9/10
## Harmony converged after 9 iterations
reducedDim(sample3.sce, "harmony") <- harmony_emb

plotReducedDim(
    sample3.sce,
    dimred = 'harmony',
    colour_by = "batch",
    size_by = "sum",
    shape_by = "Sample.Name"
)

1.9 Session information

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] harmony_0.1.0               Rcpp_1.0.7                 
##  [3] batchelor_1.8.0             limma_3.48.1               
##  [5] Cairo_1.5-12.2              BiocSingular_1.8.1         
##  [7] dplyr_1.0.7                 scran_1.20.1               
##  [9] scater_1.20.1               ggplot2_3.3.5              
## [11] scuttle_1.2.0               SingleCellExperiment_1.14.1
## [13] SummarizedExperiment_1.22.0 Biobase_2.52.0             
## [15] GenomicRanges_1.44.0        GenomeInfoDb_1.28.1        
## [17] IRanges_2.26.0              S4Vectors_0.30.0           
## [19] BiocGenerics_0.38.0         MatrixGenerics_1.4.0       
## [21] matrixStats_0.59.0         
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7              tools_4.1.0              
##  [3] bslib_0.2.5.1             ResidualMatrix_1.2.0     
##  [5] utf8_1.2.1                R6_2.5.0                 
##  [7] irlba_2.3.3               vipor_0.4.5              
##  [9] uwot_0.1.10               DBI_1.1.1                
## [11] colorspace_2.0-2          withr_2.4.2              
## [13] tidyselect_1.1.1          gridExtra_2.3            
## [15] compiler_4.1.0            BiocNeighbors_1.10.0     
## [17] DelayedArray_0.18.0       labeling_0.4.2           
## [19] sass_0.4.0                scales_1.1.1             
## [21] stringr_1.4.0             digest_0.6.27            
## [23] rmarkdown_2.9             XVector_0.32.0           
## [25] pkgconfig_2.0.3           htmltools_0.5.1.1        
## [27] sparseMatrixStats_1.4.0   highr_0.9                
## [29] rlang_0.4.11              FNN_1.1.3                
## [31] DelayedMatrixStats_1.14.0 jquerylib_0.1.4          
## [33] generics_0.1.0            farver_2.1.0             
## [35] jsonlite_1.7.2            BiocParallel_1.26.1      
## [37] RCurl_1.98-1.3            magrittr_2.0.1           
## [39] GenomeInfoDbData_1.2.6    Matrix_1.3-4             
## [41] ggbeeswarm_0.6.0          munsell_0.5.0            
## [43] fansi_0.5.0               viridis_0.6.1            
## [45] lifecycle_1.0.0           stringi_1.7.2            
## [47] yaml_2.2.1                edgeR_3.34.0             
## [49] zlibbioc_1.38.0           Rtsne_0.15               
## [51] grid_4.1.0                dqrng_0.3.0              
## [53] crayon_1.4.1              lattice_0.20-44          
## [55] cowplot_1.1.1             beachmat_2.8.0           
## [57] locfit_1.5-9.4            metapod_1.0.0            
## [59] knitr_1.33                pillar_1.6.1             
## [61] igraph_1.2.6              codetools_0.2-18         
## [63] ScaledMatrix_1.0.0        glue_1.4.2               
## [65] evaluate_0.14             vctrs_0.3.8              
## [67] tidyr_1.1.3               gtable_0.3.0             
## [69] purrr_0.3.4               assertthat_0.2.1         
## [71] xfun_0.24                 rsvd_1.0.5               
## [73] RSpectra_0.16-0           viridisLite_0.4.0        
## [75] tibble_3.1.2              beeswarm_0.4.0           
## [77] cluster_2.1.2             bluster_1.2.1            
## [79] statmod_1.4.36            ellipsis_0.3.2