1 Batch correction - GSM3872442 set

When we do single cell RNA-Seq experiments most of the time inevitably there will be technical batch effects. For example, logistically maybe we can’t put all of our samples in the same run and so have to split them up. Even within samples all sequenced together there can be batch effects from the previous steps. We need to check and correct for them.

1.1 Learning objectives

Understand different batch correction methods using one sample sequenced in two different runs. We will look at differences in :

  • Normalisation order

  • Correction with limma

  • Correction with mnnCorrect

  • Correction with fastMNN

  • Correction with Harmony

1.2 Data

GSM3872442 is a single PBMMC sample sequenced as a pool of two libraries: SRR9264351 and SRR9264352.

We will use this sample to illustrate batch correction methods.

Load object

sce <- readRDS("~/Downloads/caron_sce_nz_postQc.Rds")

Select the GSM3872442 cells:

sample1.nz.sce <- SingleCellExperiment(list(counts=counts(sce[, sce$Run %in% c("SRR9264351")])),
                                       colData=colData(sce[, sce$Run %in% c("SRR9264351")]))

sample2.nz.sce <- SingleCellExperiment(list(counts=counts(sce[, sce$Run %in% c("SRR9264352")])),
                                       colData=colData(sce[, sce$Run %in% c("SRR9264352")]))

saveRDS(sample1.nz.sce, "Robjects/BC_sample1.rds")
saveRDS(sample2.nz.sce, "Robjects/BC_sample2.rds")

The above two code chunks just illustrate how the starting demonstration files for this chapter were made, if you are working from our github repo you will not need this. if you are testing batch corrections on your own samples, for this first part, read in each sample as a separate sce object.

sample1.sce <- readRDS("Robjects/BC_sample1.rds")
sample2.sce <- readRDS("Robjects/BC_sample2.rds")

1.3 Normalise each separately and re-pool

sample1.qclusters <- quickCluster(sample1.sce, method="igraph")
sample1.sce <- computeSumFactors(sample1.sce, min.mean=0.1, cluster=sample1.qclusters)
sample1.sce <- logNormCounts(sample1.sce)

sample2.qclusters <- quickCluster(sample2.sce, method="igraph")
sample2.sce <- computeSumFactors(sample2.sce, min.mean=0.1, cluster=sample2.qclusters)
sample2.sce <- logNormCounts(sample2.sce)

Re-pool:

# recombine the normalized samples together
all.samp.exprs <- do.call(cbind,
                          list("SRR9264351"=exprs(sample1.sce),
                               "SRR9264352"=exprs(sample2.sce)))
colnames(all.samp.exprs) <- c(as.character(colData(sample1.sce)$Barcode),
                              as.character(colData(sample2.sce)$Barcode))

For the PCA we want to quickly select the genes that are most informative. We will use the top 2000 genes with the highest variance.

gene.variances <- apply(all.samp.exprs, 1, var)
names(gene.variances) <- rownames(all.samp.exprs)
highly.variable.genes <- names(gene.variances[order(gene.variances, decreasing=TRUE)])[1:2000]

Perform PCA:

# we need to use a fast approximate algorithm for PCA on large data sets
# this algorithm has a stochastic component,
# so we need to fix the seed number to get the same result each time
set.seed(42)
separate.hvg.pca <- irlba::prcomp_irlba(t(all.samp.exprs[highly.variable.genes, ]), n=5) # we only need a few components
separate.hvg.pcs <- as.data.frame(separate.hvg.pca$x) # extract the principal components
separate.hvg.pcs$Cell <- colnames(all.samp.exprs) # set the sample column as the cell IDs

# combine the PCs with the sample information into a single data frame for plotting
samples.info <- data.frame("Cell"=colnames(all.samp.exprs),
                           "Run"=c(rep("SRR9264351", ncol(sample1.sce)), 
                                   rep("SRR9264352", ncol(sample2.sce))))

# merge the two data frames together
separate.pca.merge <- merge(separate.hvg.pcs, samples.info, by='Cell')

You can merge and plot the PCA results.

ggplot(separate.pca.merge, aes(x=PC1, y=PC2, fill=Run)) +
  geom_point(shape=21, size=3) +
  theme_minimal()