1 Data integration - PBMMC and ETV6-RUNX samples

1.1 Learning Objectives

  • Understand why/when batch correction is required
  • Understand where data set integration fits in the workflow
  • Understand one main method for batch correction
  • Understand how to check your batch correction has worked
library(scater)
library(scran)
library(batchelor)
library(bluster)
library(tidyverse)
library(pheatmap)
library(clustree)
library(Cairo)
library(BiocSingular)
library(cowplot)

Source: ‘Integrating Datasets’ chapter in the OSCA book.

1.2 Abbreviations

  • HVG: highly variable genes
  • MNN: mutual nearest neighbors
  • PBMMC: peripheral blood mononuclear cell
  • SCE: SingleCellExperiment

1.3 Motivation

Large single-cell RNA sequencing (scRNA-seq) projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., changes in operator, differences in reagent quality. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results.

Computational correction of these effects is critical for eliminating batch-to-batch variation, allowing data across multiple batches to be combined for common downstream analysis. However, existing methods based on linear models (Ritchie et al. 2015; Leek et al. 2012) assume that the composition of cell populations are either known or the same across batches. To overcome these limitations, bespoke methods have been developed for batch correction of single-cell data (Haghverdi et al. 2018; Butler et al. 2018; Lin et al. 2019) that do not require a priori knowledge about the composition of the population. This allows them to be used in workflows for exploratory analyses of scRNA-seq data where such knowledge is usually unavailable.

1.4 Load the data

We will load the SCE R object with the normalised counts from after the dimmensionality reduction. This object has each of the 7 individual samples SCE objects in a large list for ease of running the following steps but it is no different to running each sample seperately at this point. Each sample has been subset to 500 cells per sample for demonstration purposes.

sce <- readRDS("./Robjects/caron_postDeconv_5hCellPerSpl_dimRed.Rds")

#split.sce <- split(sce, sample(LETTERS, nrow(sce), replace=TRUE))
x <- subset(sce, , SampleName=="ETV6-RUNX1_1")
sampleNameLevels <- levels(factor(colData(sce)$SampleName))
sampleNameToGet <- grep("PBMMC|ETV6-RUNX1", sampleNameLevels, value = TRUE)
all.sce <- lapply(sampleNameToGet, function(x){ subset(sce, , SampleName==x) })
names(all.sce) <- sampleNameToGet

saveRDS(all.sce, "./Robjects/DataIntegration_all_sce.rds")
all.sce <- readRDS("./Robjects/DataIntegration_all_sce.rds")
all.sce

We then apply the standard workflow to each sample separately:

#--- normalization ---#
# use logNormCounts()
all.sce <- lapply(all.sce, logNormCounts)

#--- variance-modelling ---#
# model variance with modelGeneVar()
# find highly variable genes (HVGs) with getTopHVGs()
all.dec <- lapply(all.sce, modelGeneVar)
all.hvgs <- lapply(all.dec, getTopHVGs, prop=0.1)

#--- dimensionality-reduction ---#
# use runPCA()
# then compute embeddings with runTSNE() and runUMAP()
set.seed(10000)
all.sce <- mapply(FUN=runPCA,
          x=all.sce,
          subset_row=all.hvgs,
          MoreArgs=list(ncomponents=25,
                BSPARAM=RandomParam()),
          SIMPLIFY=FALSE)

set.seed(100000)
all.sce <- lapply(all.sce, runTSNE, dimred="PCA")

set.seed(1000000)
all.sce <- lapply(all.sce, runUMAP, dimred="PCA")

#--- clustering ---#
# cluster each sample separately
for (n in names(all.sce)) {
    g <- buildSNNGraph(all.sce[[n]], k=10, use.dimred='PCA')
    clust <- igraph::cluster_walktrap(g)$membership
    all.sce[[n]]$label  <- factor(clust)
}

To prepare for the batch correction:

  • We subset all batches to the common “universe” of features. In this case, it is straightforward as all the samples in these experiments use the same Ensembl gene annotation.
allNames <- unlist(lapply(all.sce, function(x){rownames(x)}))
allNamesNb <- table(allNames)
universe <- names(allNamesNb)[allNamesNb==7] # where 7 is number of samples
  • The size of this common “universe” of features here is the number of features shared by all 7 samples is: 17700.
# Subsetting the SingleCellExperiment object.
uni.sce <- lapply(all.sce, function(x){x[universe,]})
# Also subsetting the variance modelling results, for convenience.
uni.dec <- lapply(all.dec, function(x){x[universe,]})
  • We rescale each batch to adjust for differences in sequencing depth between batches. The multiBatchNorm() function recomputes log-normalized expression values after adjusting the size factors for systematic differences in coverage between SingleCellExperiment (SCE) objects. (Size factors only remove biases between cells within a single batch.) This improves the quality of the correction by removing one aspect of the technical differences between batches.
# rescale each batch to adjust for differences in sequencing depth between batches
rescaled <- multiBatchNorm(uni.sce, batch = "SampleName")
  • We perform feature selection by averaging the variance components across all batches with the combineVar() function. We compute the average as it is responsive to batch-specific HVGs while still preserving the within-batch ranking of genes.
# compute average variance components across samples
combined.dec <- combineVar(uni.dec)

# identify highly variables genes
# here as those with a positive biological component
chosen.hvgs <- combined.dec$bio > 0
combined.dec$chosenHvg <- chosen.hvgs

Number of HVGs: 9479.

When integrating datasets of variable composition, it is generally safer to err on the side of including more genes than are used in a single dataset analysis, to ensure that markers are retained for any dataset-specific subpopulations that might be present. For a top X selection, this means using a larger X (say, ~5000), or in this case, we simply take all genes above the trend.

Alternatively, a more forceful approach to feature selection can be used based on marker genes from within-batch comparisons.

1.5 Diagnosing batch effects

Before we actually perform any correction, it is worth examining whether there is any batch effect in this dataset. We combine the SCE objects and perform a PCA on the log-expression values for all genes with positive (average) biological components.

# Reminder: the metadata must be the same for each sample for cbind()ing.
# concat matrices:
uncorrected <- do.call(cbind, rescaled)

# copy choice to uncorrected SCE:
all(rownames(combined.dec) == rownames(uncorrected))
## [1] TRUE
rowData(uncorrected) <- cbind(rowData(uncorrected), combined.dec)

saveRDS(uncorrected, "./Robjects/DataIntegration_uncorrected.rds")
# Perform PCA
# Using RandomParam() as it is more efficient for file-backed matrices.
set.seed(0010101010)
uncorrected <- runPCA(uncorrected,
                      subset_row=chosen.hvgs,
                      BSPARAM=BiocSingular::RandomParam())

We use graph-based clustering on the components to obtain a summary of the population structure. Different clustering methods are discussed in a later chapter.

As the samples should be replicates, each cluster should ideally consist of cells from each batch. However, we instead see clusters that are comprised of cells from a single batch. This indicates that cells of the same type are artificially separated due to technical differences between batches.

# build shared nearest-neighbour graph
snn.gr <- buildSNNGraph(uncorrected, use.dimred="PCA")
# identify cluster with the walk trap method
clusters <- igraph::cluster_walktrap(snn.gr)$membership
# get number of cells for each {cluster, batch} pair
clusterTab <- data.frame("clusters" = clusters, "batch" = uncorrected$SampleName)

Cluster size and cell contribution by sample:

ClusterInfo <- clusterTab %>% 
  group_by(clusters, batch) %>%
  summarise(cells = n()) 
p1 <- ggplot(data=ClusterInfo, aes(x=clusters,y=cells, fill=batch)) +
    geom_col() +
    theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
  ggtitle("uncorrected, cell numbers") +
  theme(legend.text = element_text(size = 7))
p2 <- ggplot(data=clusterTab, aes(x=clusters, fill = batch)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
    theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
  ggtitle("uncorrected, proportions") +
  theme(legend.text = element_text(size = 7))

plot_grid(p1, p2, ncol=1)

We can also visualize the uncorrected coordinates using a t-SNE plot. The strong separation between cells from different batches is consistent with the clustering results.

set.seed(1111001)
uncorrected <- runTSNE(uncorrected, dimred = "PCA")
# draw:
p.tsne <- plotTSNE(uncorrected,
          colour_by = "SampleName",
          shape_by = "SampleGroup") +
theme(legend.text = element_text(size = 7))

p.tsne