1 Data integration - PBMMC and ETV6-RUNX samples

1.1 Learning Objectives

  • Understand why/when batch correction is required
  • Understand where data set integration fits in the workflow
  • Understand one main method for batch correction
  • Understand how to check your batch correction has worked
library(scater)
library(scran)
library(batchelor)
library(bluster)
library(tidyverse)
library(pheatmap)
library(clustree)
library(Cairo)
library(BiocSingular)
library(cowplot)

Source: ‘Integrating Datasets’ chapter in the OSCA book.

1.2 Abbreviations

  • HVG: highly variable genes
  • MNN: mutual nearest neighbors
  • PBMMC: peripheral blood mononuclear cell
  • SCE: SingleCellExperiment

1.3 Motivation

Large single-cell RNA sequencing (scRNA-seq) projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., changes in operator, differences in reagent quality. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results.

Computational correction of these effects is critical for eliminating batch-to-batch variation, allowing data across multiple batches to be combined for common downstream analysis. However, existing methods based on linear models (Ritchie et al. 2015; Leek et al. 2012) assume that the composition of cell populations are either known or the same across batches. To overcome these limitations, bespoke methods have been developed for batch correction of single-cell data (Haghverdi et al. 2018; Butler et al. 2018; Lin et al. 2019) that do not require a priori knowledge about the composition of the population. This allows them to be used in workflows for exploratory analyses of scRNA-seq data where such knowledge is usually unavailable.

1.4 Load the data

We will load the SCE R object with the normalised counts from after the dimmensionality reduction. This object has each of the 7 individual samples SCE objects in a large list for ease of running the following steps but it is no different to running each sample seperately at this point. Each sample has been subset to 500 cells per sample for demonstration purposes.

sce <- readRDS("./Robjects/caron_postDeconv_5hCellPerSpl_dimRed.Rds")

#split.sce <- split(sce, sample(LETTERS, nrow(sce), replace=TRUE))
x <- subset(sce, , SampleName=="ETV6-RUNX1_1")
sampleNameLevels <- levels(factor(colData(sce)$SampleName))
sampleNameToGet <- grep("PBMMC|ETV6-RUNX1", sampleNameLevels, value = TRUE)
all.sce <- lapply(sampleNameToGet, function(x){ subset(sce, , SampleName==x) })
names(all.sce) <- sampleNameToGet

saveRDS(all.sce, "./Robjects/DataIntegration_all_sce.rds")
all.sce <- readRDS("./Robjects/DataIntegration_all_sce.rds")
all.sce

We then apply the standard workflow to each sample separately:

#--- normalization ---#
# use logNormCounts()
all.sce <- lapply(all.sce, logNormCounts)

#--- variance-modelling ---#
# model variance with modelGeneVar()
# find highly variable genes (HVGs) with getTopHVGs()
all.dec <- lapply(all.sce, modelGeneVar)
all.hvgs <- lapply(all.dec, getTopHVGs, prop=0.1)

#--- dimensionality-reduction ---#
# use runPCA()
# then compute embeddings with runTSNE() and runUMAP()
set.seed(10000)
all.sce <- mapply(FUN=runPCA,
          x=all.sce,
          subset_row=all.hvgs,
          MoreArgs=list(ncomponents=25,
                BSPARAM=RandomParam()),
          SIMPLIFY=FALSE)

set.seed(100000)
all.sce <- lapply(all.sce, runTSNE, dimred="PCA")

set.seed(1000000)
all.sce <- lapply(all.sce, runUMAP, dimred="PCA")

#--- clustering ---#
# cluster each sample separately
for (n in names(all.sce)) {
    g <- buildSNNGraph(all.sce[[n]], k=10, use.dimred='PCA')
    clust <- igraph::cluster_walktrap(g)$membership
    all.sce[[n]]$label  <- factor(clust)
}

To prepare for the batch correction:

  • We subset all batches to the common “universe” of features. In this case, it is straightforward as all the samples in these experiments use the same Ensembl gene annotation.
allNames <- unlist(lapply(all.sce, function(x){rownames(x)}))
allNamesNb <- table(allNames)
universe <- names(allNamesNb)[allNamesNb==7] # where 7 is number of samples
  • The size of this common “universe” of features here is the number of features shared by all 7 samples is: 17700.
# Subsetting the SingleCellExperiment object.
uni.sce <- lapply(all.sce, function(x){x[universe,]})
# Also subsetting the variance modelling results, for convenience.
uni.dec <- lapply(all.dec, function(x){x[universe,]})
  • We rescale each batch to adjust for differences in sequencing depth between batches. The multiBatchNorm() function recomputes log-normalized expression values after adjusting the size factors for systematic differences in coverage between SingleCellExperiment (SCE) objects. (Size factors only remove biases between cells within a single batch.) This improves the quality of the correction by removing one aspect of the technical differences between batches.
# rescale each batch to adjust for differences in sequencing depth between batches
rescaled <- multiBatchNorm(uni.sce, batch = "SampleName")
  • We perform feature selection by averaging the variance components across all batches with the combineVar() function. We compute the average as it is responsive to batch-specific HVGs while still preserving the within-batch ranking of genes.
# compute average variance components across samples
combined.dec <- combineVar(uni.dec)

# identify highly variables genes
# here as those with a positive biological component
chosen.hvgs <- combined.dec$bio > 0
combined.dec$chosenHvg <- chosen.hvgs

Number of HVGs: 9479.

When integrating datasets of variable composition, it is generally safer to err on the side of including more genes than are used in a single dataset analysis, to ensure that markers are retained for any dataset-specific subpopulations that might be present. For a top X selection, this means using a larger X (say, ~5000), or in this case, we simply take all genes above the trend.

Alternatively, a more forceful approach to feature selection can be used based on marker genes from within-batch comparisons.

1.5 Diagnosing batch effects

Before we actually perform any correction, it is worth examining whether there is any batch effect in this dataset. We combine the SCE objects and perform a PCA on the log-expression values for all genes with positive (average) biological components.

# Reminder: the metadata must be the same for each sample for cbind()ing.
# concat matrices:
uncorrected <- do.call(cbind, rescaled)

# copy choice to uncorrected SCE:
all(rownames(combined.dec) == rownames(uncorrected))
## [1] TRUE
rowData(uncorrected) <- cbind(rowData(uncorrected), combined.dec)

saveRDS(uncorrected, "./Robjects/DataIntegration_uncorrected.rds")
# Perform PCA
# Using RandomParam() as it is more efficient for file-backed matrices.
set.seed(0010101010)
uncorrected <- runPCA(uncorrected,
                      subset_row=chosen.hvgs,
                      BSPARAM=BiocSingular::RandomParam())

We use graph-based clustering on the components to obtain a summary of the population structure. Different clustering methods are discussed in a later chapter.

As the samples should be replicates, each cluster should ideally consist of cells from each batch. However, we instead see clusters that are comprised of cells from a single batch. This indicates that cells of the same type are artificially separated due to technical differences between batches.

# build shared nearest-neighbour graph
snn.gr <- buildSNNGraph(uncorrected, use.dimred="PCA")
# identify cluster with the walk trap method
clusters <- igraph::cluster_walktrap(snn.gr)$membership
# get number of cells for each {cluster, batch} pair
clusterTab <- data.frame("clusters" = clusters, "batch" = uncorrected$SampleName)

Cluster size and cell contribution by sample:

ClusterInfo <- clusterTab %>% 
  group_by(clusters, batch) %>%
  summarise(cells = n()) 
p1 <- ggplot(data=ClusterInfo, aes(x=clusters,y=cells, fill=batch)) +
    geom_col() +
    theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
  ggtitle("uncorrected, cell numbers") +
  theme(legend.text = element_text(size = 7))
p2 <- ggplot(data=clusterTab, aes(x=clusters, fill = batch)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
    theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
  ggtitle("uncorrected, proportions") +
  theme(legend.text = element_text(size = 7))

plot_grid(p1, p2, ncol=1)

We can also visualize the uncorrected coordinates using a t-SNE plot. The strong separation between cells from different batches is consistent with the clustering results.

set.seed(1111001)
uncorrected <- runTSNE(uncorrected, dimred = "PCA")
# draw:
p.tsne <- plotTSNE(uncorrected,
          colour_by = "SampleName",
          shape_by = "SampleGroup") +
theme(legend.text = element_text(size = 7))

p.tsne

p.tsne + facet_wrap(~uncorrected$SampleGroup)

Of course, the other explanation for batch-specific clusters is that there are cell types that are unique to each batch. The degree of intermingling of cells from different batches is not an effective diagnostic when the batches involved might actually contain unique cell subpopulations. If a cluster only contains cells from a single batch, one can always debate whether that is caused by a failure of the correction method or if there is truly a batch-specific subpopulation. For example, do batch-specific metabolic or differentiation states represent distinct subpopulations? Or should they be merged together? We will not attempt to answer this here, only noting that each batch correction algorithm will make different (and possibly inappropriate) decisions on what constitutes “shared” and “unique” populations.

Let us write out the corresponding SCE object.

saveRDS(uncorrected, "./Robjects/DataIntegration_uncorrected.rds")

1.6 Linear regression

Batch effects in bulk RNA sequencing studies are commonly removed with linear regression. This involves fitting a linear model to each gene’s expression profile, setting the undesirable batch term to zero and recomputing the observations sans the batch effect, yielding a set of corrected expression values for downstream analyses. Linear modelling is the basis of the removeBatchEffect() function from the limma package (Ritchie et al. 2015) as well the comBat() function from the sva package (Leek et al. 2012).

To use this approach in a scRNA-seq context, we assume that the composition of cell subpopulations is the same across batches. We also assume that the batch effect is additive, i.e., any batch-induced fold-change in expression is the same across different cell subpopulations for any given gene. These are strong assumptions as batches derived from different individuals will naturally exhibit variation in cell type abundances and expression. Nonetheless, they may be acceptable when dealing with batches that are technical replicates generated from the same population of cells. (In fact, when its assumptions hold, linear regression is the most statistically efficient as it uses information from all cells to compute the common batch vector.) Linear modelling can also accommodate situations where the composition is known a priori by including the cell type as a factor in the linear model, but this situation is even less common.

We use the rescaleBatches() function from the batchelor package to remove the batch effect. This is roughly equivalent to applying a linear regression to the log-expression values per gene, with some adjustments to improve performance and efficiency. For each gene, the mean expression in each batch is scaled down until it is equal to the lowest mean across all batches. We deliberately choose to scale all expression values down as this mitigates differences in variance when batches lie at different positions on the mean-variance trend. (Specifically, the shrinkage effect of the pseudo-count is greater for smaller counts, suppressing any differences in variance across batches.) An additional feature of rescaleBatches() is that it will preserve sparsity in the input matrix for greater efficiency, whereas other methods like removeBatchEffect() will always return a dense matrix.

linear_rescaled <- rescaleBatches(rescaled)
linear_rescaled
## class: SingleCellExperiment 
## dim: 17700 3500 
## metadata(0):
## assays(1): corrected
## rownames(17700): ENSG00000000003 ENSG00000000419 ... ENSG00000288380
##   ENSG00000288398
## rowData names(0):
## colnames(3500): 1_AAACGGGCAGTTCATG-1 1_AAACGGGGTTCACCTC-1 ...
##   12_TTTGTCACAATGAAAC-1 12_TTTGTCATCAGTTGAC-1
## colData names(1): batch
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

After clustering, we observe that most clusters consist of mixtures of cells from the two replicate batches, consistent with the removal of the batch effect. This conclusion is supported by the apparent mixing of cells from different batches in Figure 13.2. However, at least one batch-specific cluster is still present, indicating that the correction is not entirely complete. This is attributable to violation of one of the aforementioned assumptions, even in this simple case involving replicated batches.

set.seed(1010101010) # To ensure reproducibility of IRLBA.
linear_rescaled <- runPCA(linear_rescaled, subset_row = chosen.hvgs, exprs_values = "corrected")

snn.gr.linear <- buildSNNGraph(linear_rescaled, use.dimred = "PCA")
clusters.linear <- igraph::cluster_walktrap(snn.gr.linear)$membership

clusterTab.linear <- data.frame("clusters" = clusters.linear, "batch" = linear_rescaled$batch)

Cluster size and cell contribution by sample:

ClusterInfo.linear <- clusterTab.linear %>% 
  group_by(clusters,batch) %>%
  summarise(cells = n())
lp1 <- ggplot(data=ClusterInfo.linear, aes(x=clusters,y=cells, fill=batch)) +
    theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
    geom_col() +
    theme(legend.text = element_text(size = 7))
lp2 <- ggplot(data=clusterTab.linear, aes(x=clusters, fill=batch)) +
  geom_bar(position = "fill") +
    theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
  scale_y_continuous(labels = scales::percent) +
  theme(legend.text = element_text(size = 7))
plot_grid(lp1, lp2, ncol=1)

Compute and plot t-SNE:

linear_rescaled <- runTSNE(linear_rescaled, dimred = "PCA")
linear_rescaled$batch <- factor(linear_rescaled$batch)
lp.tsne <- plotTSNE(linear_rescaled, colour_by = "batch")
lp.tsne

lp.tsne + facet_wrap(~uncorrected$SampleGroup)

1.7 Mutual Nearest Neighbour correction

1.7.1 Algorithm overview

Consider a cell a in batch A, and identify the cells in batch B that are nearest neighbors to a in the expression space defined by the selected features. Repeat this for a cell b in batch B, identifying its nearest neighbors in A. Mutual nearest neighbors are pairs of cells from different batches that belong in each other’s set of nearest neighbors. The reasoning is that MNN pairs represent cells from the same biological state prior to the application of a batch effect - see Haghverdi et al. (2018) for full theoretical details. Thus, the difference between cells in MNN pairs can be used as an estimate of the batch effect, the subtraction of which yields batch-corrected values.

Compared to linear regression, MNN correction does not assume that the population composition is the same or known beforehand. This is because it learns the shared population structure via identification of MNN pairs and uses this information to obtain an appropriate estimate of the batch effect. Instead, the key assumption of MNN-based approaches is that the batch effect is orthogonal to the biology in high-dimensional expression space. Violations reduce the effectiveness and accuracy of the correction, with the most common case arising from variations in the direction of the batch effect between clusters. Nonetheless, the assumption is usually reasonable as a random vector is very likely to be orthogonal in high-dimensional space.

1.7.2 Application to the data

The batchelor package provides an implementation of the MNN approach via the fastMNN() function. (Unlike the MNN method originally described by Haghverdi et al. (2018), the fastMNN() function performs PCA to reduce the dimensions beforehand and speed up the downstream neighbor detection steps.) We apply it to our two PBMC batches to remove the batch effect across the highly variable genes in chosen.hvgs. To reduce computational work and technical noise, all cells in all batches are projected into the low-dimensional space defined by the top d principal components. Identification of MNNs and calculation of correction vectors are then performed in this low-dimensional space.

# Using randomized SVD here, as this is faster than 
# irlba for file-backed matrices.
set.seed(1000101001)
mnn.out <- fastMNN(rescaled,
                   auto.merge = TRUE,
                   d = 50,
                   k = 20,
                   subset.row = chosen.hvgs,
                   BSPARAM=BiocSingular::RandomParam(deferred=TRUE))
mnn.out
## class: SingleCellExperiment 
## dim: 9479 3500 
## metadata(2): merge.info pca.info
## assays(1): reconstructed
## rownames(9479): ENSG00000000003 ENSG00000000457 ... ENSG00000288302
##   ENSG00000288398
## rowData names(1): rotation
## colnames(3500): 1_AAACGGGCAGTTCATG-1 1_AAACGGGGTTCACCTC-1 ...
##   12_TTTGTCACAATGAAAC-1 12_TTTGTCATCAGTTGAC-1
## colData names(1): batch
## reducedDimNames(1): corrected
## mainExpName: NULL
## altExpNames(0):
mnn.out.corre.dim <- dim(reducedDim(mnn.out, "corrected"))
mnn.out.recon.dim <- dim(assay(mnn.out, "reconstructed"))

# add feature selection outcome to mmn.out
# used in other analyses later.
columnsToGet <- setdiff(colnames(combined.dec), "per.block")
combined.dec.df <- combined.dec[,columnsToGet] %>%
  data.frame() %>%
  tibble::rownames_to_column("ID") %>%
  filter(ID %in% rownames(rowData(mnn.out)))
rotationMat <- rowData(mnn.out)$rotation
rowData(mnn.out)$ID <- rownames(rowData(mnn.out))
rowData(mnn.out)$rotation <- NULL

rowData(mnn.out) <- rowData(mnn.out) %>%
  data.frame() %>%
  left_join(combined.dec.df, by="ID") %>%
  DataFrame()

# add rotation back
rowData(mnn.out)$rotation <- rotationMat

# also have gene symbol:
# copied from 'uncorrected'
rowData(mnn.out)$Symbol <- rowData(uncorrected)[rowData(mnn.out)$ID,]$Symbol

# tidy
rm(columnsToGet, rotationMat)

The function returns a SCE object containing corrected values for downstream analyses like clustering or visualization. Each column of mnn.out corresponds to a cell in one of the batches, while each row corresponds to an input gene in chosen.hvgs. The batch field in the column metadata contains a vector specifying the batch of origin of each cell.

The corrected matrix in the reducedDims() contains the low-dimensional corrected coordinates for all cells, which we will use in place of the PCs in our downstream analyses (3500 cells and 50 PCs).

A reconstructed matrix in the assays() contains the corrected expression values for each gene in each cell, obtained by projecting the low-dimensional coordinates in corrected back into gene expression space (9479 genes and 3500 cells). We do not recommend using this for anything other than visualization.

The most relevant parameter for tuning fastMNN() is k, which specifies the number of nearest neighbors to consider when defining MNN pairs. This can be interpreted as the minimum anticipated frequency of any shared cell type or state in each batch. Increasing k will generally result in more aggressive merging as the algorithm is more generous in matching subpopulations across batches. It can occasionally be desirable to increase k if one clearly sees that the same cell types are not being adequately merged across batches.

colDataList <- lapply(rescaled, function(x){colData(x)})
colDataDf <- do.call(rbind, colDataList)
colData(mnn.out) <- DataFrame(colDataDf)

1.8 Correction diagnostics

1.8.1 Mixing between batches

We cluster on the low-dimensional corrected coordinates to obtain a partitioning of the cells that serves as a proxy for the population structure. If the batch effect is successfully corrected, clusters corresponding to shared cell types or states should contain cells from multiple batches. We see that all clusters contain contributions from each batch after correction, consistent with our expectation that the batches are replicates of each other.

snn.gr <- buildSNNGraph(mnn.out, use.dimred="corrected", k=20)
clusters.mnn <- igraph::cluster_walktrap(snn.gr)$membership
colLabels(mnn.out) <- factor(clusters.mnn)
clusterTab.mnn <- data.frame(clusters=clusters.mnn, batch=mnn.out$SampleName, source=mnn.out$SampleGroup)

Cluster size and cell contribution by sample, with clusters sorted by size:

ClusterInfo.mnn <- clusterTab.mnn %>% 
  as.data.frame() %>%
  group_by(clusters,batch) %>%
  summarise(cells = n())

mp1 <- ggplot(data=ClusterInfo.mnn, aes(x=clusters,y=cells, fill=batch)) +
    theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
    geom_col() +
    theme(legend.text = element_text(size = 7))
mp2 <- ggplot(data=clusterTab.mnn, aes(x=clusters, fill=batch)) +
  geom_bar(position = "fill") +
    theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
  scale_y_continuous(labels = scales::percent) +
  theme(legend.text = element_text(size = 7))

plot_grid(mp1, mp2, ncol=1)

Cluster size and cell contribution by sample type:

ClusterInfo.mnn.source <- clusterTab.mnn %>% 
  as.data.frame() %>%
  group_by(clusters,source) %>%
  summarise(cells = n())

mp1.s <- ggplot(data=ClusterInfo.mnn.source, aes(x=clusters,y=cells, fill=source)) +
    theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
    geom_col()
mp2.s <- ggplot(data=clusterTab.mnn, aes(x=clusters, fill=source)) +
  geom_bar(position = "fill") +
    theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
  scale_y_continuous(labels = scales::percent)

plot_grid(mp1.s, mp2.s, ncol=1)

We can also compute the variation in the log-abundances to rank the clusters with the greatest variability in their proportional abundances across batches. We can then focus on batch-specific clusters that may be indicative of incomplete batch correction. Obviously, though, this diagnostic is subject to interpretation as the same outcome can be caused by batch-specific populations; some prior knowledge about the biological context is necessary to distinguish between these two possibilities. The table below shows the number of cells for each cluster (row) and sample (column) together with the variance in cell number across these samples (‘var’ column).

Also bear in mind that the variance is computed across 7 samples here and only serves to sort clusters.

tab.mnn <- table(clusters=clusterTab.mnn$clusters, sampleName=clusterTab.mnn$batch)

# Using a large pseudo.count to avoid unnecessarily
# large variances when the counts are low.
norm <- normalizeCounts(tab.mnn, pseudo_count=10)

# Ranking clusters by the largest variances.
rv <- rowVars(norm) %>% 
  round(2)

# show
DataFrame(tab.mnn, var=rv)[order(rv, decreasing=TRUE),]
## DataFrame with 13 rows and 4 columns
##     clusters   sampleName      Freq       var
##     <factor>        <Rle> <integer> <numeric>
## 1         5  ETV6-RUNX1_1       295      2.96
## 2         13 ETV6-RUNX1_1         4      2.39
## 3         1  ETV6-RUNX1_1         0      2.30
## 4         4  ETV6-RUNX1_1       112      1.63
## 5         12 ETV6-RUNX1_1         0      1.63
## ...      ...          ...       ...       ...
## 9         2  ETV6-RUNX1_1        87      0.86
## 10        9  ETV6-RUNX1_1         0      0.77
## 11        6  ETV6-RUNX1_1         2      0.75
## 12        8  ETV6-RUNX1_1         0      0.60
## 13        11 ETV6-RUNX1_1         0      0.35

We can also visualize the corrected coordinates using a t-SNE plot. The presence of visual clusters containing cells from both batches provides a comforting illusion that the correction was successful.

set.seed(0010101010)
mnn.out <- runTSNE(mnn.out, dimred="corrected")
mnn.out <- runUMAP(mnn.out, dimred="corrected")

1.8.2 Challenge 1

Draw the TSNE for the fastmnn corrected data. Compare it with the uncorrected TSNE, what do you think?

Find the challenge markdown for this section in the course materials folder.

1.8.3 Proportion of Variance

For fastMNN(), one useful diagnostic is the proportion of variance within each batch that is lost during MNN correction. Specifically, this refers to the within-batch variance that is removed during orthogonalization with respect to the average correction vector at each merge step. This is returned via the lost.var field in the metadata of mnn.out, which contains a matrix of the variance lost in each batch (column) at each merge step (row).

round(metadata(mnn.out)$merge.info$lost.var,2)
##      ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 PBMMC_1b PBMMC_2
## [1,]         0.00         0.00         0.00         0.00     0.00    0.06
## [2,]         0.00         0.00         0.00         0.00     0.06    0.03
## [3,]         0.00         0.00         0.12         0.00     0.01    0.01
## [4,]         0.00         0.00         0.01         0.09     0.00    0.00
## [5,]         0.00         0.09         0.00         0.01     0.01    0.01
## [6,]         0.09         0.01         0.01         0.02     0.01    0.02
##      PBMMC_3
## [1,]    0.03
## [2,]    0.03
## [3,]    0.01
## [4,]    0.00
## [5,]    0.01
## [6,]    0.01

Large proportions of lost variance (>10%) suggest that correction is removing genuine biological heterogeneity. This would occur due to violations of the assumption of orthogonality between the batch effect and the biological subspace (Haghverdi et al. 2018). In this case, the proportion of lost variance is small, indicating that non-orthogonality is not a major concern.

The following t-SNE shows the clusters identified:

mp.tsne.cluster <- plotTSNE(mnn.out, colour_by="label", shape_by="SampleGroup")
mp.tsne.cluster

mp.tsne.cluster + facet_wrap(~colData(mnn.out)$SampleName)

The following t-SNE plots show expression levels of known cell type marker genes.

genesToShow <- c(
         "CD79A", # CD79A   B ***
         "CST3", # CST3     monocytes ***
         "CD3D", # CD3D      T cells ***
         "HBA1" # HBA1   erythrocytes ***
        )

tmpInd <- which(rowData(uncorrected)$Symbol %in% genesToShow)
ensToShow <- rowData(uncorrected)$ID[tmpInd]

#B cells:

genex <- ensToShow[1]
p <- plotTSNE(mnn.out, colour_by = genex, by_exprs_values="reconstructed")
p <- p + ggtitle(
        paste("B cells", genex,
        rowData(uncorrected)[genex,"Symbol"])
    )
pB <- p

#T cells:

genex <- ensToShow[3]
p <- plotTSNE(mnn.out, colour_by = genex, by_exprs_values="reconstructed")
p <- p + ggtitle(
        paste("T cells", genex,
        rowData(uncorrected)[genex,"Symbol"])
    )
pT <- p

#monocytes:

genex <- ensToShow[2]
p <- plotTSNE(mnn.out, colour_by = genex, by_exprs_values="reconstructed")
p <- p + ggtitle(
        paste("monocytes", genex,
        rowData(uncorrected)[genex,"Symbol"])
    )
pM <- p

#erythrocytes:

genex <- ensToShow[4]
p <- plotTSNE(mnn.out, colour_by = genex, by_exprs_values="reconstructed")
p <- p + ggtitle(
        paste("erythrocytes", genex,
        rowData(uncorrected)[genex,"Symbol"])
    )
pE <- p
gridExtra::grid.arrange(pB + theme(legend.position="bottom"),
                        pT + theme(legend.position="bottom"),
                        pM + theme(legend.position="bottom"),
                        pE + theme(legend.position="bottom"),
                        ncol=2)

Compare to the uncorrected values:

# B cells
genex <- ensToShow[1]
p <- plotTSNE(uncorrected, colour_by = genex)
p <- p + ggtitle(
        paste("B cells", genex,
        rowData(uncorrected)[genex,"Symbol"])
    )
pBu <- p

#Compare to the uncorrected values, T cells:

genex <- ensToShow[3]
p <- plotTSNE(uncorrected, colour_by = genex)
p <- p + ggtitle(
        paste("T cells", genex,
        rowData(uncorrected)[genex,"Symbol"])
    )
#print(p)
pTu <- p
#```

#Compare to the uncorrected values, monocytes:

genex <- ensToShow[2]
p <- plotTSNE(uncorrected, colour_by = genex)
p <- p + ggtitle(
        paste("monocytes", genex,
        rowData(uncorrected)[genex,"Symbol"])
    )
pMu <- p

#Compare to the uncorrected values, erythrocytes:

genex <- ensToShow[4]
p <- plotTSNE(uncorrected, colour_by = genex)
p <- p + ggtitle(
        paste("erythrocytes", genex,
        rowData(uncorrected)[genex,"Symbol"])
    )
pEu <- p
gridExtra::grid.arrange(pBu + theme(legend.position="bottom"),
                        pTu + theme(legend.position="bottom"),
                        pMu + theme(legend.position="bottom"),
                        pEu + theme(legend.position="bottom"),
                        ncol=2)

1.8.4 Preserving biological heterogeneity

1.8.4.1 Comparison between within-batch clusters and across-batch clusters obtained after MNN correction

Another useful diagnostic check is to compare the clustering within each batch to the clustering of the merged data. Accurate data integration should preserve variance within each batch as there should be nothing to remove between cells in the same batch. This check complements the previously mentioned diagnostics that only focus on the removal of differences between batches. Specifically, it protects us against cases where the correction method simply aggregates all cells together, which would achieve perfect mixing but also discard the biological heterogeneity of interest.

Ideally, we should see a many-to-1 mapping where the across-batch clustering is nested inside the within-batch clusterings. This indicates that any within-batch structure was preserved after correction while acknowledging that greater resolution is possible with more cells. In practice, more discrepancies can be expected even when the correction is perfect, due to the existence of closely related clusters that were arbitrarily separated in the within-batch clustering. As a general rule, we can be satisfied with the correction if the vast majority of entries are zero, though this may depend on whether specific clusters of interest are gained or lost.

Here is a demonstration with one of the samples, where each entry is colored according to the number of cells with each pair of labels (before and after correction), on the log10 scale with pseudocounts (+10) for a smoother color transition (so a minimum value of log10(0+10) == 1).

tabE1 <- table(paste("after", clusters.mnn[linear_rescaled$batch=="ETV6-RUNX1_1"]),
               paste("before", colLabels(rescaled[[1]])))
heatE1 <- pheatmap(log10(tabE1 +10), cluster_rows = FALSE, cluster_cols = FALSE,
                   main = "ETV6-RUNX1_1")
heatE1

The redistribution of cells from one set of clusters to another, here ‘within-batch before’ and ‘across-batch after’ correction, may also be visualized with a clustering tree clustree. See the Extended section linked on the course website.

1.8.4.2 Rand index

Finally, we can summarize the agreement between clusterings by computing the Rand index. This provides a simple metric that we can use to assess the preservation of variation by different correction methods. Larger rand indices (i.e., closer to 1) are more desirable, though this must be balanced against the ability of each method to actually remove the batch effect.

ariVec <- vector(mode = "numeric", length = 7)
sampleNames <- names(rescaled)
names(ariVec) <- sampleNames

for (i in 1:7) {
  ariVec[i] <- pairwiseRand(
    ref=as.integer(colLabels(rescaled[[i]])),
    alt=as.integer(clusters.mnn[linear_rescaled$batch==sampleNames[i]]),
    mode="index")
}
ariVec <- round(ariVec,2)
ariVec
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4     PBMMC_1b      PBMMC_2 
##         0.36         0.36         0.71         0.81         0.82         0.72 
##      PBMMC_3 
##         0.85

A sample may show a low Rand index value if cells grouped together in a small cluster before correction are split into distinct clusters after correction because the latter comprise cell populations not observed in that sample but present in other samples.

This would be the case of GSM3872434 with far fewer erythrocytes (grouped in a single cluster) than GSM3872443, in which subtypes can be distinguished.

We can also break down the adjusted Rand index (ARI) into per-cluster ratios for more detailed diagnostics. For example, we could see low ratios off the diagonal if distinct clusters in the within-batch clustering were incorrectly aggregated in the merged clustering. Conversely, we might see low ratios on the diagonal if the correction inflated or introduced spurious heterogeneity inside a within-batch cluster.

# pairwiseRand(), ratio, adjusted
# square numeric matrix is returned with number of rows equal to the number of unique levels in ref.

tabList <- vector(mode = "list", length = 7)
for (i in 1:7) {
  tabList[[i]] <- pairwiseRand(
    ref=as.integer(colLabels(rescaled[[i]])),
    alt=as.integer(clusters.mnn[linear_rescaled$batch==sampleNames[i]])
    )
}
randVal <- unlist(tabList) 

## make breaks from combined range
limits <- c(
  min(randVal, na.rm = TRUE),
  max(randVal, na.rm = TRUE))
limits <- quantile(randVal, probs=c(0.05, 0.95), na.rm = TRUE)

Breaks <- seq(limits[1], limits[2],
              length = 100)

plotList <- vector(mode = "list", length = 7)
for (i in 1:7) {
  plotList[[i]] <- pheatmap(tabList[[i]],
                                 cluster_row=FALSE,
                                 cluster_col=FALSE,
                                 breaks=Breaks,
                                 main=sprintf("%s ratio", sampleNames[i]),
                                 silent=TRUE)
}
grobList <- lapply(plotList, function(x){x[[4]]})
gridExtra::grid.arrange(grobs = grobList,
      ncol=2
)

We suggest limiting the use of per-gene corrected values to visualization, e.g., when coloring points on a t-SNE plot by per-cell expression. This can be more aesthetically pleasing than uncorrected expression values that may contain large shifts on the colour scale between cells in different batches. Use of the corrected values in any quantitative procedure should be treated with caution, and should be backed up by similar results from an analysis on the uncorrected values.

# save the mnn.out object in a file

saveRDS(mnn.out, file="./Robjects/DataIntegration_mnn.out.rds")

1.9 Session information

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] cowplot_1.1.1               BiocSingular_1.8.1         
##  [3] Cairo_1.5-12.2              clustree_0.4.3             
##  [5] ggraph_2.0.5                pheatmap_1.0.12            
##  [7] forcats_0.5.1               stringr_1.4.0              
##  [9] dplyr_1.0.7                 purrr_0.3.4                
## [11] readr_1.4.0                 tidyr_1.1.3                
## [13] tibble_3.1.2                tidyverse_1.3.1            
## [15] bluster_1.2.1               batchelor_1.8.0            
## [17] scran_1.20.1                scater_1.20.1              
## [19] ggplot2_3.3.5               scuttle_1.2.0              
## [21] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
## [23] Biobase_2.52.0              GenomicRanges_1.44.0       
## [25] GenomeInfoDb_1.28.1         IRanges_2.26.0             
## [27] S4Vectors_0.30.0            BiocGenerics_0.38.0        
## [29] MatrixGenerics_1.4.0        matrixStats_0.59.0         
## 
## loaded via a namespace (and not attached):
##  [1] Rtsne_0.15                ggbeeswarm_0.6.0         
##  [3] colorspace_2.0-2          ellipsis_0.3.2           
##  [5] XVector_0.32.0            BiocNeighbors_1.10.0     
##  [7] fs_1.5.0                  rstudioapi_0.13          
##  [9] farver_2.1.0              graphlayouts_0.7.1       
## [11] ggrepel_0.9.1             RSpectra_0.16-0          
## [13] fansi_0.5.0               lubridate_1.7.10         
## [15] xml2_1.3.2                sparseMatrixStats_1.4.0  
## [17] knitr_1.33                polyclip_1.10-0          
## [19] jsonlite_1.7.2            ResidualMatrix_1.2.0     
## [21] broom_0.7.8               cluster_2.1.2            
## [23] dbplyr_2.1.1              uwot_0.1.10              
## [25] ggforce_0.3.3             compiler_4.1.0           
## [27] httr_1.4.2                dqrng_0.3.0              
## [29] backports_1.2.1           assertthat_0.2.1         
## [31] Matrix_1.3-4              limma_3.48.1             
## [33] cli_3.0.0                 tweenr_1.0.2             
## [35] htmltools_0.5.1.1         tools_4.1.0              
## [37] rsvd_1.0.5                igraph_1.2.6             
## [39] gtable_0.3.0              glue_1.4.2               
## [41] GenomeInfoDbData_1.2.6    Rcpp_1.0.7               
## [43] cellranger_1.1.0          jquerylib_0.1.4          
## [45] vctrs_0.3.8               DelayedMatrixStats_1.14.0
## [47] xfun_0.24                 beachmat_2.8.0           
## [49] rvest_1.0.0               lifecycle_1.0.0          
## [51] irlba_2.3.3               statmod_1.4.36           
## [53] edgeR_3.34.0              MASS_7.3-54              
## [55] zlibbioc_1.38.0           scales_1.1.1             
## [57] tidygraph_1.2.0           hms_1.1.0                
## [59] RColorBrewer_1.1-2        yaml_2.2.1               
## [61] gridExtra_2.3             sass_0.4.0               
## [63] stringi_1.7.2             highr_0.9                
## [65] ScaledMatrix_1.0.0        BiocParallel_1.26.1      
## [67] rlang_0.4.11              pkgconfig_2.0.3          
## [69] bitops_1.0-7              evaluate_0.14            
## [71] lattice_0.20-44           labeling_0.4.2           
## [73] tidyselect_1.1.1          magrittr_2.0.1           
## [75] R6_2.5.0                  generics_0.1.0           
## [77] metapod_1.0.0             DelayedArray_0.18.0      
## [79] DBI_1.1.1                 pillar_1.6.1             
## [81] haven_2.4.1               withr_2.4.2              
## [83] RCurl_1.98-1.3            modelr_0.1.8             
## [85] crayon_1.4.1              utf8_1.2.1               
## [87] rmarkdown_2.9             viridis_0.6.1            
## [89] locfit_1.5-9.4            grid_4.1.0               
## [91] readxl_1.3.1              FNN_1.1.3                
## [93] reprex_2.0.0              digest_0.6.27            
## [95] munsell_0.5.0             beeswarm_0.4.0           
## [97] viridisLite_0.4.0         vipor_0.4.5              
## [99] bslib_0.2.5.1