projDir <- "/mnt/scratcha/bioinformatics/baller01/20200511_FernandesM_ME_crukBiSs2020"
outDirBit <- "AnaWiSce/Attempt1"
nbPcToComp <- 50
Source: ‘Integrating Datasets’ chapter in the OSCA book.
We will load the R file keeping the SCE object with the normalised counts, and subset 1000 cells per sample.
setName <- "caron"
setSuf <- ""
tmpFn <- sprintf("%s/%s/Robjects/%s_sce_nz_postDeconv%s.Rds", projDir, outDirBit, setName, setSuf)
print(tmpFn)
[1] "/mnt/scratcha/bioinformatics/baller01/20200511_FernandesM_ME_crukBiSs2020/AnaWiSce/Attempt1/Robjects/caron_sce_nz_postDeconv.Rds"
if(!file.exists(tmpFn))
{
knitr::knit_exit()
}
sce <- readRDS(tmpFn)
sce
class: SingleCellExperiment
dim: 18372 45669
metadata(0):
assays(1): counts
rownames(18372): ENSG00000238009 ENSG00000237491 ... ENSG00000275063
ENSG00000271254
rowData names(11): ensembl_gene_id external_gene_name ... detected
gene_sparsity
colnames: NULL
colData names(19): Sample Barcode ... discard cell_sparsity
reducedDimNames(0):
altExpNames(0):
colnames(rowData(sce))[colnames(rowData(sce)) == "strand"] <- "strandNum"
Get data for the PBMMC sample group:
# CaronBourque2020
cb_sampleSheetFn <- file.path(projDir, "Data/CaronBourque2020/SraRunTable.txt")
cb_sampleSheet <- read.table(cb_sampleSheetFn, header=T, sep=",")
splVec <- cb_sampleSheet %>% filter(source_name == "PBMMC") %>%
pull(Sample.Name)
sourceNames <- unique(colData(sce)$source_name)
sceOrig <- sce
sce <- sceOrig[,sce$source_name == "PBMMC" ]
nbCells <- 1000
all.sce <- list()
for(spx in splVec)
{
vec.bc <- colData(sce) %>%
data.frame() %>%
filter(Sample.Name == spx) %>%
sample_n(nbCells) %>%
pull(Barcode)
tmpInd <- which(colData(sce)$Barcode %in% vec.bc)
all.sce[[spx]] <- sce[,tmpInd]
}
#--- normalization ---#
all.sce <- lapply(all.sce, logNormCounts)
#--- variance-modelling ---#
library(scran)
all.dec <- lapply(all.sce, modelGeneVar)
all.hvgs <- lapply(all.dec, getTopHVGs, prop=0.1)
#--- dimensionality-reduction ---#
library(BiocSingular)
set.seed(10000)
all.sce <- mapply(FUN=runPCA, x=all.sce, subset_row=all.hvgs,
MoreArgs=list(ncomponents=25, BSPARAM=RandomParam()),
SIMPLIFY=FALSE)
set.seed(100000)
all.sce <- lapply(all.sce, runTSNE, dimred="PCA")
set.seed(1000000)
all.sce <- lapply(all.sce, runUMAP, dimred="PCA")
#--- clustering ---#
for (n in names(all.sce)) {
g <- buildSNNGraph(all.sce[[n]], k=10, use.dimred='PCA')
clust <- igraph::cluster_walktrap(g)$membership
colLabels(all.sce[[n]]) <- factor(clust)
}
To prepare for the batch correction:
allNames <- unlist(lapply(all.sce, function(x){rownames(x)}))
allNamesNb <- table(allNames)
universe <- names(allNamesNb)[allNamesNb==3]
length(universe)
[1] 18372
# Subsetting the SingleCellExperiment object.
uni.sce <- lapply(all.sce, function(x){x[universe,]})
# Also subsetting the variance modelling results, for convenience.
uni.dec <- lapply(all.dec, function(x){x[universe,]})
library(batchelor)
rescaled <- multiBatchNorm(uni.sce, batch = "Sample.Name")
library(scran)
combined.dec <- combineVar(uni.dec[[1]], uni.dec[[2]], uni.dec[[3]])
chosen.hvgs <- combined.dec$bio > 0
sum(chosen.hvgs)
[1] 8697
When integrating datasets of variable composition, it is generally safer to err on the side of including more genes than are used in a single dataset analysis, to ensure that markers are retained for any dataset-specific subpopulations that might be present. For a top X selection, this means using a larger X (say, ~5000), or in this case, we simply take all genes above the trend.
Alternatively, a more forceful approach to feature selection can be used based on marker genes from within-batch comparisons.
Before we actually perform any correction, it is worth examining whether there is any batch effect in this dataset. We combine the two SingleCellExperiments and perform a PCA on the log-expression values for all genes with positive (average) biological components.
# Synchronizing the metadata for cbind()ing.
identical(rowData(rescaled[[1]]), rowData(rescaled[[2]]))
[1] TRUE
identical(rowData(rescaled[[1]]), rowData(rescaled[[3]]))
[1] TRUE
rescaled[[1]]$batch <- rescaled[[1]]$Sample.Name
rescaled2 <- lapply(rescaled, function(x){x$batch <- x$Sample.Name; x})
rescaled <- rescaled2
uncorrected <- cbind(rescaled[[1]], rescaled[[2]], rescaled[[3]])
# Using RandomParam() as it is more efficient for file-backed matrices.
library(scater)
set.seed(0010101010)
uncorrected <- runPCA(uncorrected, subset_row=chosen.hvgs,
BSPARAM=BiocSingular::RandomParam())
We use graph-based clustering on the components to obtain a summary of the population structure.
As our the samples should be replicates, each cluster should ideally consist of cells from each batch. However, we instead see clusters that are comprised of cells from a single batch. This indicates that cells of the same type are artificially separated due to technical differences between batches.
library(scran)
snn.gr <- buildSNNGraph(uncorrected, use.dimred="PCA")
clusters <- igraph::cluster_walktrap(snn.gr)$membership
tab <- table(Cluster=clusters, Batch=uncorrected$batch)
tab
Batch
Cluster GSM3872442 GSM3872443 GSM3872444
1 194 43 97
2 2 115 7
3 113 16 304
4 12 66 18
5 129 9 21
6 23 97 130
7 87 21 49
8 163 36 152
9 0 52 0
10 1 7 151
11 0 49 0
12 0 135 8
13 8 10 7
14 0 22 0
15 15 14 9
16 0 226 0
17 23 13 5
18 0 60 0
19 213 0 0
20 17 9 42
We can also visualize the corrected coordinates using a t-SNE plot. The strong separation between cells from different batches is consistent with the clustering results.
set.seed(1111001)
uncorrected <- runTSNE(uncorrected, dimred="PCA")
plotTSNE(uncorrected, colour_by="batch")
Of course, the other explanation for batch-specific clusters is that there are cell types that are unique to each batch. The degree of intermingling of cells from different batches is not an effective diagnostic when the batches involved might actually contain unique cell subpopulations (which is not a consideration in the PBMC dataset, but the same cannot be said in general). If a cluster only contains cells from a single batch, one can always debate whether that is caused by a failure of the correction method or if there is truly a batch-specific subpopulation. For example, do batch-specific metabolic or differentiation states represent distinct subpopulations? Or should they be merged together? We will not attempt to answer this here, only noting that each batch correction algorithm will make different (and possibly inappropriate) decisions on what constitutes “shared” and “unique” populations.
Batch effects in bulk RNA sequencing studies are commonly removed with linear regression. This involves fitting a linear model to each gene’s expression profile, setting the undesirable batch term to zero and recomputing the observations sans the batch effect, yielding a set of corrected expression values for downstream analyses. Linear modelling is the basis of the removeBatchEffect() function from the limma package (Ritchie et al. 2015) as well the comBat() function from the sva package (Leek et al. 2012).
To use this approach in a scRNA-seq context, we assume that the composition of cell subpopulations is the same across batches. We also assume that the batch effect is additive, i.e., any batch-induced fold-change in expression is the same across different cell subpopulations for any given gene. These are strong assumptions as batches derived from different individuals will naturally exhibit variation in cell type abundances and expression. Nonetheless, they may be acceptable when dealing with batches that are technical replicates generated from the same population of cells. (In fact, when its assumptions hold, linear regression is the most statistically efficient as it uses information from all cells to compute the common batch vector.) Linear modelling can also accommodate situations where the composition is known a priori by including the cell type as a factor in the linear model, but this situation is even less common.
We use the rescaleBatches() function from the batchelor package to remove the batch effect. This is roughly equivalent to applying a linear regression to the log-expression values per gene, with some adjustments to improve performance and efficiency. For each gene, the mean expression in each batch is scaled down until it is equal to the lowest mean across all batches. We deliberately choose to scale all expression values down as this mitigates differences in variance when batches lie at different positions on the mean-variance trend. (Specifically, the shrinkage effect of the pseudo-count is greater for smaller counts, suppressing any differences in variance across batches.) An additional feature of rescaleBatches() is that it will preserve sparsity in the input matrix for greater efficiency, whereas other methods like removeBatchEffect() will always return a dense matrix.
library(batchelor)
rescaled2 <- rescaleBatches(rescaled)
rescaled2
class: SingleCellExperiment
dim: 18372 3000
metadata(0):
assays(1): corrected
rownames(18372): ENSG00000000003 ENSG00000000419 ... ENSG00000285486
ENSG00000285492
rowData names(0):
colnames: NULL
colData names(1): batch
reducedDimNames(0):
altExpNames(0):
After clustering, we should observe that most clusters consist of mixtures of cells from the replicate batches, consistent with the removal of the batch effect.
set.seed(1010101010) # To ensure reproducibility of IRLBA.
rescaled2 <- runPCA(rescaled2, subset_row=chosen.hvgs, exprs_values="corrected")
snn.gr <- buildSNNGraph(rescaled2, use.dimred="PCA")
clusters.resc <- igraph::cluster_walktrap(snn.gr)$membership
tab.resc <- table(Cluster=clusters.resc, Batch=rescaled2$batch)
tab.resc
Batch
Cluster GSM3872442 GSM3872443 GSM3872444
1 123 43 171
2 24 36 6
3 189 42 99
4 0 117 1
5 0 53 0
6 12 0 18
7 291 224 244
8 86 23 45
9 34 79 58
10 0 40 0
11 17 14 11
12 2 0 14
13 167 47 178
14 8 9 7
15 25 87 107
16 22 9 41
17 0 17 0
18 0 27 0
19 0 133 0
rescaled2 <- runTSNE(rescaled2, dimred="PCA")
rescaled2$batch <- factor(rescaled2$batch)
plotTSNE(rescaled2, colour_by="batch")
Consider a cell a in batch A, and identify the cells in batch B that are nearest neighbors to a in the expression space defined by the selected features. Repeat this for a cell b in batch B, identifying its nearest neighbors in A
. Mutual nearest neighbors are pairs of cells from different batches that belong in each other’s set of nearest neighbors. The reasoning is that MNN pairs represent cells from the same biological state prior to the application of a batch effect - see Haghverdi et al. (2018) for full theoretical details. Thus, the difference between cells in MNN pairs can be used as an estimate of the batch effect, the subtraction of which yields batch-corrected values.
Compared to linear regression, MNN correction does not assume that the population composition is the same or known beforehand. This is because it learns the shared population structure via identification of MNN pairs and uses this information to obtain an appropriate estimate of the batch effect. Instead, the key assumption of MNN-based approaches is that the batch effect is orthogonal to the biology in high-dimensional expression space. Violations reduce the effectiveness and accuracy of the correction, with the most common case arising from variations in the direction of the batch effect between clusters. Nonetheless, the assumption is usually reasonable as a random vector is very likely to be orthogonal in high-dimensional space.
The batchelor package provides an implementation of the MNN approach via the fastMNN() function. (Unlike the MNN method originally described by Haghverdi et al. (2018), the fastMNN() function performs PCA to reduce the dimensions beforehand and speed up the downstream neighbor detection steps.) We apply it to our two PBMC batches to remove the batch effect across the highly variable genes in chosen.hvgs. To reduce computational work and technical noise, all cells in all batches are projected into the low-dimensional space defined by the top d principal components. Identification of MNNs and calculation of correction vectors are then performed in this low-dimensional space.
# Using randomized SVD here, as this is faster than
# irlba for file-backed matrices.
set.seed(1000101001)
mnn.out <- fastMNN(rescaled, auto.merge=TRUE, d=50, k=20, subset.row=chosen.hvgs,
BSPARAM=BiocSingular::RandomParam(deferred=TRUE))
mnn.out
class: SingleCellExperiment
dim: 8697 3000
metadata(2): merge.info pca.info
assays(1): reconstructed
rownames(8697): ENSG00000000003 ENSG00000000457 ... ENSG00000285437
ENSG00000285476
rowData names(1): rotation
colnames: NULL
colData names(1): batch
reducedDimNames(1): corrected
altExpNames(0):
The function returns a SingleCellExperiment object containing corrected values for downstream analyses like clustering or visualization. Each column of mnn.out corresponds to a cell in one of the batches, while each row corresponds to an input gene in chosen.hvgs. The batch field in the column metadata contains a vector specifying the batch of origin of each cell.
print(head(mnn.out$batch))
[1] "GSM3872442" "GSM3872442" "GSM3872442" "GSM3872442" "GSM3872442"
[6] "GSM3872442"
The corrected matrix in the reducedDims() contains the low-dimensional corrected coordinates for all cells, which we will use in place of the PCs in our downstream analyses.
dim(reducedDim(mnn.out, "corrected"))
[1] 3000 50
A reconstructed matrix in the assays() contains the corrected expression values for each gene in each cell, obtained by projecting the low-dimensional coordinates in corrected back into gene expression space. We do not recommend using this for anything other than visualization.
print(assay(mnn.out, "reconstructed"))
<8697 x 3000> matrix of class LowRankMatrix and type "double":
[,1] [,2] [,3] ... [,2999]
ENSG00000000003 -1.722144e-04 -3.236026e-06 -1.561609e-04 . -1.083770e-04
ENSG00000000457 5.854177e-05 -3.582852e-04 -2.707575e-04 . -4.484206e-04
ENSG00000000938 -2.708141e-03 1.123687e-02 -4.248379e-04 . -3.218795e-03
ENSG00000000971 -3.124645e-04 2.581670e-04 -5.870974e-05 . 3.332748e-04
ENSG00000001167 2.246984e-04 -1.155308e-03 -5.822907e-05 . -1.416768e-03
... . . . . .
ENSG00000285292 1.188323e-04 -1.027294e-07 -7.757741e-05 . -1.587947e-04
ENSG00000285413 1.533414e-05 1.961314e-04 -1.581951e-04 . -1.875805e-04
ENSG00000285427 9.997874e-05 7.792941e-05 1.263912e-05 . 1.183874e-04
ENSG00000285437 2.222927e-04 -1.729690e-03 -4.274901e-04 . 1.501870e-03
ENSG00000285476 -1.391055e-04 1.095048e-04 1.600974e-04 . 1.808076e-04
[,3000]
ENSG00000000003 -1.780924e-04
ENSG00000000457 -1.985935e-04
ENSG00000000938 -1.425305e-03
ENSG00000000971 1.615289e-04
ENSG00000001167 2.060331e-05
... .
ENSG00000285292 1.731371e-04
ENSG00000285413 2.063745e-04
ENSG00000285427 -8.458711e-05
ENSG00000285437 4.845550e-04
ENSG00000285476 1.190955e-04
The most relevant parameter for tuning fastMNN() is k, which specifies the number of nearest neighbors to consider when defining MNN pairs. This can be interpreted as the minimum anticipated frequency of any shared cell type or state in each batch. Increasing k will generally result in more aggressive merging as the algorithm is more generous in matching subpopulations across batches. It can occasionally be desirable to increase k if one clearly sees that the same cell types are not being adequately merged across batches.
We cluster on the low-dimensional corrected coordinates to obtain a partitioning of the cells that serves as a proxy for the population structure. If the batch effect is successfully corrected, clusters corresponding to shared cell types or states should contain cells from multiple batches. We see that all clusters contain contributions from each batch after correction, consistent with our expectation that the two batches are replicates of each other.
library(scran)
snn.gr <- buildSNNGraph(mnn.out, use.dimred="corrected")
clusters.mnn <- igraph::cluster_walktrap(snn.gr)$membership
tab.mnn <- table(Cluster=clusters.mnn, Batch=mnn.out$batch)
tab.mnn
Batch
Cluster GSM3872442 GSM3872443 GSM3872444
1 195 32 101
2 5 60 12
3 99 67 156
4 84 21 42
5 39 92 70
6 9 51 10
7 2 100 4
8 23 31 5
9 17 14 12
10 8 11 7
11 0 137 7
12 287 222 237
13 163 47 183
14 23 21 15
15 21 9 41
16 25 85 98
We can also visualize the corrected coordinates using a t-SNE plot. The presence of visual clusters containing cells from both batches provides a comforting illusion that the correction was successful.
library(scater)
set.seed(0010101010)
mnn.out <- runTSNE(mnn.out, dimred="corrected")
mnn.out$batch <- factor(mnn.out$batch)
plotTSNE(mnn.out, colour_by="batch")
#mnn.out$type <- gsub("_[1-4]","",mnn.out$batch)
#p <- plotTSNE(mnn.out, colour_by="batch", shape_by="type")
#p + facet_wrap(. ~ mnn.out$type)
For fastMNN(), one useful diagnostic is the proportion of variance within each batch that is lost during MNN correction. Specifically, this refers to the within-batch variance that is removed during orthogonalization with respect to the average correction vector at each merge step. This is returned via the lost.var field in the metadata of mnn.out, which contains a matrix of the variance lost in each batch (column) at each merge step (row).
metadata(mnn.out)$merge.info$lost.var
GSM3872442 GSM3872443 GSM3872444
[1,] 0.02983436 0.0000000 0.04245025
[2,] 0.01705727 0.1491781 0.02193779
Large proportions of lost variance (>10%) suggest that correction is removing genuine biological heterogeneity. This would occur due to violations of the assumption of orthogonality between the batch effect and the biological subspace (Haghverdi et al. 2018). In this case, the proportion of lost variance is small, indicating that non-orthogonality is not a major concern.
Show clusters:
mnn.out$cluster <- paste0("c", clusters.mnn)
plotTSNE(mnn.out, colour_by="cluster")
Show known marker genes:
genesToShow <- c(
"CD79A", # CD79A B ***
"CST3", # CST3 monocytes ***
"CD3D", # CD3D T cells ***
"HBA1" # HBA1 erythrocytes ***
)
tmpInd <- which(rowData(uncorrected)$Symbol %in% genesToShow)
ensToShow <- rowData(uncorrected)$ensembl_gene_id[tmpInd]
B cells:
genex <- ensToShow[1]
p <- plotTSNE(mnn.out, colour_by = genex, by_exprs_values="reconstructed")
p <- p + ggtitle(
paste("B cells", genex,
rowData(uncorrected)[genex,"Symbol"])
)
print(p)
T cells:
genex <- ensToShow[3]
p <- plotTSNE(mnn.out, colour_by = genex, by_exprs_values="reconstructed")
p <- p + ggtitle(
paste("T cells", genex,
rowData(uncorrected)[genex,"Symbol"])
)
print(p)
monocytes:
genex <- ensToShow[2]
p <- plotTSNE(mnn.out, colour_by = genex, by_exprs_values="reconstructed")
p <- p + ggtitle(
paste("monocytes", genex,
rowData(uncorrected)[genex,"Symbol"])
)
print(p)
erythrocytes:
genex <- ensToShow[4]
p <- plotTSNE(mnn.out, colour_by = genex, by_exprs_values="reconstructed")
p <- p + ggtitle(
paste("erythrocytes", genex,
rowData(uncorrected)[genex,"Symbol"])
)
print(p)
Other genes (exercise)
genesToShow2 <- c(
"IL7R", # IL7R, CCR7 Naive CD4+ T
"CCR7", # IL7R, CCR7 Naive CD4+ T
"S100A4", # IL7R, S100A4 Memory CD4+
"CD14", # CD14, LYZ CD14+ Mono
"LYZ", # CD14, LYZ CD14+ Mono
"MS4A1", # MS4A1 B
"CD8A", # CD8A CD8+ T
"FCGR3A", # FCGR3A, MS4A7 FCGR3A+ Mono
"MS4A7", # FCGR3A, MS4A7 FCGR3A+ Mono
"GNLY", # GNLY, NKG7 NK
"NKG7", # GNLY, NKG7 NK
"FCER1A", # DC
"CST3", # DC
"PPBP" # Platelet
)
tmpInd <- which(rowData(uncorrected)$Symbol %in% genesToShow2)
ensToShow <- rowData(uncorrected)$ensembl_gene_id[tmpInd]
for (genex in ensToShow)
{
p <- plotTSNE(mnn.out, colour_by = genex, by_exprs_values="reconstructed")
p <- p + ggtitle(paste(genex, rowData(uncorrected)[genex,"Symbol"]))
print(p)
}
Another useful diagnostic check is to compare the clustering within each batch to the clustering of the merged data. Accurate data integration should preserve variance within each batch as there should be nothing to remove between cells in the same batch. This check complements the previously mentioned diagnostics that only focus on the removal of differences between batches. Specifically, it protects us against cases where the correction method simply aggregates all cells together, which would achieve perfect mixing but also discard the biological heterogeneity of interest.
Ideally, we should see a many-to-1 mapping where the across-batch clustering is nested inside the within-batch clusterings. This indicates that any within-batch structure was preserved after correction while acknowledging that greater resolution is possible with more cells. In practice, more discrepancies can be expected even when the correction is perfect, due to the existence of closely related clusters that were arbitrarily separated in the within-batch clustering. As a general rule, we can be satisfied with the correction if the vast majority of entries are zero, though this may depend on whether specific clusters of interest are gained or lost.
One heatmap is generated for each datasets, where each entry is colored according to the number of cells with each pair of labels (before and after correction).
library(pheatmap)
# For the first batch (adding +10 for a smoother color transition
# from zero to non-zero counts for any given matrix entry).
tab <- table(paste("after", clusters.mnn[rescaled2$batch=="GSM3872443"]),
paste("before", colLabels(rescaled[[1]])))
heat1 <- pheatmap(log10(tab+10), cluster_row=FALSE, cluster_col=FALSE,
main="GSM3872443 comparison", silent=TRUE)
# For the second batch.
tab <- table(paste("after", clusters.mnn[rescaled2$batch=="GSM3872444"]),
paste("before", colLabels(rescaled[[2]])))
heat2 <- pheatmap(log10(tab+10), cluster_row=FALSE, cluster_col=FALSE,
main="GSM3872444 comparison", silent=TRUE)
gridExtra::grid.arrange(heat1[[4]], heat2[[4]])
Another evaluation approach is to compute the coassignment probabilities, i.e., the probability that cells from two within-batch clusters are clustered together in the across-batch clustering. High probabilities off the diagonal indicate that within-batch clusters are merged in the across-batch analysis. We would generally expect low off-diagonal probabilities for most pairs of clusters, though this may not be reasonably possible if the within-batch clusters were poorly separated in the first place.
Coassignment probabilities for the within-batch clusters, based on coassignment of cells in the across-batch clusters obtained after MNN correction. One heatmap is generated for each sample, where each entry is colored according to the coassignment probability between each pair of within-batch clusters:
# For the first batch.
tab <- coassignProb(colLabels(rescaled[[1]]), clusters.mnn[rescaled2$batch=="GSM3872443"])
heat1 <- pheatmap(tab, cluster_row=FALSE, cluster_col=FALSE,
col=rev(viridis::magma(100)), main="GSM3872443 probabilities", silent=TRUE)
# For the second batch.
tab <- coassignProb(colLabels(rescaled[[2]]), clusters.mnn[rescaled2$batch=="GSM3872444"])
heat2 <- pheatmap(tab, cluster_row=FALSE, cluster_col=FALSE,
col=rev(viridis::magma(100)), main="GSM3872444 probabilities", silent=TRUE)
gridExtra::grid.arrange(heat1[[4]], heat2[[4]])
Finally, we can summarize the agreement between clusterings by computing the Rand index. This provides a simple metric that we can use to assess the preservation of variation by different correction methods. Larger rand indices (i.e., closer to 1) are more desirable, though this must be balanced against the ability of each method to actually remove the batch effect.
suppressMessages(library(fossil))
ri1 <- rand.index(as.integer(clusters.mnn[rescaled2$batch=="GSM3872443"]),
as.integer(colLabels(rescaled[[1]])))
ri1
[1] 0.7836456
ri2 <- rand.index(as.integer(clusters.mnn[rescaled2$batch=="GSM3872444"]),
as.integer(colLabels(rescaled[[2]])))
ri2
[1] 0.7688188
In some situations, we will already have performed within-batch analyses to characterize salient aspects of population heterogeneity. This is not uncommon when merging datasets from different sources where each dataset has already been analyzed, annotated and interpreted separately. It is subsequently desirable for the integration procedure to retain these “known interesting” aspects of each dataset in the merged dataset. We can encourage this outcome by using the marker genes within each dataset as our selected feature set for fastMNN() and related methods. This focuses on the relevant heterogeneity and represents a semi-supervised approach.
To illustrate, we apply this strategy to our PBMC datasets. We identify the top marker genes from pairwise Wilcoxon ranked sum tests between every pair of clusters within each batch, analogous to the method used by SingleR. In this case, we use the top 10 marker genes but any value can be used depending on the acceptable trade-off between signal and noise (and speed). We then take the union across all comparisons in all batches and use that in place of our HVG set in fastMNN().
# Recall that groups for marker detection
# are automatically defined from 'colLabels()'.
stats1 <- pairwiseWilcox(rescaled[[1]], direction="up")
markers1 <- getTopMarkers(stats1[[1]], stats1[[2]], n=10)
stats2 <- pairwiseWilcox(rescaled[[2]], direction="up")
markers2 <- getTopMarkers(stats2[[1]], stats2[[2]], n=10)
stats3 <- pairwiseWilcox(rescaled[[3]], direction="up")
markers3 <- getTopMarkers(stats3[[1]], stats3[[2]], n=10)
marker.set <- unique(unlist(c(unlist(markers1), unlist(markers2), unlist(markers3))))
length(marker.set) # getting the total number of genes selected in this manner.
[1] 358
set.seed(1000110)
mnn.out2 <- fastMNN(rescaled, subset.row=marker.set,
BSPARAM=BiocSingular::RandomParam(deferred=TRUE))
A quick inspection indicates that the original within-batch structure is indeed preserved in the corrected data. This highlights the utility of a marker-based feature set for integrating datasets that have already been characterized separately in a manner that preserves existing interpretations of each dataset. We note that some within-batch clusters have merged, most likely due to the lack of robust separation in the first place, though this may also be treated as a diagnostic on the appropriateness of the integration depending on the context.
mnn.out2 <- runTSNE(mnn.out2, dimred="corrected")
gridExtra::grid.arrange(
plotTSNE(mnn.out2[,mnn.out2$batch=="GSM3872442"], colour_by=I(colLabels(rescaled[[1]]))),
plotTSNE(mnn.out2[,mnn.out2$batch=="GSM3872443"], colour_by=I(colLabels(rescaled[[2]]))),
plotTSNE(mnn.out2[,mnn.out2$batch=="GSM3872444"], colour_by=I(colLabels(rescaled[[3]]))),
ncol=2
)
We suggest limiting the use of per-gene corrected values to visualization, e.g., when coloring points on a t-SNE plot by per-cell expression. This can be more aesthetically pleasing than uncorrected expression values that may contain large shifts on the colour scale between cells in different batches. Use of the corrected values in any quantitative procedure should be treated with caution, and should be backed up by similar results from an analysis on the uncorrected values.
m.out <- findMarkers(uncorrected, clusters.mnn, block=uncorrected$batch,
direction="up", lfc=1, row.data=rowData(uncorrected)[,c("ensembl_gene_id","Symbol"),drop=FALSE])
# A (probably activated?) T cell subtype of some sort:
demo <- m.out[["7"]]
as.data.frame(demo[1:20,c("Symbol", "Top", "p.value", "FDR")])
Expression level fot the top gene, ENSG00000122644:
geneEnsId <- rownames(demo)[1]
plotExpression(uncorrected, x=I(factor(clusters.mnn)),
features=geneEnsId, colour_by="batch") + facet_wrap(~colour_by)