0.1 Why do we need Data Integration and Batch Correction?
0.2 Test Data Demonstration
0.3 Load the data
0.4 Prepare the data
0.5 View the uncorrected data
0.6 Correct the data - Mutual Nearest Neighbour
0.7 View the corrected data
0.8 The quickCorrect wrapper
0.9 Multiple Batches
0.10 Retrieving the metadata
0.11 Correction Diagnostics
1 Exercise for this week
- 1.1 Session information

0.1 Why do we need Data Integration and Batch Correction?

Large single-cell RNA sequencing (scRNA-seq) projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., changes in operator, differences in reagent quality. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results.

Computational removal of batch-to-batch variation allows us to combine data across multiple batches for a consolidated downstream analysis. However, existing methods based on linear models (Ritchie et al. 2015; Leek et al. 2012) assume that the composition of cell populations are either known or the same across batches. To overcome these limitations, bespoke methods have been developed for batch correction of single-cell data (Haghverdi et al. 2018; Butler et al. 2018; Lin et al. 2019) that do not require a priori knowledge about the composition of the population. This allows them to be used in workflows for exploratory analyses of scRNA-seq data where such knowledge is usually unavailable.

0.2 Test Data Demonstration

This is the demonstration Abbi will run through in the taught session. You can use this as a jumping off point for this weeks exercise (details at the bottom of the page). Remember to rsync your Course_Materials directory to get the latest files before you start.

library(scater)
library(scran)
library(batchelor)
library(bluster)
library(tidyverse)
library(pheatmap)
library(BiocSingular)
library(cowplot)

0.3 Load the data

To demonstrate the code we will load the data we have for two samples which are technical replicates of one another from two different 10X runs and so if there were no batch effect they should be the same. These samples have been processed as discussed up until this point in the course.

ori.sce.sample.1 <- readRDS("../Robjects/BC_sample1_dimred.rds")
ori.sce.sample.2 <- readRDS("../Robjects/BC_sample2_dimred.rds")

dec.sample.1 <- readRDS("../Robjects/BC_dec1_dimred.rds")
dec.sample.2 <- readRDS("../Robjects/BC_dec2_dimred.rds")

0.4 Prepare the data

The first and most obvious is to subset all batches to the common “universe” of features. In this case, it is straightforward as both batches use Ensembl gene annotation; more difficult integrations will require some mapping of identifiers using the packages from Annotation Hub.

universe <- intersect(rownames(ori.sce.sample.1), rownames(ori.sce.sample.2))
length(universe)

## [1] 28377

# Subsetting the SingleCellExperiment object
sce.sample.1 <- ori.sce.sample.1[universe,]
sce.sample.2 <- ori.sce.sample.2[universe,]

# Also subsetting the variance modelling results
dec.sample.1 <- dec.sample.1[universe,]
dec.sample.2 <- dec.sample.2[universe,]

The second step is to rescale each batch to adjust for differences in sequencing depth between batches. The multiBatchNorm() function recomputes log-normalized expression values after adjusting the size factors for systematic differences in coverage between SingleCellExperiment objects. (Size factors only remove biases between cells within a single batch.) This improves the quality of the correction by removing one aspect of the technical differences between batches.

rescaled <- multiBatchNorm(sce.sample.1, sce.sample.2)
sce.sample.1 <- rescaled[[1]]
sce.sample.2 <- rescaled[[2]]

Finally, we perform feature selection by averaging the variance components across all batches with the combineVar() function. We compute the average as it is responsive to batch-specific HVGs while still preserving the within-batch ranking of genes. This allows us to use the same strategies discussed earlier to select genes of interest. In contrast, approaches based on taking the intersection or union of HVGs across batches become increasingly conservative or liberal, respectively, with an increasing number of batches.

combined.dec <- combineVar(dec.sample.1, dec.sample.2)
chosen.hvgs <- combined.dec$bio > 0
sum(chosen.hvgs)

## [1] 10367

When integrating datasets of variable composition, it is generally safer to err on the side of including more HVGs than are used in a single dataset analysis, to ensure that markers are retained for any dataset-specific subpopulations that might be present. That said, many of the signal-to-noise considerations discussed previously still apply here, so some experimentation may be necessary for best results.

0.5 View the uncorrected data

Before we actually perform any correction, it is worth examining whether there is any batch effect in this dataset. We combine the two SingleCellExperiments and perform a PCA on the log-expression values for our selected subset of HVGs.

# Synchronizing the metadata for cbind()ing.
rowData(sce.sample.1) <- rowData(sce.sample.2)
sce.sample.1$batch <- "1"
sce.sample.2$batch <- "2"
uncorrected <- cbind(sce.sample.1, sce.sample.2)

#run the PCA
set.seed(0010101010)
uncorrected <- runPCA(uncorrected, subset_row=chosen.hvgs)

We use graph-based clustering on the components to obtain a summary of the population structure. As our two sample populations should be replicates, each cluster should ideally consist of cells from both batches. However, we instead see clusters that are mostly comprised of cells from a single batch. This indicates that cells of the same type are artificially separated due to technical differences between batches.

uncorrected.snn.gr <- buildSNNGraph(uncorrected, use.dimred="PCA")
uncorrected.clusters <- igraph::cluster_walktrap(uncorrected.snn.gr)$membership
uncorrected.tab <- table(Cluster=uncorrected.clusters, Batch=uncorrected$batch)
uncorrected.tab

##        Batch
## Cluster   1   2
##      1   19   3
##      2  135  18
##      3   98 143
##      4   21  71
##      5   23  97
##      6   44  92
##      7  248 145
##      8   49  45
##      9   14  18
##      10  13   7
##      11   4   9
##      12  16  30
##      13  35  14
##      14   7  23
##      15  42   5
##      16   7 425
##      17 163   6

This is supported by the t-SNE visualization where the separation between cells from different batches is consistent with the clustering results.

set.seed(1111001)
uncorrected <- runTSNE(uncorrected, dimred="PCA")
plotTSNE(uncorrected, colour_by="batch")

Of course, the other explanation for batch-specific clusters is that there are cell types that are unique to each batch. The degree of intermingling of cells from different batches is not an effective diagnostic when the batches involved might actually contain unique cell subpopulations (which is not a consideration with these samples, but the same cannot be said in general). If a cluster only contains cells from a single batch, one can always debate whether that is caused by a failure of the correction method or if there is truly a batch-specific subpopulation. For example, do batch-specific metabolic or differentiation states represent distinct subpopulations? Or should they be merged together? Each batch correction algorithm will make different (and possibly inappropriate) decisions on what constitutes “shared” and “unique” populations.

0.6 Correct the data - Mutual Nearest Neighbour

Consider a cell ‘a’ in batch ‘A’, and identify the cells in batch ‘B’ that are nearest neighbors to ‘a’ in the expression space defined by the selected features. Repeat this for a cell’b’ in batch ‘B’, identifying its nearest neighbors in ‘A’.

Mutual nearest neighbors are pairs of cells from different batches that belong in each other’s set of nearest neighbors. The reasoning is that MNN pairs represent cells from the same biological state prior to the application of a batch effect - see Haghverdi et al. (2018) for full theoretical details. Thus, the difference between cells in MNN pairs can be used as an estimate of the batch effect, the subtraction of which yields batch-corrected values.

Compared to linear regression, MNN correction does not assume that the population composition is the same or known beforehand. This is because it learns the shared population structure via identification of MNN pairs and uses this information to obtain an appropriate estimate of the batch effect. Instead, the key assumption of MNN-based approaches is that the batch effect is orthogonal to the biology in high-dimensional expression space. Violations reduce the effectiveness and accuracy of the correction, with the most common case arising from variations in the direction of the batch effect between clusters. Nonetheless, the assumption is usually reasonable as a random vector is very likely to be orthogonal in high-dimensional space.

The batchelor package provides an implementation of the MNN approach via the fastMNN() function. (Unlike the MNN method originally described by Haghverdi et al. (2018), the fastMNN() function performs PCA to reduce the dimensions beforehand and speed up the downstream neighbor detection steps.) We apply it to our two batches to remove the batch effect across the highly variable genes in chosen.hvgs. To reduce computational work and technical noise, all cells in all batches are projected into the low-dimensional space defined by the top ‘d’ principal components. Identification of MNNs and calculation of correction vectors are then performed in this low-dimensional space.

set.seed(1000101001)
mnn.out <- fastMNN(sce.sample.1, sce.sample.2, d=50, k=20, subset.row=chosen.hvgs)
mnn.out

## class: SingleCellExperiment 
## dim: 10367 2089 
## metadata(2): merge.info pca.info
## assays(1): reconstructed
## rownames(10367): ENSG00000241860 ENSG00000237491 ... ENSG00000276345
##   ENSG00000278633
## rowData names(1): rotation
## colnames(2089): 9_AAACCTGCACTTCGAA-1 9_AAACCTGCAGACGCAA-1 ...
##   10_TTTGTCAGTAAATGTG-1 10_TTTGTCAGTACAAGTA-1
## colData names(1): batch
## reducedDimNames(1): corrected
## mainExpName: NULL
## altExpNames(0):

The function returns a SingleCellExperiment object containing corrected values for downstream analyses like clustering or visualization. Each column of mnn.out corresponds to a cell in one of the batches, while each row corresponds to an input gene in chosen.hvgs. The batch field in the column metadata contains a vector specifying the batch of origin of each cell.

The corrected matrix in the reducedDims() contains the low-dimensional corrected coordinates for all cells, which we will use in place of the PCs in our downstream analyses.

dim(reducedDim(mnn.out, "corrected"))

## [1] 2089   50

A reconstructed matrix in the assays() contains the corrected expression values for each gene in each cell, obtained by projecting the low-dimensional coordinates in corrected back into gene expression space. We do not recommend using this for anything other than visualization.

assay(mnn.out, "reconstructed")

## <10367 x 2089> matrix of class LowRankMatrix and type "double":
##                  9_AAACCTGCACTTCGAA-1 ... 10_TTTGTCAGTACAAGTA-1
## ENSG00000241860          5.174395e-05   .         -2.754680e-05
## ENSG00000237491         -5.010187e-04   .         -1.655025e-04
## ENSG00000228794          1.150769e-03   .         -7.288232e-04
## ENSG00000225880          1.722473e-04   .         -7.138536e-06
## ENSG00000188976          1.108459e-03   .         -1.822399e-04
##             ...                     .   .                     .
## ENSG00000198727         -1.388621e-02   .          1.888648e-02
## ENSG00000273748          1.139648e-03   .         -8.211990e-04
## ENSG00000271254          3.003433e-05   .          3.660541e-04
## ENSG00000276345          2.352449e-04   .         -1.526498e-04
## ENSG00000278633         -7.122577e-05   .         -8.191794e-05

The most relevant parameter for tuning fastMNN() is k, which specifies the number of nearest neighbors to consider when defining MNN pairs. This can be interpreted as the minimum anticipated frequency of any shared cell type or state in each batch. Increasing k will generally result in more aggressive merging as the algorithm is more generous in matching subpopulations across batches. It can occasionally be desirable to increase k if one clearly sees that the same cell types are not being adequately merged across batches.

0.7 View the corrected data

We cluster on the low-dimensional corrected coordinates to obtain a partitioning of the cells that serves as a proxy for the population structure. If the batch effect is successfully corrected, clusters corresponding to shared cell types or states should contain cells from multiple batches. We see that all clusters contain contributions from each batch after correction, consistent with our expectation that the two batches are replicates of each other.

mnn.snn.gr <- buildSNNGraph(mnn.out, use.dimred="corrected")
mnn.clusters <- igraph::cluster_walktrap(mnn.snn.gr)$membership
mnn.tab <- table(Cluster=mnn.clusters, Batch=mnn.out$batch)
mnn.tab

##        Batch
## Cluster   1   2
##      1  110 181
##      2   15  26
##      3  134  45
##      4   34 115
##      5   52  81
##      6   12   5
##      7   19  35
##      8  239 152
##      9   42  13
##      10  14  19
##      11   4   8
##      12  38  40
##      13  12   2
##      14  16  22
##      15 168 393
##      16  29  14

set.seed(0010101010)
mnn.out <- runTSNE(mnn.out, dimred="corrected")

mnn.out$batch <- factor(mnn.out$batch)
plotTSNE(mnn.out, colour_by="batch")

# make dataframes
clusterTab.uncorrected <- data.frame(clusters=uncorrected.clusters, batch=uncorrected$batch, source=uncorrected$batch) %>%
  group_by(clusters,batch) %>%
  summarise(cells = n())
clusterTab.mnn <- data.frame(clusters=mnn.clusters, batch=mnn.out$batch, source=mnn.out$batch) %>%
  group_by(clusters,batch) %>%
  summarise(cells = n())

# plot bars
bar.uncorrected <- ggplot(data=clusterTab.uncorrected, 
                          aes(x=clusters,y=cells, fill=batch)) +
    theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
    geom_col() +
    theme(legend.text = element_text(size = 7))
bar.mnn <- ggplot(data=clusterTab.mnn, 
              aes(x=clusters,y=cells, fill=batch)) +
    theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
    geom_col() +
    theme(legend.text = element_text(size = 7))

plot_grid(bar.uncorrected, bar.mnn, ncol=1)

0.8 The quickCorrect wrapper

The batchelor package has actually made this even easier for us by having a wrapper function called quickCorrect which does alot of these steps for us all at once. It performs all the steps we did to prepare the data and runs the correction. By default this is the fastMNN correction but you can change to use other correction algorithms by modifying the PARAM= argument.

Intersecting each batch to the universe of common features with intersectRows.
Applying normalization and log-transformation to the batches with multiBatchNorm.
Modelling the per-gene variance with modelGeneVar. If precomputed is supplied, the precomputed results are used instead.
Identifying highly variable genes with getTopHVGs. These genes will be used in the correction, though corrected values for all genes can be returned by setting correct.all=TRUE.
Applying the batch correction algorithm of choice with batchCorrect, as specified by PARAM.

quick.corrected <- quickCorrect(ori.sce.sample.1, ori.sce.sample.2)

quick.sce <- quick.corrected$corrected
quick.sce

## class: SingleCellExperiment 
## dim: 28377 2089 
## metadata(2): merge.info pca.info
## assays(1): reconstructed
## rownames(28377): ENSG00000238009 ENSG00000239945 ... ENSG00000278633
##   ENSG00000278817
## rowData names(1): rotation
## colnames(2089): 9_AAACCTGCACTTCGAA-1 9_AAACCTGCAGACGCAA-1 ...
##   10_TTTGTCAGTAAATGTG-1 10_TTTGTCAGTACAAGTA-1
## colData names(1): batch
## reducedDimNames(1): corrected
## mainExpName: NULL
## altExpNames(0):

set.seed(00101010)
quick.sce <- runTSNE(quick.sce, dimred="corrected")
quick.sce$batch <- factor(quick.sce$batch)
plotTSNE(quick.sce, colour_by="batch")

0.9 Multiple Batches

This example showed two samples or ‘batches’ but it will often be the case that we have many samples or batches. It is straightforward to simultaneously perform correction across >2 batches, either by having multiple levels in batch= or by providing more SingleCellExperiment objects (or even raw matrices of expression values). You can try this out with more of the samples from our dataset.

0.10 Retrieving the metadata

When running the correction it is important to note that the metadata stored in the colData of the input object will not be present in the output object. We have to recreate this if we want to use it later. We can easily do this as the output object will be ordered the same as the input.

meta <- colData(uncorrected) %>%
  as.data.frame() %>%
  DataFrame()
  
colData(quick.sce) <- meta

quick.snn.gr <- buildSNNGraph(quick.sce, use.dimred="corrected")
quick.clusters <- igraph::cluster_walktrap(quick.snn.gr)$membership
quick.sce$Cluster <- factor(quick.clusters)

0.11 Correction Diagnostics

0.11.1 Mixing Between Batches

Ideally, batch correction would remove the differences between batches while preserving the heterogeneity within batches. In the corrected data, cells of the same type should be intermingled and indistinguishable even if they come from different batches, while cells of different types should remain well-separated. Unfortunately, we rarely have prior knowledge of the underlying types of the cells, making it difficult to unambiguously determine whether differences between batches represent geniune biology or incomplete correction. Indeed, it could be said that all correction methods are at least somewhat incorrect, though that not preclude them from being useful.

quick.snn.gr <- buildSNNGraph(quick.sce, use.dimred="corrected")
quick.clusters <- igraph::cluster_walktrap(quick.snn.gr)$membership
quick.sce$Cluster <- factor(quick.clusters)
colLabels(quick.sce) <- factor(quick.clusters)

clusterTab.quick <- data.frame(clusters=quick.clusters, batch=quick.sce$batch)

tab <- table(Cluster=quick.clusters, Batch=quick.sce$batch)
tab

##        Batch
## Cluster   1   2
##      1  135 219
##      2   15  26
##      3  134  45
##      4  196 424
##      5   23  96
##      6   12   5
##      7   20  46
##      8  247 153
##      9   39  61
##      10   4   9
##      11  14  19
##      12  16  22
##      13  12   2
##      14  34  15
##      15  37   9

ClusterInfo.quick <- data.frame(Cluster=quick.clusters, Batch=quick.sce$batch) %>%
  group_by(Cluster,Batch) %>%
  summarise(cells = n())

## `summarise()` has grouped output by 'Cluster'. You can override using the
## `.groups` argument.

ggplot(data=ClusterInfo.quick, aes(x=Cluster,y=cells, fill=Batch)) +
    theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
    geom_col() +
    theme(legend.text = element_text(size = 7))

The OSCA book favors a qualitative approach to assessing the degree of mixing between clusters where they compute the variance in the log-normalized abundances across batches for each cluster. A highly variable cluster has large relative differences in cell abundance across batches; this may be an indicator for incomplete batch correction, e.g., if the same cell type in two batches was not combined into a single cluster in the corrected data. We can then focus our attention on these clusters to determine whether they might pose a problem for downstream interpretation. Of course, a large variance can also be caused by genuinely batch-specific populations, so some prior knowledge about the biological context is necessary to distinguish between these two possibilities.

var <- clusterAbundanceVar(tab)

# Also printing the percentage of cells in each cluster in each batch:
percent <- t(t(tab)/colSums(tab)) * 100 
df <- DataFrame(Batch=unclass(percent), var=var)
df[order(df$var, decreasing=TRUE),]

## DataFrame with 15 rows and 3 columns
##       Batch.1   Batch.2        var
##     <numeric> <numeric>  <numeric>
## 3    14.28571  3.909644   1.356372
## 15    3.94456  0.781929   1.117260
## 5     2.45203  8.340573   1.047467
## 14    3.62473  1.303215   0.519492
## 13    1.27932  0.173762   0.483715
## ...       ...       ...        ...
## 2    1.599147  2.258905 0.05484944
## 10   0.426439  0.781929 0.05440222
## 9    4.157783  5.299739 0.04225840
## 12   1.705757  1.911381 0.00575673
## 11   1.492537  1.650738 0.00407486

0.11.2 Preserving Biological Heterogeneity

Another useful diagnostic check is to compare the pre-correction clustering of each batch to the clustering of the same cells in the corrected data. Accurate data integration should preserve population structure within each batch as there is no batch effect to remove between cells in the same batch. This check complements the previously mentioned diagnostics that only focus on the removal of differences between batches. Specifically, it protects us against scenarios where the correction method simply aggregates all cells together, which would achieve perfect mixing but also discard the biological heterogeneity of interest.

table(colLabels(ori.sce.sample.1))

## 
##   1   2   3   4   5   6   7   8   9  10  11 
##  47 145 148  27  12  36  40 249 179  40  15

table(colLabels(ori.sce.sample.2))

## 
##   1   2   3   4   5   6 
## 163 231  80 497  33 147

Ideally, we should see a many-to-1 mapping where the post-correction clustering is nested inside the pre-correction clustering. This indicates that any within-batch structure was preserved after correction while acknowledging that greater resolution is possible with more cells. We quantify this mapping using the nestedClusters() function from the bluster package, which identifies the nesting of post-correction clusters within the pre-correction clusters. Well-nested clusters have high max values, indicating that most of their cells are derived from a single pre-correction cluster.

tab.sample.1 <- nestedClusters(ref=paste("before", colLabels(ori.sce.sample.1)),
    alt=paste("after", mnn.clusters[mnn.out$batch==1]))
tab.sample.1$alt.mapping

## DataFrame with 16 rows and 2 columns
##                max       which
##          <numeric> <character>
## after 1   0.872727    before 2
## after 10  1.000000    before 1
## after 11  0.500000    before 8
## after 12  1.000000    before 2
## after 13  1.000000    before 5
## ...            ...         ...
## after 5   0.769231    before 7
## after 6   1.000000    before 4
## after 7   0.684211    before 3
## after 8   0.974895    before 8
## after 9   0.952381   before 10

We can visualize this mapping for the samples. Ideally, each row should have a single dominant entry close to unity. Horizontal stripes are more concerning as these indicate that multiple pre-correction clusters were merged together, though the exact level of concern will depend on whether specific clusters of interest are gained or lost. In practice, more discrepancies can be expected even when the correction is perfect, due to the existence of closely related clusters that were arbitrarily separated in the within-batch clustering.

# For the first batch:
heat.sample.1 <- pheatmap(tab.sample.1$proportions, cluster_row=FALSE, cluster_col=FALSE,
                   main="Sample 1 comparison", silent=TRUE)

# For the second batch:
tab.sample.2 <- nestedClusters(ref=paste("before", colLabels(ori.sce.sample.2)),
                        alt=paste("after", mnn.clusters[mnn.out$batch==2]))
heat.sample.2 <- pheatmap(tab.sample.2$proportions, cluster_row=FALSE, cluster_col=FALSE,
                   main="Sample 2 comparison", silent=TRUE)

gridExtra::grid.arrange(heat.sample.1[[4]], heat.sample.2[[4]])

We use the adjusted Rand index to quantify the agreement between the clusterings before and after batch correction. Larger indices are more desirable as this indicates that within-batch heterogeneity is preserved, though this must be balanced against the ability of each method to actually perform batch correction.

ri.sample.1 <- pairwiseRand(mnn.clusters[mnn.out$batch==1], colLabels(ori.sce.sample.1), mode="index")
ri.sample.1

## [1] 0.8448763

ri.sample.2 <- pairwiseRand(mnn.clusters[mnn.out$batch==2], colLabels(ori.sce.sample.2), mode="index")
ri.sample.2

## [1] 0.7264657

We can also break down the ARI into per-cluster ratios for more detailed diagnostics (Figure 2.2). For example, we could see low ratios off the diagonal if distinct clusters in the within-batch clustering were incorrectly aggregated in the merged clustering. Conversely, we might see low ratios on the diagonal if the correction inflated or introduced spurious heterogeneity inside a within-batch cluster.

# For the first batch.
tab <- pairwiseRand(colLabels(ori.sce.sample.1), mnn.clusters[mnn.out$batch==1])
heat.1 <- pheatmap(tab, cluster_row=FALSE, cluster_col=FALSE,
    col=rev(viridis::magma(100)), main="Sample 1 probabilities", silent=TRUE)

# For the second batch.
tab <- pairwiseRand(colLabels(ori.sce.sample.2), mnn.clusters[mnn.out$batch==2])
heat.2 <- pheatmap(tab, cluster_row=FALSE, cluster_col=FALSE,
    col=rev(viridis::magma(100)), main="Sample 2 probabilities", silent=TRUE)

gridExtra::grid.arrange(heat.1[[4]], heat.2[[4]])

0.11.3 MNN specific test

For fastMNN(), one useful diagnostic is the proportion of variance within each batch that is lost during MNN correction. Specifically, this refers to the within-batch variance that is removed during orthogonalization with respect to the average correction vector at each merge step. This is returned via the lost.var field in the metadata of mnn.out, which contains a matrix of the variance lost in each batch (column) at each merge step (row).

metadata(quick.sce)$merge.info$lost.var

##             [,1]       [,2]
## [1,] 0.009522405 0.02842813

Large proportions of lost variance (>10%) suggest that correction is removing genuine biological heterogeneity. This would occur due to violations of the assumption of orthogonality between the batch effect and the biological subspace (Haghverdi et al. 2018). In this case, the proportion of lost variance is smaller, indicating that non-orthogonality is not to much of a major concern.

1 Exercise for this week

Remember to rsync your Course_Materials directory to get the latest files before you start.

Have a go at running the data integration and batch corrections on the data we have generated so far from the previous weeks. You can also find a file in RObjects called 07_semiprocessedCaronSamples.rds which contains the Caron data processed up until this point if you would just like to try out the functions in this section.

Import the data and have a look at it uncorrected.

Try the batch correction, there are many different parameters you can change here, which do you think are most suitable for this data? Compare it visually to the uncorrected data. (HINT: When you use a single sce object with quickCorrect you must specify what you want to use as a batch using the batch = argument. See the Help page for more details.)

Explore the different diagnostics we have demonstrated or have a look in the OSCA book which details more.

1.1 Session information

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] cowplot_1.1.1               BiocSingular_1.10.0        
##  [3] pheatmap_1.0.12             forcats_0.5.1              
##  [5] stringr_1.4.0               dplyr_1.0.8                
##  [7] purrr_0.3.4                 readr_2.1.2                
##  [9] tidyr_1.2.0                 tibble_3.1.6               
## [11] tidyverse_1.3.1             bluster_1.4.0              
## [13] batchelor_1.10.0            scran_1.22.1               
## [15] scater_1.22.0               ggplot2_3.3.5              
## [17] scuttle_1.4.0               SingleCellExperiment_1.16.0
## [19] SummarizedExperiment_1.24.0 Biobase_2.54.0             
## [21] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
## [23] IRanges_2.28.0              S4Vectors_0.32.3           
## [25] BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
## [27] matrixStats_0.61.0         
## 
## loaded via a namespace (and not attached):
##  [1] Rtsne_0.15                ggbeeswarm_0.6.0         
##  [3] colorspace_2.0-2          ellipsis_0.3.2           
##  [5] XVector_0.34.0            BiocNeighbors_1.12.0     
##  [7] fs_1.5.2                  rstudioapi_0.13          
##  [9] farver_2.1.0              ggrepel_0.9.1            
## [11] fansi_1.0.2               lubridate_1.8.0          
## [13] xml2_1.3.3                sparseMatrixStats_1.6.0  
## [15] knitr_1.37                jsonlite_1.7.3           
## [17] ResidualMatrix_1.4.0      broom_0.7.12             
## [19] cluster_2.1.2             dbplyr_2.1.1             
## [21] compiler_4.1.2            httr_1.4.2               
## [23] dqrng_0.3.0               backports_1.4.1          
## [25] assertthat_0.2.1          Matrix_1.4-0             
## [27] fastmap_1.1.0             limma_3.50.0             
## [29] cli_3.1.1                 htmltools_0.5.2          
## [31] tools_4.1.2               rsvd_1.0.5               
## [33] igraph_1.2.11             gtable_0.3.0             
## [35] glue_1.6.1                GenomeInfoDbData_1.2.7   
## [37] Rcpp_1.0.8                cellranger_1.1.0         
## [39] jquerylib_0.1.4           vctrs_0.3.8              
## [41] DelayedMatrixStats_1.16.0 xfun_0.29                
## [43] beachmat_2.10.0           rvest_1.0.2              
## [45] lifecycle_1.0.1           irlba_2.3.5              
## [47] statmod_1.4.36            edgeR_3.36.0             
## [49] zlibbioc_1.40.0           scales_1.1.1             
## [51] hms_1.1.1                 parallel_4.1.2           
## [53] RColorBrewer_1.1-2        yaml_2.2.2               
## [55] gridExtra_2.3             sass_0.4.0               
## [57] stringi_1.7.6             highr_0.9                
## [59] ScaledMatrix_1.2.0        BiocParallel_1.28.3      
## [61] rlang_1.0.1               pkgconfig_2.0.3          
## [63] bitops_1.0-7              evaluate_0.14            
## [65] lattice_0.20-45           labeling_0.4.2           
## [67] tidyselect_1.1.1          magrittr_2.0.2           
## [69] R6_2.5.1                  generics_0.1.2           
## [71] metapod_1.2.0             DelayedArray_0.20.0      
## [73] DBI_1.1.2                 pillar_1.7.0             
## [75] haven_2.4.3               withr_2.4.3              
## [77] RCurl_1.98-1.6            modelr_0.1.8             
## [79] crayon_1.4.2              utf8_1.2.2               
## [81] tzdb_0.2.0                rmarkdown_2.11           
## [83] viridis_0.6.2             locfit_1.5-9.4           
## [85] grid_4.1.2                readxl_1.3.1             
## [87] reprex_2.0.1              digest_0.6.29            
## [89] munsell_0.5.0             beeswarm_0.4.0           
## [91] viridisLite_0.4.0         vipor_0.4.5              
## [93] bslib_0.3.1

CRUK CI Bitesize Introduction to Single Cell RNASeq Analysis

Dataset Integration and Batch Correction

Abbi Edwards