7 Dataset Integration

Often, single-cell experiments are done by processing samples in multiple batches. This may be related to logistical constraints such as the inability to run all experimental conditions in parallel, or more extreme cases where samples are processed in different laboratories, by different people and even sequenced with different technologies (e.g. samples from human patients collected in different hospitals). These differences across sample batches very often result in global gene expression differences across those batches. Since batch-to-batch transcriptomic differences are likely unrelated to biological differences, we would ideally want “remove” them before drawing inferences about our cell populations.

Biases due to batch effects are not new to single-cell RNA-seq. Indeed, several methods have been previously developed for standard bulk RNA-seq approaches. Some of those approaches rely on linear models that “regress out” the batch effect, assuming that the cell composition is similar across batches. However, in single-cell RNA-seq we may very often expect changes in cell compositions across batches (e.g. in our course data we have data from cancer samples such as ETV6-RUNX as well as a reference panel of healthy blood cells, PBMMCs). Therefore, methods are necessary that can deal with with heterogeneity across batches.

In recent years, several methods have been developed to deal with this challenge (too many to list here!). Some of the most popular ones include the Mutual Nearest Neighbours (MNN) algorithm, a Principal Components Analysis-based clustering method implemented in the package HARMONY and a method that combines Canonical Correlation Analysis (CCA) and MNN implemented in the package Seurat 5. These methods have been shown to perform well in several benchmark studies (e.g. Luecken et al 2022 and Tran et al 2020), although one important message from these studies is that no single method is universally the best in all situations. For example, some methods may be better at preserving small populations of cells as separate groups in the integrated data at the cost of poorer overall integration, while others may be better at removing batch effects at the cost of also removing some biological signal.

Seurat makes it easy for us to use different batch correction methods but we must be able to assess when to use batch correction and how to adjust the parameters for our dataset.

There are several steps to deciding on our batch correction. First we must examine our uncorrected data and decide if we think a correction is needed. Then we run a correction method and visualise the results. Finally, we must assess the results and decide if we are happy with the correction or if we need to adjust the parameters or try a different method.

library(Seurat)
library(sctransform)
library(glmGamPoi)
library(tidyverse)
library(patchwork)

In the normalisation section this morning we used the SCTransform function on a single sample. We have 11 samples in this dataset from 4 different sample groups. Depending on the variance of our data it may be more appropriate to normalise the data across all samples together, or to normalise each sample or sample group separately and then integrate the data.

7.1 Examining our uncorrected data

Given what we saw in Day 1 with large differences in distributions between batches it would be a sensible assumption that the sample groups or even the samples themselves should have normalisation applied separately regardless of whether we decide to batch correct.

7.1.1 Re-process the data by sample group

We will revert back to the filtered 500 cell dataset that we started the day with and re-process the data by sample group. Seurat allows us to do this still all within one object by using layers. We will use the SplitObject function to split our data by sample group and then re-process each sample group separately. The parameter f specifies the variable that we want to split by, in this case SampleGroup. This will create a new layer in our Seurat object for each sample group. We can then run the normalisation and dimensionality reduction steps separately for each sample group.

# Load the data
seurat_object <- readRDS("RObjects/Filtered.500.rds")
# Split the data by sample group
seurat_object[["RNA"]] <- split(seurat_object[["RNA"]],
                                f = seurat_object$SampleGroup)
seurat_object

## An object of class Seurat 
## 29786 features across 5500 samples within 1 assay 
## Active assay: RNA (29786 features, 0 variable features)
##  4 layers present: counts.ETV6RUNX1, counts.HHD, counts.PRET, counts.PBMMC

The RNA assay now has 4 layers, one for each of our sample groups.

# Re-process, Seurat will treat each sample group separately
seurat_object <- SCTransform(
  seurat_object,
  assay = "RNA",
  vars.to.regress = "percent.mt",
  verbose = FALSE
)
seurat_object <- RunPCA(seurat_object, 
                         features = VariableFeatures(object = seurat_object))

## Warning in PrepDR(object = object, features = features, verbose = verbose): The
## following 141 features requested have not been scaled (running reduction
## without them): GYPA, FCN1, ALAS2, GYPB, CXCL8, HEMGN, LINC02945, SERPINA1,
## EPB42, KRT1, SELENBP1, ENSG00000189229, LINC02227, KLF1, CYGB, LINC01013,
## RPS4Y2, SMIM1, HBQ1, PELATON, APOC1, IGHA2, TSPO2, BAALC, IGHGP, TCL6, SH2D4B,
## MYLK, LINC02202, NETO1, KANK2, ENSG00000287778, S100A12, CCL17, VCAN, IFIT1B,
## HBG1, TMCC2, CSTA, CCL3L3, ENSG00000227706, IGLC7, MPO, PRTN3, TCL1B, TMPRSS15,
## ENSG00000287172, CLEC7A, ENSG00000224610, EREG, HS3ST4, ENSG00000288087, AZU1,
## CTSG, ELANE, CLEC14A, LINC00922, NHERF2, RETN, ENSG00000286581, TNFRSF17, CCN2,
## ARG1, HTR1F, BEST3, PPBP, MSR1, ENSG00000254951, RNASE2, IRF4, S100B, MEIS1,
## POU4F1, LRRC26, ENSG00000260271, CDA, ENSG00000248656, LINC00114, LINC01922,
## ADGRF1, FHIP1A, TFPI, IGLC5, SDC2, FPR1, MPPED2, RBP7, ENSG00000289949, ACY3,
## KEL, EPHA7, ENSG00000288691, PDZK1IP1, SLC40A1, LOXHD1, KCNH2, APOE,
## ENSG00000291034, PLCB4, CLEC4OP, ENSG00000253214, NRP1, MS4A3, FCAR, MYO18B,
## ENSG00000286546, LINC02363, MGST1, CDHR3, ENSG00000257883, IGHG4,
## ENSG00000253673, MYO3A, GNG11, CRMP1, NT5E, ZNF804A, BEX1, DEPP1, OVCH2, AJAP1,
## ABCB4, NKAIN4, TMED6, LINC00692, ENSG00000233064, ENSG00000258631, LARGE-AS1,
## ENSG00000251598, LGR5, SCN9A, SOCS2-AS1, TRIM10, ENSG00000232464,
## ENSG00000291325, PCLO, PEG10, EBF4, TMEM220-AS1, LY6H, ENSG00000253983

## PC_ 1 
## Positive:  BACH2, IGHM, PCDH9, AFF3, JUNB, CD74, PDE4D, B2M, CD79B, BTG1 
##     CD69, LINC-PINT, RPS27, CXCR4, IGKC, FOS, ACSM3, TCL1A, TSC22D3, HLA-DRA 
##     CD52, CD37, JUN, IFITM1, ENSG00000287092, LINC01619, SSBP2, SKAP1, TMSB10, IGLC3 
## Negative:  HBA1, HBB, HBA2, HBD, AHSP, HBM, CA1, SLC25A37, SLC4A1, BLVRB 
##     IFI27, SNCA, PRDX2, CA2, HMBS, RHAG, FECH, ANK1, SOX6, SLC25A39 
##     CD36, BPGM, GMPR, RHCE, STRADB, UROD, FAM210B, HMGB2, SLC2A1, GYPE 
## PC_ 2 
## Positive:  IFITM1, IL32, SKAP1, JUNB, SRGN, S100A4, TRAC, S100A6, TC2N, CD3D 
##     CD247, CD3E, HBB, B2M, S100A10, CD7, HBA2, CCL5, HBA1, ANXA1 
##     TRBC2, ZNF331, RPS27, LINC-PINT, PCED1B-AS1, CD44, TNIK, TYROBP, SCML4, TRBC1 
## Negative:  TUBA1B, HMGB2, H4C3, STMN1, TUBB, IGLL1, TOP2A, H2AZ1, HMGN2, HMGB1 
##     AFF3, NUSAP1, TCL1A, MME, SSBP2, PCLAF, EBF1, TYMS, BACH2, CD24 
##     PTMA, UBE2C, ARPP21, RRM2, PCDH9, UHRF1, CD79B, MKI67, DNTT, VPREB1 
## PC_ 3 
## Positive:  LYZ, S100A9, TYROBP, S100A8, LGALS1, S100A4, CST3, S100A6, IFI30, FCER1G 
##     ENSG00000257764, S100A11, FTL, CCDC200, LST1, MNDA, CTSS, NAMPT, CD74, CFD 
##     AIF1, SRGN, MS4A6A, TYMP, SAT1, RBM47, FOS, FTH1, HLA-DRA, HLA-DRB1 
## Negative:  IL32, SKAP1, BCL11B, CAMK4, CD3D, TRBC1, TRAC, CD247, THEMIS, IFITM1 
##     CD3E, TC2N, TRBC2, CD7, NELL2, PITPNC1, PCED1B-AS1, PRKCH, LCK, TNIK 
##     SCML4, ITK, CD3G, CD6, CD96, RPS27, LEPROTL1, INPP4B, IL7R, TCF7 
## PC_ 4 
## Positive:  HBB, HBA2, HBA1, HBD, BACH2, HBM, AHSP, PCDH9, CA1, SLC25A37 
##     AFF3, PDE4D, SLC4A1, IGKC, IGHM, ENSG00000287092, SNCA, IFI27, CD79B, ACSM3 
##     IGLC3, BLVRB, BPGM, IGLC2, BANK1, BNIP3L, LINC01374, HMBS, CD74, NIBAN3 
## Negative:  TUBA1B, H4C3, HMGB2, TUBB, TOP2A, H2AZ1, HMGN2, STMN1, NUSAP1, UBE2C 
##     PCLAF, TYMS, RRM2, MKI67, BIRC5, CENPF, HMGB1, SMC4, TUBB4B, CKS1B 
##     PTTG1, VIM, AURKB, GAPDH, IL32, ACTB, TPX2, CKS2, IFITM1, S100A4 
## PC_ 5 
## Positive:  IGHM, HMGB2, IGKC, MS4A1, H4C3, TOP2A, BACH2, TUBA1B, IGLC3, IGLC2 
##     LYN, NUSAP1, ANGPTL1, CD79B, UBE2C, JUNB, IGHD, BANK1, RALGPS2, CENPF 
##     PTTG1, MTSS1, CD74, AFF3, MKI67, CD37, CRIP1, CA2, CD52, FCRL1 
## Negative:  GAPDH, CALN1, DNTT, ERG, CCDC26, EEF1A1, RPS2, CDK6, RPS12, AIF1 
##     HSP90AB1, ENSG00000287092, CD99, NME2, ENSG00000271955, GNAQ, CACNB2, LYZ, RUNX1, GAB2 
##     MIF, RPL10, GPX1, FRMD4B, RPLP1, RPS15A, RPLP0, CASC15, HSPB1, FABP5

seurat_object <- RunUMAP(seurat_object,
                         reduction = "pca",
                         dims = 1:15)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 19:11:28 UMAP embedding parameters a = 0.9922 b = 1.112

## 19:11:28 Read 5500 rows and found 15 numeric columns

## 19:11:28 Using Annoy for neighbor search, n_neighbors = 30

## 19:11:28 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 19:11:29 Writing NN index file to temp file /var/folders/9s/hsfl94x96vn2g543r67q4vpn6rmr26/T//RtmpUfmyXk/file1654f362e0285
## 19:11:29 Searching Annoy index using 1 thread, search_k = 3000
## 19:11:30 Annoy recall = 100%
## 19:11:31 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 19:11:32 Initializing from normalized Laplacian + noise (using RSpectra)
## 19:11:33 Commencing optimization for 500 epochs, with 215632 positive edges
## 19:11:33 Using rng type: pcg
## 19:11:40 Optimization finished

seurat_object

## An object of class Seurat 
## 50798 features across 5500 samples within 2 assays 
## Active assay: SCT (21012 features, 3000 variable features)
##  3 layers present: counts, data, scale.data
##  1 other assay present: RNA
##  2 dimensional reductions calculated: pca, umap

We can now visualise the uncorrected data with a hopefully more appropriate normalisation strategy.

# Visualise the uncorrected data
uncorrected_plot <- DimPlot(seurat_object,
                            reduction = "umap",
                            group.by = "SampleName") + 
  ggtitle("Uncorrected data")

In order to examine our uncorrected data further we will run a quick clustering. clustering will be covered in much greater detail on day 3. This shows us how the cells group together in the uncorrected data. If we see that some of the clusters are mostly made up of cells from a single sample, this is an indication that batch effects are present and that we may want to run a batch correction method.

# Run clustering
seurat_object <- FindNeighbors(seurat_object, reduction = "pca",
                                dims = 1:15)

## Computing nearest neighbor graph

## Computing SNN

seurat_object <- FindClusters(seurat_object, 
                              cluster.name = "uncorrected_clusters")

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 5500
## Number of edges: 173164
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9174
## Number of communities: 20
## Elapsed time: 0 seconds

uncorrected_clusters_plot <- DimPlot(seurat_object, reduction = "umap", 
                                     group.by ="uncorrected_clusters") + ggtitle("Uncorrected data clusters")
 
uncorrected_plot + uncorrected_clusters_plot

It can also be useful to view this as a table to get a better idea of the data.

# Make a table of the clusters and samples
table(seurat_object$uncorrected_clusters, seurat_object$orig.ident)

##     
##      ETV6RUNX1-1 ETV6RUNX1-2 ETV6RUNX1-3 ETV6RUNX1-4 HHD-1 HHD-2 PBMMC-1
##   0            0         394         185           4    30    16       7
##   1           55          60          17          16    37    91      80
##   2          433           0           0           0     3     1       3
##   3            1           9           7           0    53   353       2
##   4            0           0           0           0     4     0       0
##   5            0           0           6           1     7     1      82
##   6            2           3           1           1   287     6      15
##   7            0           0           1           0     0     1       1
##   8            0           2           5         283     1     2       0
##   9            1          17         193          26    36    13       0
##   10           0           1           0           0     0     2     151
##   11           0           1           0           5     8     0      69
##   12           1           0          22          48     0     3      15
##   13           2           6          11           7     8     9      29
##   14           1           1           2           0     3     0      20
##   15           0           0           5          93     0     0       0
##   16           4           6          45          16    23     2       0
##   17           0           0           0           0     0     0       0
##   18           0           0           0           0     0     0       0
##   19           0           0           0           0     0     0      26
##     
##      PBMMC-2 PBMMC-3 PRET-1 PRET-2
##   0        2      24      2      6
##   1       15      53     97     14
##   2        2       0      0      1
##   3        5       4      1      2
##   4        0       0    392      0
##   5      129     128      0      3
##   6       16      14      0      4
##   7        1       1      0    295
##   8        0       2      0      0
##   9        0       0      0      0
##   10      21      64      0      0
##   11      37      57      2     25
##   12      87      14      0     11
##   13      36      34      2     36
##   14      50      79      0     23
##   15       0       0      2      3
##   16       0       0      0      0
##   17      87       4      0      0
##   18       0       0      2     77
##   19      12      22      0      0

It is worth noting that, although this suggests a batch effect, there might be cases where there are genuine differences in cell populations across batches (e.g. if the different batches represent samples from different tissues).

Data integration algorithms designed for single-cell RNA-seq do allow for unique cell types existing across batches, however, it’s always good to check the results of the integration using independent information (e.g. prior information about genes that are specific to particular cell types).

According to the biology of our experiment, we expect the samples to have different some cell type compositions but the majority of the data should have cell types in common.

7.2 Integrate Data

Seurat allows us to integrate our data using its IntegrateLayers function. This function takes the layers of the RNA assay that we split into and integrates them using a specified batch correction method. There are 4 methods wrapped in this function:

Anchor-based CCA integration (method=CCAIntegration)
Anchor-based RPCA integration (method=RPCAIntegration)
Harmony (method=HarmonyIntegration)
scVI (method=scVIIntegration)

Another popular method Mutual Nearest Neighbours (MNN) is not wrapped in this function but can be run using the RunFastMNN function from the SeuratWrappers package.

7.2.1 Correct the data - Harmony

Harmony, Korsunsky et al. 2019 is a method that uses an iterative clustering and correction approach to remove batch effects. It is based on the idea of “harmonizing” the data by finding clusters of cells that are similar across batches and then adjusting the data to make those clusters more similar. It does not edit the raw expresion values but creates a corrected embedding space that can be used for downstream analysis.

Harmony

PCA embeds cells into a space with reduced dimensionality. Harmony accepts the cell coordinates in this reduced space and runs an iterative algorithm to adjust for dataset specific effects.

A, Harmony uses fuzzy clustering to assign each cell to multiple clusters, while a penalty term ensures that the diversity of datasets within each cluster is maximized.
B, Harmony calculates a global centroid for each cluster, as well as dataset-specific centroids for each cluster.
C, Within each cluster, Harmony calculates a correction factor for each dataset based on the centroids.
D, Finally, Harmony corrects each cell with a cell-specific factor: a linear combination of dataset correction factors weighted by the cell’s soft cluster assignments made in step a. Harmony repeats steps a to d until convergence. The dependence between cluster assignment and dataset diminishes with each round. Datasets are represented with colors, cell types with different shapes.

It does not assume that all cell types are shared across batches and can therefore deal with unique cell types in different batches.

There are some parameters that can be adjusted when running Harmony. These aren’t explicity part of the the IntegrateLayers function but can be passed to the HarmonyIntegration method underneath. For example, the theta parameter (diversity clustering penalty parameter) controls the diversity of datasets within clusters, with higher values leading to more diverse clusters. The lambda parameter (Ridge Regression penalty parameter) controls the strength of the correction, with higher values leading to stronger corrections. The sigma parameter (width of soft kmeans clusters) controls the bandwidth of the Gaussian kernel used in the correction, with higher values leading to smoother corrections. In practice we mostly adjust theta. The default is 2.

# Run Harmony integration
harmony_object <- IntegrateLayers(object = seurat_object, 
                                  method = HarmonyIntegration,
                                  orig.reduction = "pca", 
                                  new.reduction = "harmony")

## The `features` argument is ignored by `HarmonyIntegration`.
## Transposing data matrix
## 
## Using automatic lambda estimation
## 
## Initializing state using k-means centroids initialization
## 
## Harmony 1/10
## 
## Harmony 2/10
## 
## Harmony 3/10
## 
## Harmony 4/10
## 
## Harmony 5/10
## 
## Harmony 6/10
## 
## Harmony converged after 6 iterations
## 
## This message is displayed once per session.

We can now visualise the results of the Harmony integration. We will repeat our clustering.

# Run UMAP
harmony_object <- RunUMAP(harmony_object, 
                          reduction = "harmony", 
                          dims = 1:15)

## 19:11:55 UMAP embedding parameters a = 0.9922 b = 1.112

## 19:11:55 Read 5500 rows and found 15 numeric columns

## 19:11:55 Using Annoy for neighbor search, n_neighbors = 30

## 19:11:55 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 19:11:55 Writing NN index file to temp file /var/folders/9s/hsfl94x96vn2g543r67q4vpn6rmr26/T//RtmpUfmyXk/file1654f3e05a46e
## 19:11:55 Searching Annoy index using 1 thread, search_k = 3000
## 19:11:57 Annoy recall = 100%
## 19:11:58 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 19:11:59 Initializing from normalized Laplacian + noise (using RSpectra)
## 19:11:59 Commencing optimization for 500 epochs, with 224864 positive edges
## 19:11:59 Using rng type: pcg
## 19:12:07 Optimization finished

# Run clustering
harmony_object <- FindNeighbors(harmony_object, reduction = "harmony",
                                dims = 1:15)

## Computing nearest neighbor graph
## Computing SNN

harmony_object <- FindClusters(harmony_object, 
                              cluster.name = "harmony_clusters")

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 5500
## Number of edges: 191703
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8768
## Number of communities: 16
## Elapsed time: 0 seconds

# make the plots
harmony_plot <- DimPlot(harmony_object, 
                        reduction = "umap", 
                        group.by = "SampleName") + 
  ggtitle("Harmony integrated data")
harmony_clusters_plot <- DimPlot(harmony_object, reduction = "umap", 
                                     group.by ="harmony_clusters") + ggtitle("Harmony data clusters")
 
harmony_plot + harmony_clusters_plot

We can see that the cells are now grouped together more by cluster rather than by sample, suggesting that the batch effect has been removed.

7.2.2 Exercise

We have just carried out the integration using harmony with data processed separately at the sample group level. Try running the integration with data processed at the sample level and see if you get a different result.

Starting with our filtered data, split the object by SampleName.
Reprocess the data with SCTransform and run dimensionality reduction on the new layers.
Visualise your uncorrected data.
Run Harmony integration using the IntegrateLayers function.
Visualise the results and compare them with the sample group level integration.

Answer

# Read in our filtered data
seurat_object_sample <- readRDS("RObjects/Filtered.500.rds")

# Split the data by sample name
seurat_object_sample[["RNA"]] <- split(seurat_object_sample[["RNA"]],
                                f = seurat_object_sample$SampleName)
# Re-process, Seurat will treat each sample separately
seurat_object_sample <- SCTransform(
  seurat_object_sample,
  assay = "RNA",
  vars.to.regress = "percent.mt",
  verbose = FALSE
)
seurat_object_sample <- RunPCA(seurat_object_sample, 
                         features = VariableFeatures(object = seurat_object_sample))

## Warning in PrepDR(object = object, features = features, verbose = verbose): The
## following 586 features requested have not been scaled (running reduction
## without them): LYZ, S100A8, ACSM3, CCL5, IGKC, IGLC2, HBD, AHSP, CA1, TYROBP,
## HBM, CCL3, GYPA, GYPB, CCL4, S100A9, KLRB1, TRBC1, ALAS2, CA2, CST3, GNLY,
## SLC4A1, GZMA, FCER1G, HEMGN, MS4A1, ENSG00000227706, JCHAIN, ENSG00000287092,
## ANXA1, LINC02446, CCDC26, LINC03000, GZMB, MNDA, ENSG00000231482, CFD, S100A11,
## TCL1A, CST7, GZMK, ENSG00000229425, SNCA, TMPRSS15, CD3D, IFNG-AS1, CD36, TRDC,
## KLRD1, CD9, RGS1, NELL2, EBF1, TC2N, IGHD, FCRL1, NLGN4X, MS4A6A,
## ENSG00000224610, COL19A1, CLEC7A, SERPINA1, KLHL14, ADARB2, ENSG00000257275,
## SOX5, HBG1, NRG3, MYL4, RHCE, SYT1, ID1, CD3E, HLA-DQA1, LINC01374, RHAG,
## S100A10, ENSG00000287172, LINC02945, ENSG00000288087, NCAM2, CD8B, ZNF385D,
## ANK3, CD7, PLXDC2, LINC01505, TCL1B, RETN, HLA-DPA1, HLA-DRA, XKR4, CCN2, HBG2,
## EPB42, ENSG00000271774, FCGR3A, SOX6, TAFA1, LINC01013, NPY, IGHG3, TNIK, ICOS,
## THEMIS, SEC11C, SAMD3, PPP1R14A, TENM4, KCNQ5, ATP8B4, NCALD, LINC00692,
## ENSG00000287784, HOPX, ADAM28, IGHA1, SCML4, BANK1, ANK1, RPS4Y2, CYGB, TGFBR3,
## GPR183, LINC01871, VPREB3, VPREB1, SCN3A, RUNX3, DACH1, HDAC9, KLRK1, SELENBP1,
## MAFB, KLF1, C16orf74, AREG, GPM6A, CD24, ENSG00000286863, ADAM7-AS1, GYPE,
## PRSS57, PSD3, EXT1, LINC02227, IGLL5, CCR7, EMP1, MIR4435-2HG, ENSG00000257060,
## LINC01811, LIX1-AS1, MDK, CD3G, DOCK4, ENSG00000285534, HLA-DPB1, SLCO3A1,
## STAT4, SHROOM3, PRF1, IFITM3, GPRIN3, IFI44L, FAM30A, PID1, CD8A, C1orf162,
## PTPRS, ENSG00000189229, C1QTNF4, LRRC26, TRGC2, DPEP1, IGHG1, NKAIN2, ITK,
## LRMDA, MS4A7, CD6, PDE4B, COBLL1, APBA2, IL1B, CEBPD, BNC2, HLA-DQB1, KIF14,
## TMTC2, PCAT1, RAB31, STK32B, ENSG00000253980, CCSER1, GRIP1, BASP1, HCK,
## MIR924HG, LINC00861, CSF3R, AHR, SMIM1, ENSG00000291325, SPIB, EPHA4, POU4F1,
## RGS6, MEF2C, PPM1L, CHRM3, PLCL1, TMEM176B, BLNK, RHD, CYBB, NFIA, GZMH,
## GP6-AS1, STX11, BEST3, CYRIA, FCER2, RASGRF2, C1QTNF7-AS1, PELATON, ARL4C,
## NHERF2, BIRC3, MSR1, PTPRK, LINC01922, DLGAP2, ENSG00000254951, RHEX, CYTL1,
## MPZL3, NAPSB, FGR, PTPRM, LINC02397, TSPO2, ADAMTS6, IL21R, ENSG00000290067,
## FRY, ZBTB16, GNG7, CREB5, GRAP2, SPON1, BTLA, BAALC, TRAT1, TMEM117, ARL4A,
## HELB, PCLO, H2BC5, ADARB1, AK5, STAP1, ENSG00000267723, CSF2RA, MYLK, ARHGEF4,
## LINC00623, PKIA, ENSG00000273118, PLD4, RNF144B, SYTL2, SOCS2-AS1, TRABD2A,
## SDC2, CR1L, LINC02363, TNFRSF13C, STEAP1B, GZMM, HLA-DRB1, LILRB1,
## ENSG00000258216, SOX2-OT, ENSG00000289949, DPYD-AS1, LINC02964, LINC01550,
## DTX1, COL24A1, CD2, LINC01163, CRIM1, ENSG00000287292, LHFPL6, ENSG00000260271,
## ENSG00000291034, CDH4, LINC01781, KMO, DGKH, IL10RA, UST, GIMAP4, BLK, SNED1,
## TCL6, PDE7B, SLC4A8, KANK2, PLEK, LY86, PDLIM1, MARCKS, MEF2C-AS1, RNASE6,
## JAML, CD5, EGLN3, MYO10, LINC02202, LCN6, MT1X, SNTB1, CLEC4E, TOX2, NCF1,
## P2RY10, NAALADL2, WDR49, ATRNL1, CCND2, SLFN5, APBB2, PIK3AP1, EGFL7, IRAK3,
## ENSG00000254777, LINC02694, TNFSF8, MAL, ENSG00000253673, LINC01036, C5AR1,
## CD86, ARHGAP42, BICDL1, NFIA-AS2, SPI1, FCRL5, SPTA1, FGD2, GADD45G, SH2D4B,
## PCSK5, CYTIP, LINC01891, TTC39C, LINC01588, RGL1, MPP7, HLA-DMB, FMNL2, CIITA,
## CCDC30, NETO1, SLC25A43, KLF3, PLCH1, CHST15, DEPP1, FAM135A, PTK2, MINPP1,
## SERPINB9, ALCAM, LGALS3, ENSG00000229418, LRRK1, ALOX5, SPON2, CTSC, MTMR8,
## ST3GAL6, LINC02273, ARHGEF12, SNX10, SOCS2, ITGB7, CD22, SCML1, SKI, IGFBP7,
## KCNQ1, LILRB2, BIN2, LARGE-AS1, DHRS3, ZNF608, TLCD4, ENSG00000289298, RAG1,
## WDFY3, TRPS1, ENSG00000255240, RFESD, OVCH2, EPB41L2, CACNA1C, CARD16, GPR132,
## CD28, MDFIC, C12orf42, ENSG00000290853, CD4, PTGER4, DIAPH2-AS1, GATA3,
## CFAP20DC-DT, STK33, FBXO32, TSBP1-AS1, TMEM65, CRMP1, TSPAN32, AGPAT4, ZBP1,
## CARMIL1, ENSG00000197585, EPB41L5, COL5A1, KIF13A, NT5E, TSPAN7, TANC2, ADCY9,
## CD40, TGFBR1, FAM169A, CYSLTR1, TMEM220-AS1, LAMC1, XKR6, ITGB2-AS1,
## ENSG00000290906, TRIM2, KLF3-AS1, ICOSLG, ENSG00000290905, LATS2, PRAM1,
## ENSG00000288891, MTCL1, WDR19, S100A12, CCL17, ENSG00000257764, VCAN, FCN1,
## CXCL8, IFI27, MPO, CSTA, PRTN3, TMCC2, CCL4L2, ENSG00000253693, ELANE, AZU1,
## SLIT2, ENSG00000270240, CD14, PJVK, CCL3L3, PTPRG, ROR1, SPINK2, FGF13, IRF4,
## CA3, MDGA2, RNASE2, HS3ST4, CLIC3, MTUS2, APOC1, CECR2, HTR1F, CEMIP, LINGO2,
## ENSG00000233928, PPP2R2B, TFPI, FCRLA, S100B, UNC79, LINC00114, MPEG1,
## SLC25A21, FGL2, LINC02384, EPHA7, WARS1, PLBD1, FHIP1A, FPR1, XG, MIR3681HG,
## RASD1, RORB, EDA, CYTOR, BMPR1B, MIR4432HG, NCR3, ADGRE2, KRT1, KCNH8,
## TNFSF13B, ENSG00000286481, CMTM2, AFF2, GMPR, CLIC5, FGFBP2, COL23A1, CFAP73,
## PBX1, ITPKB-AS1, GPR65, MAN1C1, TNF, CHODL, FXYD2, PLPP3, ENSG00000286742,
## SORBS2, PTCRA, KYNU, CASP1, COBL, DPP10, DIRC3-AS1, ENSG00000258631, PDGFD,
## DMTN, NETO2, PLCB4, NRN1, MYO1E, TUSC3, SLC11A1, TCFL5, MGST1, ENSG00000253214,
## ENAH, GNG11, GCNT2, ADGRG5, ENSG00000285783, HES6, RIN2, LRRC3B,
## ENSG00000228401, ENSG00000187229, TAFA5, ENSG00000285679, PPM1E, CD34, NCEH1,
## UPP1, C1orf21, IGF2BP2, LDLRAP1, SCN9A, ENSG00000289404, ENSG00000287125, CR1,
## PCBP3, SIT1, ENSG00000286393, MID1, CALD1, UGT3A2

## PC_ 1 
## Positive:  BACH2, PDE4D, LINC-PINT, PCDH9, AFF3, JUNB, CD74, CD37, B2M, MALAT1 
##     CD52, PTPRC, SKAP1, BTG1, FOXP1, PRKCB, MAML2, UTRN, ZBTB20, IGHM 
##     ARHGEF18, RIPOR2, IFITM1, CD79B, CD247, TXNIP, CD69, LTB, IL32, FOXO1 
## Negative:  TUBA1B, HMGB2, H4C3, TUBB, TOP2A, H2AZ1, PCLAF, STMN1, NUSAP1, HMGN2 
##     RRM2, TYMS, UBE2C, PTTG1, MKI67, HMGB1, BIRC5, CENPF, HBA1, TUBB4B 
##     CDK1, PCNA, CKS2, SMC4, CKS1B, DUT, TK1, CCNB2, PRDX2, DIAPH3 
## PC_ 2 
## Positive:  HBA1, IFITM1, S100A4, IL32, S100A6, SRGN, SKAP1, RPS27, FTL, TRAC 
##     RPS18, RPL21, HBB, RPL13, HBA2, CD247, RPS29, BLVRB, SLC25A37, JUNB 
##     RPS12, RPS2, RPL34, RPS14, RPL32, RPL12, RPL41, RPS6, TRBC2, RPLP1 
## Negative:  BACH2, HMGB2, AFF3, H4C3, MME, CD79B, STMN1, TUBA1B, SSBP2, IGHM 
##     PCDH9, IGLL1, PDE4D, ARPP21, CD74, TOP2A, TUBB, HMGB1, TMEM131L, AKAP12 
##     PAX5, TCF4, AUTS2, RUBCNL, HMGN2, NUSAP1, RCSD1, NIBAN3, UBE2C, ZCCHC7 
## PC_ 3 
## Positive:  HBA1, HBA2, HBB, SLC25A37, BLVRB, HMBS, FECH, BPGM, PRDX2, PCDH9 
##     SLC25A39, STRADB, BNIP3L, AFF3, BACH2, FAM210B, SLC2A1, UROD, SSBP2, STOM 
##     MME, CD79B, CAT, GYPC, MZB1, YBX3, BCL2L1, CASC15, DNTT, ERG 
## Negative:  TUBA1B, H4C3, HMGB2, S100A4, TOP2A, IFITM1, B2M, TMSB4X, IL32, SRGN 
##     TUBB, S100A6, JUNB, UBE2C, CRIP1, LGALS1, SKAP1, ACTB, VIM, COTL1 
##     CD247, NUSAP1, HMGN2, AOAH, IFI30, TRAC, H2AZ1, CTSS, PTPRC, MKI67 
## PC_ 4 
## Positive:  LGALS1, S100A4, IFI30, FTL, S100A6, LST1, CTSS, CCDC200, NAMPT, CD74 
##     AIF1, SAT1, FTH1, TYMP, RBM47, LYN, NEAT1, SRGN, CD68, PSAP 
##     ANXA2, CFP, TSPO, KLF4, COTL1, TEX14, ZEB2, GPX1, FOS, GRN 
## Negative:  SKAP1, BCL11B, IL32, CAMK4, CD247, IFITM1, TRAC, TRBC2, PITPNC1, PRKCH 
##     RPS27, PCED1B-AS1, LCK, INPP4B, IL7R, LEPROTL1, ZNF331, TXK, MAML2, CD96 
##     RPS29, RNF125, PDE3B, RPL21, PRKCA, ABLIM1, TCF7, RORA, RPL13, RPL34 
## PC_ 5 
## Positive:  HBA1, HBB, HBA2, SLC25A37, BLVRB, PRDX2, HMBS, TOP2A, ARL6IP1, PTTG1 
##     CENPF, FECH, JAZF1, MGST3, UBE2C, HMGB2, BACH2, JUNB, CCNB1, CCNB2 
##     TUBB4B, UROD, CENPE, SLC25A39, SLC2A1, HMMR, FAM210B, CDC20, STOM, NUSAP1 
## Negative:  GAPDH, RPS2, EEF1A1, RPS18, RPLP1, DNTT, RPL10, NPM1, RPLP0, RPS3A 
##     RPL39, RPS12, MIF, RPL7, RPL36A, HNRNPA1, RPL3, IGLL1, UHRF1, RPS6 
##     RPS19, STMN1, HSP90AB1, ENO1, NME2, FABP5, RPL13, PTMA, ACTG1, RPL41

seurat_object_sample <- RunUMAP(seurat_object_sample,
                         reduction = "pca",
                         dims = 1:15)

## 19:13:41 UMAP embedding parameters a = 0.9922 b = 1.112

## 19:13:41 Read 5500 rows and found 15 numeric columns

## 19:13:41 Using Annoy for neighbor search, n_neighbors = 30

## 19:13:41 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 19:13:41 Writing NN index file to temp file /var/folders/9s/hsfl94x96vn2g543r67q4vpn6rmr26/T//RtmpUfmyXk/file1654f56c18462
## 19:13:41 Searching Annoy index using 1 thread, search_k = 3000
## 19:13:43 Annoy recall = 100%
## 19:13:44 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 19:13:45 Initializing from normalized Laplacian + noise (using RSpectra)
## 19:13:45 Commencing optimization for 500 epochs, with 221882 positive edges
## 19:13:45 Using rng type: pcg
## 19:13:53 Optimization finished

uncorrected_plot_sample <- DimPlot(seurat_object_sample,
                            reduction = "umap",
                            group.by = "SampleName") + 
  ggtitle("Uncorrected data - Sample")

seurat_object_sample <- FindNeighbors(seurat_object_sample, 
                                      reduction = "pca",
                                dims = 1:15)

## Computing nearest neighbor graph
## Computing SNN

seurat_object_sample <- FindClusters(seurat_object_sample, 
                              cluster.name = "uncorrected_clusters")

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 5500
## Number of edges: 181101
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8754
## Number of communities: 19
## Elapsed time: 0 seconds

uncorrected_clusters_sample_plot <- DimPlot(seurat_object, reduction = "umap", 
                                     group.by ="uncorrected_clusters") + ggtitle("Uncorrected data clusters - Sample")

uncorrected_clusters_sample_plot

# Run Harmony integration
harmony_object_sample <- IntegrateLayers(object = seurat_object_sample,
                                  method = HarmonyIntegration,
                                  orig.reduction = "pca",
                                  new.reduction = "harmony_sample")

## Transposing data matrix

## Using automatic lambda estimation

## Initializing state using k-means centroids initialization

## Harmony 1/10

## Harmony 2/10

## Harmony 3/10

## Harmony 4/10

## Harmony 5/10

## Harmony 6/10

## Harmony converged after 6 iterations

harmony_object_sample <- RunUMAP(harmony_object_sample,
                          reduction = "harmony_sample",
                          dims = 1:15)

## 19:14:12 UMAP embedding parameters a = 0.9922 b = 1.112

## 19:14:12 Read 5500 rows and found 15 numeric columns

## 19:14:12 Using Annoy for neighbor search, n_neighbors = 30

## 19:14:12 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 19:14:13 Writing NN index file to temp file /var/folders/9s/hsfl94x96vn2g543r67q4vpn6rmr26/T//RtmpUfmyXk/file1654f3296b02e
## 19:14:13 Searching Annoy index using 1 thread, search_k = 3000
## 19:14:15 Annoy recall = 100%
## 19:14:18 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 19:14:20 Initializing from normalized Laplacian + noise (using RSpectra)
## 19:14:20 Commencing optimization for 500 epochs, with 232812 positive edges
## 19:14:20 Using rng type: pcg
## 19:14:28 Optimization finished

harmony_object_sample_plot <- DimPlot(harmony_object_sample, 
                                     reduction = "umap", 
                                     group.by = "SampleName") + 
  ggtitle("Harmony integrated data - Sample")

uncorrected_plot_sample + harmony_object_sample_plot

harmony_object_sample <- FindNeighbors(harmony_object_sample, 
                                       reduction = "harmony_sample",
                                dims = 1:15)

## Computing nearest neighbor graph
## Computing SNN

harmony_object_sample <- FindClusters(harmony_object_sample, 
                              cluster.name = "harmony_sample_clusters")

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 5500
## Number of edges: 207415
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8744
## Number of communities: 14
## Elapsed time: 0 seconds

harmony_object_sample_clusters_plot <- DimPlot(harmony_object_sample, reduction = "umap", 
                                     group.by ="harmony_sample_clusters") + ggtitle("Harmony integrated data clusters - Sample")


 
uncorrected_clusters_sample_plot + harmony_object_sample_clusters_plot

7.2.3 Adjusting theta

The theta parameter controls the diversity of datasets within clusters, with higher values leading to more diverse clusters. If we have a strong batch effect we may want to increase this value to encourage more mixing of the batches within clusters. However, if we have a weaker batch effect or if we are concerned about over-correction we may want to decrease this value to allow for more batch-specific clusters. In our data we know we should have some celltypes that are batch specific so we may want to try a lower value.

# Run Harmony integration with adjusted theta
harmony_object_theta <- IntegrateLayers(object = seurat_object, 
                                  method = HarmonyIntegration,
                                  orig.reduction = "pca", 
                                  new.reduction = "harmony_theta_01",
                                  theta = 0.1)

## Transposing data matrix

## Using automatic lambda estimation

## Initializing state using k-means centroids initialization

## Harmony 1/10

## Harmony 2/10

## Harmony 3/10

## Harmony converged after 3 iterations

harmony_object_theta <- RunUMAP(harmony_object_theta,
                          reduction = "harmony_theta_01", 
                          dims = 1:15)

## 19:14:39 UMAP embedding parameters a = 0.9922 b = 1.112

## 19:14:39 Read 5500 rows and found 15 numeric columns

## 19:14:39 Using Annoy for neighbor search, n_neighbors = 30

## 19:14:39 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 19:14:39 Writing NN index file to temp file /var/folders/9s/hsfl94x96vn2g543r67q4vpn6rmr26/T//RtmpUfmyXk/file1654f7735dae0
## 19:14:39 Searching Annoy index using 1 thread, search_k = 3000
## 19:14:41 Annoy recall = 100%
## 19:14:41 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 19:14:43 Initializing from normalized Laplacian + noise (using RSpectra)
## 19:14:43 Commencing optimization for 500 epochs, with 223392 positive edges
## 19:14:43 Using rng type: pcg
## 19:14:50 Optimization finished

harmony_object_theta_plot <- DimPlot(harmony_object_theta, 
                                     reduction = "umap", 
                                     group.by = "SampleName") + 
  ggtitle("Harmony integrated data with theta = 0.1")

harmony_object_theta <- FindNeighbors(harmony_object_theta, 
                                       reduction = "harmony_theta_01",
                                dims = 1:15)

## Computing nearest neighbor graph
## Computing SNN

harmony_object_theta <- FindClusters(harmony_object_theta,
                              cluster.name = "harmony_theta_01_clusters")

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 5500
## Number of edges: 186686
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8900
## Number of communities: 18
## Elapsed time: 0 seconds

harmony_object_theta_clusters_plot <- DimPlot(harmony_object_theta, 
                                              reduction = "umap", 
                                              group.by ="harmony_theta_01_clusters") + 
  ggtitle("Harmony integrated data clusters with theta = 0.1")

harmony_object_theta_plot + harmony_object_theta_clusters_plot

7.3 Correction Diagnostics

7.3.1 Has the correction worked?

We can visually assess whether the correction has worked by looking at the UMAP plots like the above to see if the cells are grouped together more by cluster rather than by sample, this is an indication that the batch effect has been removed. If we have enough cells we can also look at the distribution of samples within clusters to see if they are more mixed after integration. With our downsampled dataset this is difficult.

We can also make a barplots to check if the clusters have mixed sufficiently.

#Make a barplot of the clusters and samples
data.frame(Cluster = seurat_object$uncorrected_clusters, Sample = seurat_object$SampleName) %>%
  ggplot(aes(x = Cluster)) +
    geom_bar(aes(fill = Sample), position = "fill") +
    labs(title = "Uncorrected data") +
  scale_y_continuous(labels = scales::percent)

#Make a barplot of the clusters and samples
data.frame(Cluster = harmony_object_theta$harmony_theta_01_clusters, Sample = harmony_object_theta$SampleName) %>%
  ggplot(aes(x = Cluster)) +
    geom_bar(aes(fill = Sample), position = "fill") +
    labs(title = "Harmony Corrected data") +
  scale_y_continuous(labels = scales::percent)

In the corrected barplot we have less clusters and the clusters are more mixed across samples, suggesting that the batch effect has been removed. However, we can still see some clusters that are mostly made up of cells from a single sample. Biologically this is not unexpected so this should reassure us that the correction has done what it should.

7.3.2 Has the correction over worked?

From our plots we can see that the cells have been more integrated, however we must also ensure that in doing this we have not removed genuine biological heterogeneity.

We can do this visually by looking at markers of celltypes. For example if cells of different types merge together after integration this may be an indication that the integration has removed some of the biological signal. We can also look at the expression of known marker genes for different cell types to see if they are still expressed in the expected clusters after integration.

# CD79A (B cells)
# CST3 (monocytes)
# CD3D (T cells)
# HBA1 (erythrocytes)

FeaturePlot(seurat_object, features = c("CD79A", "CST3", "CD3D", "HBA1"), 
            reduction = "umap")

FeaturePlot(harmony_object_theta, features = c("CD79A", "CST3", "CD3D", "HBA1"), 
            reduction = "umap")

We can see that for all of these celltype markers the expression is still in the expected clusters after integration, suggesting that the integration has not removed too much of the biological signal.

Another method is to use the Adjusted Rand Index. We can use the adjusted Rand index to quantify the agreement between the clusterings before and after batch correction.

The adjusted Rand index gives us a measure of the proportion of pairs of cells that have the same relative status in both clusterings - i.e. they were are in the same cluster in both clusterings or they are in different clusters in both clusterings, as oppose to e.g. being in the same cluster in one clustering and different clusters in the other.

Larger indices are more desirable as this indicates that within-batch heterogeneity is preserved, though this must be balanced against the ability of each method to actually perform batch correction.

We can use the pairwiseRand function from the bluster package. The input is a list of clusterings. We can test all three of our integrated clusterings against their uncorrected clustering to see how much they differ.

library(bluster)
# Calculate the adjusted Rand index
# Input is a vector of the clusters
ari.SG <- pairwiseRand(seurat_object$uncorrected_clusters,
                    harmony_object$harmony_clusters,
                         mode = "index")

ari.SG

## [1] 0.3776693

ari.SN <- pairwiseRand(seurat_object_sample$uncorrected_clusters,
                    harmony_object_sample$harmony_sample_clusters,
                         mode = "index")

ari.SN

## [1] 0.3106236

ari.T <- pairwiseRand(seurat_object$uncorrected_clusters,
                    harmony_object_theta$harmony_theta_01_clusters,
                         mode = "index")

ari.T

## [1] 0.5711085

We can see that integrating on samples a gives us a worse Adjusted Rand Index compared to sample group and decreasing the theta parameter gives us a better Adjusted Rand Index, suggesting that the integration with theta = 0.1 has preserved more of the within batch heterogeneity than the other two integrations. An ARI of 0.59 is still less that ideal and so we may want to try other parameter changes or algorithms other than Harmony.

7.4 Our final dataset

Once we are happy with the integration we can rejoin our dataset for the downstream analyis. We must make sure we specify the correct assay to rejoin, in this case the “RNA” assay. The current active assay is “SCT” but the “RNA” assay contains the original, uncorrected data that we want to use for downstream analyses and is what needs to be rejoined.

# Rejoin the data
harmony_object_theta <- JoinLayers(harmony_object_theta, assay = "RNA")

A warning: Do not use the batch corrected values for gene based downstream analyses like differential expression. Instead, you should use the original, uncorrected data (in the “RNA or”SCT” layers) for gene-based analyses and use the batch-corrected data for visualisation and clustering.

Harmony is often faster for large datasets, whereas MNN is often preferred when strong, non-linear batch effects are present.

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.7.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bluster_1.20.0     future_1.69.0      patchwork_1.3.2    lubridate_1.9.4   
##  [5] forcats_1.0.1      stringr_1.6.0      dplyr_1.2.0        purrr_1.2.1       
##  [9] readr_2.1.6        tidyr_1.3.2        tibble_3.3.1       ggplot2_4.0.2     
## [13] tidyverse_2.0.0    glmGamPoi_1.22.0   sctransform_0.4.3  Seurat_5.4.0      
## [17] SeuratObject_5.3.0 sp_2.2-0          
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          rstudioapi_0.18.0          
##   [3] jsonlite_2.0.0              magrittr_2.0.4             
##   [5] spatstat.utils_3.2-1        farver_2.1.2               
##   [7] rmarkdown_2.30              vctrs_0.7.1                
##   [9] ROCR_1.0-12                 DelayedMatrixStats_1.32.0  
##  [11] spatstat.explore_3.7-0      htmltools_0.5.9            
##  [13] S4Arrays_1.10.1             BiocNeighbors_2.4.0        
##  [15] SparseArray_1.10.8          sass_0.4.10                
##  [17] parallelly_1.46.1           KernSmooth_2.23-26         
##  [19] bslib_0.10.0                htmlwidgets_1.6.4          
##  [21] ica_1.0-3                   plyr_1.8.9                 
##  [23] plotly_4.12.0               zoo_1.8-15                 
##  [25] cachem_1.1.0                igraph_2.2.1               
##  [27] mime_0.13                   lifecycle_1.0.5            
##  [29] pkgconfig_2.0.3             Matrix_1.7-4               
##  [31] R6_2.6.1                    fastmap_1.2.0              
##  [33] MatrixGenerics_1.22.0       fitdistrplus_1.2-6         
##  [35] shiny_1.12.1                digest_0.6.39              
##  [37] S4Vectors_0.48.0            tensor_1.5.1               
##  [39] RSpectra_0.16-2             irlba_2.3.7                
##  [41] GenomicRanges_1.62.1        beachmat_2.26.0            
##  [43] labeling_0.4.3              progressr_0.18.0           
##  [45] timechange_0.4.0            spatstat.sparse_3.1-0      
##  [47] httr_1.4.7                  polyclip_1.10-7            
##  [49] abind_1.4-8                 compiler_4.5.2             
##  [51] withr_3.0.2                 S7_0.2.1                   
##  [53] BiocParallel_1.44.0         fastDummies_1.7.5          
##  [55] MASS_7.3-65                 DelayedArray_0.36.0        
##  [57] tools_4.5.2                 lmtest_0.9-40              
##  [59] otel_0.2.0                  httpuv_1.6.16              
##  [61] future.apply_1.20.1         goftest_1.2-3              
##  [63] glue_1.8.0                  nlme_3.1-168               
##  [65] promises_1.5.0              grid_4.5.2                 
##  [67] Rtsne_0.17                  cluster_2.1.8.1            
##  [69] reshape2_1.4.5              generics_0.1.4             
##  [71] gtable_0.3.6                spatstat.data_3.1-9        
##  [73] tzdb_0.5.0                  hms_1.1.4                  
##  [75] data.table_1.18.2.1         XVector_0.50.0             
##  [77] BiocGenerics_0.56.0         spatstat.geom_3.7-0        
##  [79] RcppAnnoy_0.0.23            ggrepel_0.9.6              
##  [81] RANN_2.6.2                  pillar_1.11.1              
##  [83] spam_2.11-3                 RcppHNSW_0.6.0             
##  [85] later_1.4.5                 splines_4.5.2              
##  [87] lattice_0.22-7              survival_3.8-6             
##  [89] deldir_2.0-4                tidyselect_1.2.1           
##  [91] miniUI_0.1.2                pbapply_1.7-4              
##  [93] knitr_1.51                  gridExtra_2.3              
##  [95] IRanges_2.44.0              Seqinfo_1.0.0              
##  [97] SummarizedExperiment_1.40.0 scattermore_1.2            
##  [99] RhpcBLASctl_0.23-42         stats4_4.5.2               
## [101] xfun_0.56                   Biobase_2.70.0             
## [103] matrixStats_1.5.0           stringi_1.8.7              
## [105] lazyeval_0.2.2              yaml_2.3.12                
## [107] evaluate_1.0.5              codetools_0.2-20           
## [109] cli_3.6.5                   uwot_0.2.4                 
## [111] xtable_1.8-4                reticulate_1.44.1          
## [113] jquerylib_0.1.4             harmony_1.2.4              
## [115] dichromat_2.0-0.1           Rcpp_1.1.1                 
## [117] globals_0.19.0              spatstat.random_3.4-4      
## [119] png_0.1-8                   spatstat.univar_3.1-6      
## [121] parallel_4.5.2              dotCall64_1.2              
## [123] sparseMatrixStats_1.22.0    listenv_0.10.0             
## [125] viridisLite_0.4.2           scales_1.4.0               
## [127] ggridges_0.5.7              rlang_1.1.7                
## [129] cowplot_1.2.0

Introduction to single-cell RNA-seq analysis

Dataset Integration and Batch Correction