1 QC, normalisation and batch correction

1.1 Introduction

The aim of these materials is to demonstrate how to use the Seurat R package to process scRNA-seq data. They have been designed as a supplement to the Introduction to Single-cell RNA-seq Analysis course developed by University of Cambridge/CRUK. Here we use the same dataset and follow the same general steps, but using Seurat as an alternative to the Bioconductor packages used in the main course.

1.2 Dataset

We will be using data from Caron et al. 2020. Some context for this dataset is given in their introduction:

Childhood acute lymphoblastic leukemia (cALL) is the most common pediatric cancer. It is characterized by bone marrow lymphoid precursors that acquire genetic alterations, resulting in disrupted maturation and uncontrollable proliferation. Nowaways, up to 85–90% of patients are cured, but others do not respond to treatment or relapse and die. The aim of the study is to characterise the heterogeneity of gene expression at the cell level, within and between patients. Precursor B cell ALL (B-ALL) represents ~85% of cases of cALL and precursor T cell ALL (T-ALL) ~15%, which can be further subdivided into more than a dozen molecular subtypes. The high hyper diploid cases (HHD) and those harbouring the t(12;21) [ETV6/RUNX1] rearrangement represent about ~60% of B-ALL cases and are associated with a good prognosis. Other less frequent (<10%) subtypes, such as KMT2A-rearranged, t(9;22) [BCR/ABL1] or T-ALL are associated with less favourable outcomes.

Caron et al. loaded thawed PBMMCs onto a 10X Genomics Chromium single cell platform (v2 chemistry). They aimed for 3,000 cells per sample and targeted 100,000 reads per cell by sequencing each sample on one lane of an Illumina HiSeq 4000 high-throughput sequencer (2x98 b.p. paired-end sequencing). They generated single cell gene expression data from 39,375 pediatric bone marrow mononuclear cells (PBMMCs) from eight cALL patients of common subtypes. Thus we have cells collected from four patients with ETV6/RUNX1 rearrangements, two HHD cases and two T-ALL cases. There are also PBMMCs from 3 healthy donors.

In the original paper they examined transcriptional variation within and between the cancers of different patients. Similarly, by the end of this tutorial we will have:

  • Performed QC, normalisation and batch correction on the data.
  • Identified the cell types in the different samples (see part 2).

1.3 Reading in the data

We start the analysis by loading the necessary packages:

library(Seurat)
library(sctransform)

# set seed for randomisation (e.g. UMAP dimension reduction)
set.seed = 123

We then read the data in, using a dedicated function that reads cellranger output. We will read in one sample from each sample type for demonstration purposes:

ETV6_RUNX1_1.data <- Read10X(data.dir = "Data/CellRanger_Outputs/SRR9264343/outs/filtered_feature_bc_matrix/")

HHD_1.data <- Read10X(data.dir = "Data/CellRanger_Outputs/SRR9264347/outs/filtered_feature_bc_matrix/")

PRE_T_1.data <- Read10X(data.dir = "Data/CellRanger_Outputs/SRR9264349/outs/filtered_feature_bc_matrix/")

PBMMC_1.data <- Read10X(data.dir = "Data/CellRanger_Outputs/SRR9264351/outs/filtered_feature_bc_matrix/")

This imports the matrix of counts into objects of class dgCMatrix. We can use these to make Seurat objects for each dataset, which is what we will use in the analysis:

ETV6_RUNX1_1 <- CreateSeuratObject(counts = ETV6_RUNX1_1.data, project = "ETV6_RUNX1_1")
HHD_1 <- CreateSeuratObject(counts = HHD_1.data, project = "HHD_1")
PRE_T_1 <- CreateSeuratObject(counts = PRE_T_1.data, project = "PRE_T_1")
PBMMC_1 <- CreateSeuratObject(counts = PBMMC_1.data, project = "PBMMC_1")

We can then merge the Seurat objects, storing all information in a single object for ease of use. We will name our Seurat object call as in cALL - childhood Acute Lymphoblastic Leukemia.

call <- merge(ETV6_RUNX1_1, 
              y = c(HHD_1, PRE_T_1, PBMMC_1), 
              add.cell.ids = c("E1", "H1", "PT1", "PB1"), 
              project = "cALL")

This is the Seurat object with the datasets combined

call
## An object of class Seurat 
## 37764 features across 14809 samples within 1 assay 
## Active assay: RNA (37764 features, 0 variable features)

1.3.1 The Seurat Object

There are two important components of the Seurat object to be aware of:

  • The @meta.data slot, which stores metadata for our droplets/cells (e.g. which batch of samples they belong to, total counts, total number of detected genes, etc.).
  • The @assays slot, which stores the matrix of raw counts, as well as (further down) matrices of normalised/transformed data.

Starting with our metadata slot:

head(call@meta.data)
##                         orig.ident nCount_RNA nFeature_RNA
## E1_AAACCTGAGACTTTCG-1 ETV6_RUNX1_1       8315         2920
## E1_AAACCTGGTCTTCAAG-1 ETV6_RUNX1_1      14916         4318
## E1_AAACCTGGTGCAACTT-1 ETV6_RUNX1_1       1567          931
## E1_AAACCTGGTGTTGAGG-1 ETV6_RUNX1_1      10433         3628
## E1_AAACCTGTCCCAAGTA-1 ETV6_RUNX1_1      10410         3320
## E1_AAACCTGTCGAATGCT-1 ETV6_RUNX1_1       2439         1384

We can see that the droplet/cell ids, given in the rownames of this table, have prefixes added (based on the add.cell.ids option we used above with the merge() function). The “orig.ident” has the sample names we also specified earlier. And we can see that Seurat automatically calculated total UMI counts (nCount_RNA) and the total number of detected genes (nFeature_RNA) in each droplet.

Moving on to the assays slot, we can see that it currently contains just the count data, called $RNA. Normalised data of different sorts will also get stored here.

call@assays
## $RNA
## Assay data with 37764 features for 14809 cells
## First 10 features:
##  MIR1302-2HG, FAM138A, OR4F5, ENSG00000238009, ENSG00000239945,
## ENSG00000239906, ENSG00000241860, ENSG00000241599, ENSG00000286448,
## ENSG00000236601

Being aware of the active assay is important when doing different types of analysis because tools will try to use the active assay by default if they can. Note that normally raw counts (the RNA assay) are used for differential expression e.g. calling markers. Conversely, normalised data is used to identify cell types e.g. in drawing UMAP plots. Here, we start by defining our default assay to the raw counts, using the DefaultAssay() function:

DefaultAssay(call) <- 'RNA'
call@active.assay
## [1] "RNA"

1.4 Quality Control

We have already seen some useful stats for QC e.g. feature counts. In addition, the number of reads mapping to mitochondrial genes is useful because high numbers are indicative of poor quality cells.

Another simple metric that we can quickly calculate is the number of genes detected in the whole dataset (i.e. those that are true for this statement):

table(rowSums(call@assays$RNA) > 0)
## 
## FALSE  TRUE 
##  6591 31173

1.4.1 % Mitocondrial Reads

One of the basic quality metrics used for filtering cells is the percentage of reads mapping to the mitochondrial genome. A high level of reads from a particular cell mapping to mitochondrial genes is indictative of poor a quality transcriptome from that cell. In particular this may be a cell which has been caused to initiate apoptosis during the processing steps of the experiment. To get these numbers we can take advantage of the fact that (in human) all mitochondrial gene names start with the ‘MT-’ prefix. This is probably not the case for your non-human species of interest, although for mouse it is usually ‘Mt’.

The following command will add the percent of reads mapping to mitochondrial genes to the meta data.

call[["percent.mt"]] <- PercentageFeatureSet(call, pattern = "^MT-")

head(call@meta.data)
##                         orig.ident nCount_RNA nFeature_RNA percent.mt
## E1_AAACCTGAGACTTTCG-1 ETV6_RUNX1_1       8315         2920   3.523752
## E1_AAACCTGGTCTTCAAG-1 ETV6_RUNX1_1      14916         4318   3.821400
## E1_AAACCTGGTGCAACTT-1 ETV6_RUNX1_1       1567          931  27.504786
## E1_AAACCTGGTGTTGAGG-1 ETV6_RUNX1_1      10433         3628   4.092783
## E1_AAACCTGTCCCAAGTA-1 ETV6_RUNX1_1      10410         3320   5.072046
## E1_AAACCTGTCGAATGCT-1 ETV6_RUNX1_1       2439         1384   3.690037

1.4.2 Visualising Counts

We can use the VlnPlot() function to visualise data stored in our metadata slot. This function will generate violin plots with samples on the x-axis and the variable of interest on the y-axis.

For example, to plot raw counts per sample:

And total features per sample:

And the percentage of mitochondrial reads per sample:

We can see that the different samples differ quite a lot in terms of the distributions of these statistics. Why might that be?

1.5 Filtering cells

The simplest way to filter out cells is with hard cutoffs across all samples. However, it may be preferable to use different cutoffs on different samples (because different samples may have been sequenced to different depths).

Let’s look at the relationships between the QC metrics:

Those cells with high percent.mt also tend to have low read counts, extra evidence that these are low quality cells which will not be useful for our analysis.

FeatureScatter(call, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

Read count is well correlated with feature count. This is what we would expect. If there were cells with lots of reads but few features this would suggest problems with the capture of diverse RNA molecules in those cells.

Based on this exploration, we will pick some thresholds to remove the most outlying droplets: with greater than 200 and less than 6000 detected features. This excludes poor quality cells with little data and potential doublets which have more features than we expect to see in the cells. We will also exclude droplets with less than 10% mitochondrial reads, which should exclude cells undergoing apoptosis.

We might want to come back and adjust these cut offs once we have seen the UMAP plots. It may be that we still get clusters of low quality cells that should have been removed. Better to be cautious to begin with, to avoid filtering out unusual cell types.

# define condition for filtering
cells_to_filter <- rownames(subset(call, subset = nFeature_RNA > 200 & nFeature_RNA < 6000 & percent.mt < 10)@meta.data)
# add this information to the metadata
call$keep <- rownames(call@meta.data) %in% cells_to_filter

We can quickly tabulate how many cells we are keeping:

table(call$keep)
## 
## FALSE  TRUE 
##  2054 12755

Let’s look back at our distribution plots, highlighting the cells that we are retaining and filtering.

VlnPlot(call, features = c("nCount_RNA"), log=TRUE, split.by="keep")

VlnPlot(call, features = c("nFeature_RNA"), split.by="keep")

VlnPlot(call, features = c("percent.mt"), split.by="keep")

Once we are happy with the filtering, we can remove the low quality cells from the Seurat object. Note that features here are genes and samples are cells.

call <- subset(call, subset = keep)
call
## An object of class Seurat 
## 37764 features across 12755 samples within 1 assay 
## Active assay: RNA (37764 features, 0 variable features)

1.6 Normalisation

The algorithm scTransform has been shown to improve normalisation over simpler methods. Furthermore, it does normalisation, scaling and finding variable features in one step.

call <- SCTransform(call, 
                    vars.to.regress = "percent.mt", 
                    variable.features.n = 3000)

The normalised data are now under $SCT in the assay slot:

call@assays
## $RNA
## Assay data with 37764 features for 12755 cells
## First 10 features:
##  MIR1302-2HG, FAM138A, OR4F5, ENSG00000238009, ENSG00000239945,
## ENSG00000239906, ENSG00000241860, ENSG00000241599, ENSG00000286448,
## ENSG00000236601 
## 
## $SCT
## SCTAssay data with 26354 features for 12755 cells, and 1 SCTModel(s) 
## Top 10 variable features:
##  LYZ, S100A8, S100A9, IGKC, ACSM3, ENSG00000257764, HBB, HBA1, HBA2,
## GNLY

1.7 Dimensionality reduction

1.7.1 Variable features

Let’s have a look at the variable features (genes). Do you recognise any of them?

top10 <- head(VariableFeatures(call), 10)
plot1 <- VariableFeaturePlot(call)
LabelPoints(plot = plot1, points = top10, repel = TRUE)

1.7.2 Linear dimension reduction (PCA)

We do linear dimension reduction (PCA) to determine the principal components of variation in the data. Note the genes associated with the first 5 PCs. Is there anything noticeable about them? If there are recognisable groups of genes associated with a particular PC this likely represents a strong biological signals (the most distinct cell type, variation in cell cycle stage, or a technical problem (low quality cells).

call <- RunPCA(call, features = VariableFeatures(object = call))
## PC_ 1 
## Positive:  CD74, TCL1A, HLA-DRA, CD79B, VPREB1, EBF1, BACH2, PSD3, HLA-DPB1, ENSG00000287092 
##     TSC22D3, MAML3, AUTS2, PDE4D, HLA-DRB1, ZEB2, HLA-B, RPL10, BANK1, HLA-DPA1 
##     MEF2C, CD79A, CD52, IGHM, S100A4, KLF2, YBX3, CD9, MME, LAPTM5 
## Negative:  CHI3L2, FXYD2, TOX, TRBC1, ENSG00000259097, MIR181A1HG, CASC15, CCDC26, BCL11B, ENSG00000271955 
##     GNAQ, CD3D, ENSG00000236283, SPRED2, DAB1, LINC01811, TSHR, TRBC2, ITM2A, PTPRM 
##     CD1E, KIAA1217, ALDH1A2, MAL, PLCL1, CAMK4, ELOVL4, ENSG00000259345, THEMIS, CEP85L 
## PC_ 2 
## Positive:  LINC03000, CACNB2, HLA-DRA, MDGA2, TUBB, LINC02945, DNTT, NEGR1, ENSG00000290032, H4C3 
##     RNF220, STMN1, TUBA1B, CALN1, ENSG00000233064, PCLO, ENSG00000189229, ENSG00000285534, MSR1, HLA-DPB1 
##     HLA-DRB5, SMAD1, PRKD1, ENSG00000229839, MDK, ENSG00000227706, LINC01416, H2BC11, ENSG00000289605, H1-0 
## Negative:  BACH2, CD79B, VPREB1, AFF3, TSC22D3, KLF2, LINC-PINT, MAML3, PDE4D, SETBP1 
##     S100A4, IGKC, CD9, BANK1, ZEB2, MS4A1, BLK, LYN, CD37, B4GALT1 
##     BTG1, IL3RA, TMSB4X, IGHM, LINC01619, OSBPL10, LGALS1, KLHL2, CDKL1, S100A6 
## PC_ 3 
## Positive:  VPREB1, BACH2, CD79B, EBF1, TCL1A, AFF3, PSD3, MAML3, TSC22D3, CD9 
##     ENSG00000287092, MME, PDE4D, GAB1, H4C3, YBX3, SETBP1, LARGE1, BANK1, AUTS2 
##     KLF2, CSGALNACT1, MIR181A1HG, TCF4, FLT3, ENSG00000287172, SSBP2, CDKL1, HMGB2, BLK 
## Negative:  TYROBP, S100A6, FCER1G, SRGN, S100A11, LYZ, S100A9, FTL, LGALS1, S100A4 
##     CST3, S100A8, FCN1, ENSG00000257764, CSTA, S100A10, VCAN, CFD, MNDA, S100A12 
##     SERPINA1, LRMDA, TYMP, NAMPT, CTSS, COTL1, CD14, RAB31, IFI30, CLEC7A 
## PC_ 4 
## Positive:  H4C3, HMGB2, TUBA1B, TOP2A, TUBB, NUSAP1, UBE2C, CENPF, MKI67, H2AZ1 
##     HMGN2, RRM2, PTTG1, BIRC5, SMC4, AURKB, CENPE, STMN1, TYMS, TPX2 
##     TUBB4B, PRC1, GTSE1, PCLAF, ASPM, ATAD2, ARL6IP1, HMGB1, KPNA2, CDK1 
## Negative:  ENSG00000271955, LINC03000, CASC15, CACNB2, LINC02945, HLA-DRA, MDGA2, CALN1, ENSG00000290032, ENSG00000227706 
##     STAG3, ENSG00000259345, ENSG00000287092, EEF1A1, KDM5B, RPS27, H2AC6, RPS15A, DNTT, ADGRL3 
##     LTB, LINC01416, SMAD1, ENSG00000229839, CD74, HLA-DPB1, PCLO, CCDC26, ANKFN1, SKAP1 
## PC_ 5 
## Positive:  NKG7, IFITM1, CCL5, RPL10, SKAP1, VPREB1, RPS4X, LNCAROD, FLT3, GZMA 
##     IL32, CST7, CTSW, KLRB1, SYT1, S100A16, GNLY, RPL23A, TC2N, RPS2 
##     TMSB4X, TRAC, RPS3A, RPL13, RPS18, RPS26, MAML3, GZMM, GZMH, SAMD3 
## Negative:  ACSM3, IGKC, BACH2, IGLL5, IGHM, AFF3, SYK, PDE4D, PCDH9, IGLC2 
##     KLHL14, ADAM23, LYN, HLA-DRA, IGLC3, ROR1, RGS2, MCTP2, RUBCNL, FCRL1 
##     CD79B, LYZ, CD74, MTSS1, DTX1, TMEM131L, S100A8, CST3, FCRLA, ENSG00000224610

The result of the PCA is stored in the reductions slot, here as $pca:

call@reductions
## $pca
## A dimensional reduction object with key PC_ 
##  Number of dimensions: 50 
##  Projected dimensional reduction calculated:  FALSE 
##  Jackstraw run: FALSE 
##  Computed using assay: SCT

We can plot the first two principal components using the DimPlot() function. What do you notice about the different samples on the PCA plot?

DimPlot(call, reduction = "pca", group.by="orig.ident")

1.7.3 Number of components for UMAP

We need to pick a number of PCs which capture most of the variance in the data. These will be used to generate a UMAP plot. How many PCs should we use to describe the data? This plot shows how much variance is captured by each PC.

ElbowPlot(call)

A large proportion of the variation is captured by 16 components and there is a drop off thereafter. Let’s go with 16.

1.7.4 Non-linear dimension reduction - UMAP

Non-linear dimension reduction methods such as UMAP and TSNE take the PCA data as a starting point, but are able to take more complex (non-linear) patterns hidden in the data and represent them in only two dimensions (which humans are good at examining).

call <- RunUMAP(call, dims = 1:16)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
DimPlot(call, group.by="orig.ident")

There is a strong batch effect. Most of the cells separate based on the sample. If the cells really are this different between samples it makes it difficult (perhaps impossible) to compare the same cell types between the datasets and differentiate technical differences from biological differences. We need to do “batch correction” to better integrate the data before further analysis.

If we plot the expression of the B cell marker CD79A (below), we can see that there are several distinct B cell clusters. These could be distinct types of B cell, but if this were true and ETV6_RUNX1_1 (red cells in plot above) has no cells comparable to the other datasets, then we simply can’t compare them.

FeaturePlot(call, reduction = "umap", features = c('CD79A'))

1.8 Batch Correction

As an example of batch correction, we will start with two technical replicates. One is the PBMMC_1 sample that we have already seen (PBMMC1a), the other is a technical replicate derived from the same sample material (PBMMC1b). Whilst the two samples come from distinct 10X runs, they are derived from the same starting material and therefore, if there was no batch effect, they should be close to identical.

Here we will use the integration and batch correction approach outlined in Stuart et al..

1.8.1 Read data

As we did before, we start by reading the count matrices from cellranger and create Seurat objects from them:

# PBMMC_1a = SRR9264351
PBMMC_1a.data <- Read10X(data.dir = "Data/CellRanger_Outputs/SRR9264351/outs/filtered_feature_bc_matrix/")
PBMMC_1a <- CreateSeuratObject(counts = PBMMC_1a.data, project = "PBMMC_1a")

# PBMMC_1b = SRR9264352
PBMMC_1b.data <- Read10X(data.dir = "Data/CellRanger_Outputs/SRR9264352/outs/filtered_feature_bc_matrix/")
PBMMC_1b <- CreateSeuratObject(counts = PBMMC_1b.data, project = "PBMMC_1b")

1.8.2 Merge, normalise and do dimension reduction

Next, we merge our datasets (without batch correction) and see whether there is a batch effect using PCA + UMAP:

PBMMC_merge <- merge(PBMMC_1a, y = c(PBMMC_1b), 
              add.cell.ids = c("PB1a", "PB1b"), project = "PBMMC")

# Normalise the data, do dimension reduction
PBMMC_merge <- SCTransform(PBMMC_merge, verbose = TRUE, variable.features.n = 3000)
PBMMC_merge <- RunPCA(PBMMC_merge, npcs = 30, verbose = FALSE)
PBMMC_merge <- RunUMAP(PBMMC_merge, reduction = "pca", dims = 1:30)

Looking at our UMAP, we can see that the technical replicates cluster together but clearly don’t overlap well.

DimPlot(PBMMC_merge, reduction = "umap", group.by="orig.ident")

So, there seems to be a batch effect, but will it affect the results?

1.8.3 Clustering of un-corrected cells

Let’s call clusters and see how the replicates are distributed in the them. If there are clusters with lots of one replicate and little of the other, this suggests that these clusters are driven by technical variation, which ought to be removed before analysis.

The clustering is done in two steps: generate a nearest-neighbours graph; call clusters (or “communities”) on this graph.

PBMMC_merge <- FindNeighbors(PBMMC_merge, dims = 1:30)
PBMMC_merge <- FindClusters(PBMMC_merge, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2153
## Number of edges: 68762
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9127
## Number of communities: 14
## Elapsed time: 0 seconds
table(Cluster = PBMMC_merge$SCT_snn_res.0.5, Batch = PBMMC_merge$orig.ident)
##        Batch
## Cluster PBMMC_1a PBMMC_1b
##      0       130      212
##      1        16      304
##      2        20      241
##      3       248        0
##      4        46      112
##      5       136       14
##      6        35       93
##      7        21       97
##      8        61       56
##      9       104        4
##      10       46       36
##      11       31       38
##      12       16       20
##      13        6       10

By tabulating how many cells from each technical replicate fall in each cluster, we can see that some clusters have almost none of one replicate and lots of the other. If we proceeded with the uncorrected data in our downstream analysis, it might lead us to conclude that there are biological differences between our samples, which in this case we know should really be identical.

1.8.4 Data Integration

We can perform batch correction by using a method of data integration implemented in Seurat, which aims to bring cells of different samples closer together (while retaining as much of the biological variance as possible).

Firstly, we normalize and identify variable features for each dataset independently and make a list of the normalised Seurat objects

# loop through the two samples
pbmmc_list <- lapply(c(PBMMC_1a, PBMMC_1b), function(x) {
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at -2.7913
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.30103
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 7.7858e-15

Then, we select features that are repeatedly variable across datasets which can be used for for integration. We are using a high value here (10000 features). You could try varying this to see if using fewer features affects the results.

features <- SelectIntegrationFeatures(object.list = pbmmc_list, nfeatures=10000)

Perform the integration

pbmmc_anchors <- FindIntegrationAnchors(object.list = pbmmc_list, anchor.features = features)
## Warning in CheckDuplicateCellNames(object.list = object.list): Some cell names
## are duplicated across objects provided. Renaming to enforce unique cell names.
pbmmc_combined <- IntegrateData(anchorset = pbmmc_anchors)

We now have an ‘integrated’ assay, as well as the standard ‘RNA’ assay.

pbmmc_combined@assays
## $RNA
## Assay data with 37764 features for 2153 cells
## First 10 features:
##  MIR1302-2HG, FAM138A, OR4F5, ENSG00000238009, ENSG00000239945,
## ENSG00000239906, ENSG00000241860, ENSG00000241599, ENSG00000286448,
## ENSG00000236601 
## 
## $integrated
## Assay data with 3027 features for 2153 cells
## Top 10 variable features:
##  HBM, HBA1, AHSP, CA1, HBA2, HBD, HBB, BLVRB, MPO, PRDX2

1.8.5 Analysis on corrected cells

To proceed with the corrected data, we need to change our default assay. Note that the original unmodified data still resides in the ‘RNA’ assay.

DefaultAssay(pbmmc_combined) <- "integrated"

Run the standard workflow for visualization and clustering:

pbmmc_combined <- ScaleData(pbmmc_combined, verbose = FALSE)
pbmmc_combined <- RunPCA(pbmmc_combined, npcs = 30, verbose = FALSE)
pbmmc_combined <- RunUMAP(pbmmc_combined, reduction = "pca", dims = 1:30)

Finally we are ready to visualise the corrected cells. The data look much better integrated, with a lot of overlapping between the different samples within clusters.

DimPlot(pbmmc_combined, reduction = "umap", group.by = "orig.ident")

We can move on to identify clusters of corrected cells and see how the replicates are distributed:

pbmmc_combined <- FindNeighbors(pbmmc_combined, dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN
pbmmc_combined <- FindClusters(pbmmc_combined, resolution = 0.8)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2153
## Number of edges: 80123
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8648
## Number of communities: 14
## Elapsed time: 0 seconds
table(Cluster = pbmmc_combined$integrated_snn_res.0.8, Batch = pbmmc_combined$orig.ident)
##        Batch
## Cluster PBMMC_1a PBMMC_1b
##      0       113      243
##      1       128      208
##      2        82      141
##      3       139       49
##      4        85       87
##      5        41      128
##      6        46      105
##      7        53       70
##      8        31       53
##      9        38       45
##      10       33       46
##      11       36       33
##      12       50       17
##      13       41       12

We can see that the clusters contain much more similar numbers of cells from each replicate than we had before. We can therefore that the data integration step helped to mitigate major batch effects in the data.

We are ready to move on to our downstream analysis in part 2.