5 Normalisation and Feature selection

Normalisation is a crucial step in single-cell RNA-seq data analysis. It aims to remove technical variations while preserving biological differences between cells. Various methods exist for normalising single-cell RNA-seq data, each with its own advantages and disadvantages.

We will demonstrate two normalisation methods implemented in Seurat and discuss some of the properties of single-cell count data that make normalisation necessary. We will also discuss the selection of highly variable genes, which is an important step in many downstream analyses.

While we provide brief explanations of the issues and challenges associated with normalisation and feature selection, we encourage you to read the Normalisation chapter in the Single Cell Best Practices online book:

Heumos, L., Schaar, A.C., Lance, C. et al. Best practices for single-cell analysis across modalities. Nat Rev Genet (2023). https://doi.org/10.1038/s41576-023-00586-w.

5.1 Setup

We will be using the following packages:

  • Seurat for the main analysis
  • sctransform for normalisation, with the additional glmGamPoi package, which makes it run faster
  • tidyverse for data manipulation and plotting and the patchwork package for combining plots
  • We also import specific functions from the SparseArray package for calculating summary statistics from matrices (we avoid importing the entire package to avoid conflicts with functions in the other packages)
# Load the libraries we will need for this practical
library(Seurat)
library(sctransform)
library(glmGamPoi)
library(tidyverse)
library(patchwork)
use("SparseArray", c("rowMeans", "rowVars", "colMeans", "colVars"))

# set default ggplot theme
theme_set(theme_classic())

For the purposes of this demonstration we have generated a smaller data set in which there are only 500 cells per sample. This is so that the code can be run in a reasonable amount of time during the live teaching session. The data were first QC’d and filtered as described in the QC and exploratory analysis session using thresholds set on a per sample basis. After this 500 cells were selected at random from each sample.

# Read preprocessed and filtered data object
seurat_object <- readRDS("RObjects/Filtered.500.rds")
seurat_object
## An object of class Seurat 
## 29786 features across 5500 samples within 1 assay 
## Active assay: RNA (29786 features, 0 variable features)
##  1 layer present: counts

We will demonstrate the effect of normalisation using a single sample to start with, with analysis of multi-sample data discussed later.

# Subset seurat object to include only PBMMC-1 sample for demonstration purposes
seurat_pbmmc1 <- subset(seurat_object,
                        subset = SampleName == "PBMMC-1")

# Remove genes that are not expressed in any of the 500 cells in this sample
pbmmc1_expressed_genes <- rowSums(seurat_pbmmc1[["RNA"]]$counts) > 0
seurat_pbmmc1 <- seurat_pbmmc1[pbmmc1_expressed_genes, ]

5.2 Why normalise?

Although we have done some basic filtering and QC, there remains variation in the data that may confound downstream analyses.

There are two critical aspects we will explore:

  • Total UMI differences between cells: Some cells may have been sequenced to a greater depth than others, resulting in higher total UMI counts for those cells, even if they are transcriptomically similar. This needs to be accounted for to ensure that differences in gene expression between cells are not simply due to differences in sequencing depth.
  • Mean-variance relationship across genes: In count data, the variance typically increases with the mean (heteroscedasticity). This can affect downstream analyses that assume stable variance across the range of expression values (homoscedasticity). Normalisation methods often include a transformation step to stabilise the variance.

Let’s explore both of these issues by looking at our raw counts.

5.2.1 Differences in total UMIs between cells

The plot below shows the total UMI counts per cell (transcript molecules) for 500 cells from the PBMMC-1 sample.

# Plot the total UMI counts across cells
# Get the raw counts data
seurat_pbmmc1[["RNA"]]$counts %>%
  # calculate the total UMI counts for each cell (column sums)
  colSums() %>%
  # convert the named vector to a data frame with cell names and UMI counts
  enframe(name = "cell", value = "total_umis") %>%
  # reorder the cells by their total UMI counts for plotting
  mutate(Cell = fct_reorder(cell, total_umis)) %>%
  # make a bar plot of the total UMI counts per cell
  ggplot(aes(x = Cell, y = total_umis)) +
  geom_col() +
  labs(x = "Cell",
       y = "Total cell UMI counts",
       title = "PBMMC-1: Before Normalization") +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(color = "firebrick")
  )

We can clearly see that UMI counts vary considerably between cells. Downstream analyses rely on comparing cells against each other to derive biological insights. However, differences in UMI counts make such comparisons problematic.

Why do UMI counts differ between cells?

  • Biological factors: cell subtype differences (e.g. cell size, transcription activity)
  • Technical factors: scRNA-seq data is inherently noisy due to:
    • Low mRNA content per cell
    • Cell-to-cell differences in mRNA capture efficiency
    • Variable sequencing depth
    • PCR amplification bias

What does normalisation do?

  • Normalisation corrects for cell-to-cell technical differences by standardising UMI count distributions across cells.
  • This ensures that each cell has comparable total counts.
  • As a result, downstream comparisons of relative gene expression between cells are valid.

5.2.2 Mean-variance relationship across genes

A common property of count data is that the variance increases with the mean (heteroscedasticity). Let’s explore this for our raw counts data by calculating the mean and variance for each gene across all cells and plotting the relationship.

# Get the raw counts data from the object
raw_cts <- seurat_pbmmc1[["RNA"]]$counts

# Calculate summary statistics for each gene
gene_raw_stats <- tibble(gene = rownames(raw_cts),
                         mean = rowMeans(raw_cts),
                         variance = rowVars(raw_cts))

# Plot the mean-variance relationship for the raw counts data
ggplot(gene_raw_stats, aes(x = mean, y = variance)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  geom_abline(intercept = 0, slope = 1,
              linewidth = 1, colour = "red") +
  labs(x = "Mean expression (log scale)",
       y = "Variance (log scale)",
       title = "PBMMC-1: Raw Counts - Mean vs Variance")

We can clearly see a strong relationship between the mean and variance of gene expression, with variance increasing as mean expression increases.

The line shows the x = y relationship, which would indicate a Poisson distribution where the variance equals the mean. In this case, we see the data is above the line, indicating that the variance is greater than the mean, which is a common property of RNA-seq data and is often modelled using a negative binomial distribution (as we will see later using sctransform).

5.3 Normalization strategies

The sparse nature of scRNA-seq data presents challenges for normalisation that differ from bulk RNA-seq approaches.

Normalisation methods can be broadly categorised into two classes:

  1. Spike-in methods
  • Use spike-in controls for normalisation
  • Not applicable to droplet-based techniques such as 10x Genomics
  1. Non-spike-in methods
  • Rely on the available counts data for normalisation
  • Examples include DESeq2, edgeR (TMM), library size normalisation, deconvolution, and sctransform

Normalisation generally includes two steps:

  1. Scaling
  • Calculate a cell-specific size factor that represents the relative sequencing bias for that cell
  • Divide all counts for each gene by this size factor to remove bias
  • This approach assumes that cell-specific biases affect all genes similarly
  1. Transformation
  • Common transformations include log2, square root, or Pearson residuals (as in sctransform)

Scaling methods typically produce normalised counts-per-million (CPM) or transcripts-per-million (TPM) values that account for sequencing depth. However, these values exhibit variance that increases with their mean (heteroscedasticity), whereas most statistical methods assume stable variance that is independent of the mean (homoscedasticity). The log transformation is a widely used variance-stabilising approach that works well for highly expressed genes in bulk RNA-seq but is less effective for sparse scRNA-seq data.

Methods such as DESeq2, edgeR-TMM, and library size normalisation were originally developed for bulk RNA-seq. When applied to scRNA-seq data, these methods systematically over- or under-estimate size factors because they typically exclude zero counts prior to normalisation. This limitation led to the development of scRNA-seq-specific methods such as deconvolution and sctransform, which better account for the sparsity inherent in single-cell data.

Comparison of size factor estimation methods: DESeq2, edgeR-TMM, and library size normalisation

5.4 Shifted-log transformation

A simple method of normalisation, which is still widely used, is to apply a type of log transformation to the counts data. To account for different total UMIs in each cell and zeros in the data, a few steps are taken:

  • Calculate a size factor for each cell, which in Seurat is the total UMI counts for that cell.
  • Multiply the scaled values by a constant (default is 10,000), to avoid extremely small values.
  • Add a pseudocount (default 1) to all counts before applying the log transformation, to avoid infinite values when count = 0.
  • Apply the log transformation to the adjusted counts.

Despite being a simple method, it is still widely used and has been shown to be effective at capturing biological variation in many cases. However, it does not address the issue of heteroskedasticity in the data, which can affect downstream analyses.

Let’s apply this normalisation, and then investigate how the mean-variance relationship across genes looks like after applying it.

# The default normalisation method (shifted-log transformation)
# 1. counts of each gene are divided by total cell UMIs
# 2. multiplied by a scale factor (default 10,000)
# 3. log-transformed with a pseudocount of 1
seurat_pbmmc1 <- NormalizeData(seurat_pbmmc1)
## Normalizing layer: counts
# A new layer named `data` is added to the RNA assay
seurat_pbmmc1
## An object of class Seurat 
## 22644 features across 500 samples within 1 assay 
## Active assay: RNA (22644 features, 0 variable features)
##  2 layers present: counts, data

We can see that the NormalizeData function has added a new layer named data to the RNA assay, which contains the normalised data. The raw counts are still available in the counts layer.

Let’s see if this improved the mean-variance relationship in the data. Similarly to before, we calculate the mean and variance for each gene, but this time using the log-normalised data, and plot the relationship.

# Get the log-normalised data from the object
lognorm_cts <- seurat_pbmmc1[["RNA"]]$data
gene_lognorm_stats <- tibble(gene = rownames(lognorm_cts),
                             mean = rowMeans(lognorm_cts),
                             variance = rowVars(lognorm_cts))

# Plot the mean-variance relationship for the log-normalised data
# improved for highly expressed genes
# but still heteroscedastic for lowly expressed genes
ggplot(gene_lognorm_stats, aes(x = mean, y = variance)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1,
              linewidth = 1, colour = "red") +
  labs(x = "Mean expression",
       y = "Variance",
       title = "PBMMC-1: Log Normalized - Mean vs Variance")

We can see that while the mean-variance relationship has somewhat improved for the highly expressed genes, the variance is still clearly increasing with the mean for more lowly expressed genes, so the data is still heteroscedastic.

This is a common issue with log-normalised data, and is one of the reasons why we use sctransform, which aims to stabilise the variance across the range of mean expression values.

Seurat provides a further function called ScaleData(), which attempts to further ameliorate this issue. However, the developers currently recommend using sctransform instead, which we introduce later.

5.5 Selection of highly variable genes

One objective of normalisation is to identify genes that are most variable across cells, as these are likely to be the most informative for downstream analyses such as clustering and dimensionality reduction.

In feature selection, we remove genes that are uninformative to improve computation and reduce noise, which should help reveal the true biology in the data. We assume that most low-level variance is due to technical effects rather than genuine biology, and that highly variable genes reflect real biological differences. While not perfect, this approach provides a principled way to interpret data with greater confidence.

Seurat offers the FindVariableFeatures() function to identify highly variable genes (HVGs) from the raw data. This identifies features that are outliers on a “mean variability plot”.

There are three available methods, summarised in the function’s documentation:

  • vst: First, fits a line to the relationship of log(variance) and log(mean) using local polynomial regression (loess). Then standardizes the feature values using the observed mean and expected variance (given by the fitted line). Feature variance is then calculated on the standardized values after clipping to a maximum (see clip.max parameter).

  • mean.var.plot (mvp): First, uses a function to calculate average expression (mean.function) and dispersion (dispersion.function) for each feature. Next, divides features into num.bin (default 20) bins based on their average expression, and calculates z-scores for dispersion within each bin. The purpose of this is to identify variable features while controlling for the strong relationship between variability and average expression.

  • dispersion (disp): selects the genes with the highest dispersion values.

We will use the variance stabilising transform (vst) method, which explicitly takes into account the mean-variance relationship we’ve been exploring.

Commonly used ways to decide on how many variable genes to select is to use a fixed number (e.g. 1000) or a fraction (e.g. the top 10%).

# Extract the 1000 most variable genes
# using the variance stabilising transformation method
seurat_pbmmc1 <- FindVariableFeatures(
  seurat_pbmmc1,
  selection.method = "vst",
  nfeatures = 1000
)
## Finding variable features for layer counts
VariableFeaturePlot(seurat_pbmmc1)

# Extract the top 10% most variable genes
seurat_pbmmc1 <- FindVariableFeatures(
  seurat_pbmmc1,
  selection.method = "vst",
  nfeatures = round(0.1 * nrow(seurat_pbmmc1))
)
## Finding variable features for layer counts
VariableFeaturePlot(seurat_pbmmc1)

The function’s default is to select 2000 features, but we will finally set this to 3000 to match what will be default in sctransform.

# Extract the 3000 most variable genes to match the default in sctransform
seurat_pbmmc1 <- FindVariableFeatures(
  seurat_pbmmc1,
  selection.method = "vst",
  nfeatures = 3000
)
## Finding variable features for layer counts
VariableFeaturePlot(seurat_pbmmc1)

The plot above looks like this is a reasonable number of genes to capture the biological variation rather than technical noise.

To extract the variable genes, we can use the VariableFeatures() function, which returns a vector of the names of the variable genes.

# Get the variable genes selected by the vst method
hvgs_vst <- VariableFeatures(seurat_pbmmc1)
head(hvgs_vst)
## [1] "HBB"  "HBA2" "HBA1" "HBM"  "AHSP" "HBD"

At this stage, we could use these normalised values and variable genes for downstream analyses, such as dimensionality reduction and clustering.

However, we will now demonstrate running the sctransform normalisation and compare the results to the log-normalised data.

5.6 sctransform normalisation

The log-normalisation method discussed above is a global-scaling approach that divides feature counts by the total counts per cell, multiplies by a scale factor (default = 10,000), and applies a natural log transformation with a pseudocount of 1. However, global-scaling methods assume that each cell originally contains the same number of RNA molecules, which is often not the case.

A key limitation of log-normalisation is that a correlation remains between the mean and variance of expression (heteroscedasticity). This affects downstream dimensionality reduction, as the principal components are often correlated with library size rather than biological variation.

sctransform addresses these issues by regressing library size out of raw counts and providing residuals as normalised and variance-stabilised expression values for downstream analyses. This approach assigns greater weight to lowly expressed genes that may show cell-type-specific expression, rather than highly expressed genes with broad expression patterns.

Effect of scaling normalisation

The sctransform package implements a statistical model based on regularised negative binomial regression to normalise scRNA-seq data. This model implicitly accounts for the two key issues discussed: widely varying total UMI counts across cells and the strong mean-variance relationship across genes.

Although sctransform is computationally more intensive than log-normalisation, it has proven more effective at capturing biological variation in many datasets, particularly for lowly expressed genes. It also allows adjustment for covariates that may drive technical variation, such as mitochondrial gene expression or cell cycle state, which can be regressed out during normalisation.

5.6.1 Running sctransform in Seurat

In Seurat this normalisation can be done using the SCTransform() function. This method also performs feature selection by default, selecting the top 3000 genes by residual variance. We can change this number when we run SCTransform() using the variable.features.n argument.

# Run sctransform normalisation
# regressing out the percentage of mitochondrial gene expression
seurat_pbmmc1 <- SCTransform(
  seurat_pbmmc1,
  vars.to.regress = "percent.mt",
  verbose = FALSE
)

# A new assay named `SCT` is added to the Seurat object
# 'counts' layer contains the corrected UMI counts
# 'data' layer contains the log-transformed counts data, with a pseudocount of 1 added
# 'scale.data' layer contains the Pearson residuals of the model fit
seurat_pbmmc1
## An object of class Seurat 
## 37494 features across 500 samples within 2 assays 
## Active assay: SCT (14850 features, 3000 variable features)
##  3 layers present: counts, data, scale.data
##  1 other assay present: RNA

This function returns the Seurat object with a new assay named SCT, which is automatically set as the default after running the function. This means that downstream functions will use the sctransform-normalised values by default.

This assay includes three layers:

  • The counts layer contains the depth-corrected UMI counts. These counts should now be directly comparable between cells as if each cell had been sequenced the same amount.
  • The data layer contains the log-transformed counts data, with a pseudocount of 1 added to avoid infinite values when count = 0.
  • The scale.data layer contains the residuals of the model fit. These values can be negative (meaning the gene was underexpressed relative to the model prediction) or positive (meaning the gene was overexpressed relative to the model prediction). The residuals will be automatically used in some of the downstream analysis implemented in Seurat.

As before, we can use the function VariableFeatures() to obtain the variable genes selected by sctransform:

# Get the variable genes selected by sctransform
# which are ranked by residual variance
hvgs_sct <- VariableFeatures(seurat_pbmmc1)
head(hvgs_sct)
## [1] "LYZ"             "S100A8"          "S100A9"          "ENSG00000257764" "ACSM3"           "LGALS1"

We can compare with the previously chosen variable genes from the log-normalised data to see how much overlap there is between the two methods.

# Compare the variable genes selected
# by vst on the raw data and from sctransform residual variance
length(intersect(hvgs_vst, hvgs_sct))
## [1] 845

We can see that while there is some overlap between the two sets of variable genes, the majority of genes are unique to each method.

So, which is the “correct” normalisation method? This is a difficult question to answer and there is still no consensus in the community about the ideal normalisation method for single-cell analysis.

5.6.2 Mean-variance relationship after sctransform

We can investigate the mean-variance relationship across genes after applying sctransform normalisation, to see if it has improved compared to the log-normalised data.

We look at the scale.data layer of the SCT assay, which contains the Pearson residuals of the model fit. These can be thought of as the variance-stabilised expression values, as the model has been fitted to account for the mean-variance relationship in the data.

# Compare the mean-variance relationship for the sctransform-normalised data
# Get summary statistics for the Pearson residuals from the sctransform normalisation
vst_cts <- seurat_pbmmc1[["SCT"]]$scale.data
gene_vst_stats <- tibble(gene = rownames(vst_cts),
                         mean = rowMeans(vst_cts),
                         variance = rowVars(vst_cts))

# Plot the mean-variance relationship for the sctransform-normalised data
ggplot(gene_vst_stats, aes(x = mean, y = variance)) +
  geom_point(alpha = 0.5) +
  labs(x = "Mean expression",
       y = "Variance",
       title = "PBMMC-1: sctransform Normalized - Mean vs Variance")

We can see that there is now essentially no relationship between the mean and variance of gene expression.

5.6.3 Total UMIs per cell after sctransform

We can also investigate how the total UMIs per cell look after normalisation with sctransform, and compare this to the raw counts.

# Compare the total UMI counts per cell for the raw counts and sctransform-normalised data

# get the total UMIs for raw UMIs
raw_cts_total <- seurat_pbmmc1[["RNA"]]$counts %>%
  colSums() %>%
  enframe(name = "cell", value = "total_counts")

# get the total UMIs for sctransform-normalised counts
sct_cts_total <- seurat_pbmmc1[["SCT"]]$counts %>%
  colSums() %>%
  enframe(name = "cell", value = "total_counts")

# Plot the total UMI counts per cell for the PBMMC-1 sample before normalisation
p_raw <- raw_cts_total %>%
  # reorder the cells by their total UMI counts for plotting
  mutate(Cell = fct_reorder(cell, total_counts)) %>%
  # make a bar plot of the total UMI counts per cell
  ggplot(aes(x = Cell, y = total_counts)) +
  geom_col() +
  labs(x = "Cell",
       y = "Total cell UMI counts",
       title = "Before Normalization") +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(color = "firebrick")
  )

p_sct <- sct_cts_total %>%
  # reorder the cells by their total UMI counts for plotting
  mutate(Cell = fct_reorder(cell, total_counts)) %>%
  # make a bar plot of the total UMI counts per cell
  ggplot(aes(x = Cell, y = total_counts)) +
  geom_col() +
  labs(x = "Cell",
       y = "Total cell UMI counts",
       title = "SCTransform") +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(color = "firebrick")
  )

# combine the two plots side by side
p_raw + p_sct

We can see the effect of the sctransform normalisation, where the range of total UMIs across cells is much narrower (notice the y-axis scale).

5.6.4 Exercise: Apply normalisation to the HHD-1 sample

Repeat the normalisation procedure demonstrated above, but this time apply it to the “HHD-1” sample instead of “PBMMC-1”.

Your task is to:

  1. Subset the seurat_object to include only the “HHD-1” sample
  2. Remove genes that are not expressed in this sample
  3. Apply log-normalisation using NormalizeData()
  4. Select the top 3000 variable genes using FindVariableFeatures() with the vst method
  5. Apply sctransform normalisation, regressing out mitochondrial percentage

After completing these steps, compare the mean-variance relationship in the raw counts, log-normalised, and sctransform-normalised data for this sample.

Answer

# 1. Subset to HHD-1 sample
seurat_hhd1 <- subset(seurat_object,
                      subset = SampleName == "HHD-1")

# 2. Remove genes not expressed in this sample
hhd1_expressed_genes <- rowSums(seurat_hhd1[["RNA"]]$counts) > 0
seurat_hhd1 <- seurat_hhd1[hhd1_expressed_genes, ]

# 3. Apply log-normalisation
seurat_hhd1 <- NormalizeData(seurat_hhd1)

# 4. Select top 3000 variable genes using vst method
seurat_hhd1 <- FindVariableFeatures(
  seurat_hhd1,
  selection.method = "vst",
  nfeatures = 3000
)
VariableFeaturePlot(seurat_hhd1)

# 5. Apply sctransform normalisation
seurat_hhd1 <- SCTransform(
  seurat_hhd1,
  vars.to.regress = "percent.mt",
  verbose = FALSE
)

# Compare mean-variance relationships

# Raw counts
raw_cts_hhd1 <- seurat_hhd1[["RNA"]]$counts
gene_raw_stats_hhd1 <- tibble(gene = rownames(raw_cts_hhd1),
                              mean = rowMeans(raw_cts_hhd1),
                              variance = rowVars(raw_cts_hhd1))

# Log-normalised
lognorm_cts_hhd1 <- seurat_hhd1[["RNA"]]$data
gene_lognorm_stats_hhd1 <- tibble(gene = rownames(lognorm_cts_hhd1),
                                  mean = rowMeans(lognorm_cts_hhd1),
                                  variance = rowVars(lognorm_cts_hhd1))

# sctransform
vst_cts_hhd1 <- seurat_hhd1[["SCT"]]$scale.data
gene_vst_stats_hhd1 <- tibble(gene = rownames(vst_cts_hhd1),
                              mean = rowMeans(vst_cts_hhd1),
                              variance = rowVars(vst_cts_hhd1))

# Create comparison plots
p1 <- ggplot(gene_raw_stats_hhd1, aes(x = mean, y = variance)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "HHD-1: Raw Counts")

p2 <- ggplot(gene_lognorm_stats_hhd1, aes(x = mean, y = variance)) +
  geom_point(alpha = 0.5) +
  labs(title = "HHD-1: Log Normalized")

p3 <- ggplot(gene_vst_stats_hhd1, aes(x = mean, y = variance)) +
  geom_point(alpha = 0.5) +
  labs(title = "HHD-1: sctransform Normalized")

# combine the plots side by side
p1 + p2 + p3

5.7 Working with multiple samples

When working with multiple samples, it is recommended to run the normalisation for each sample separately, so the model learns the technical variation within each sample and removes it.

This is done by performing a so-called split of the RNA assay:

# Split the RNA assay by sample name
seurat_object[["RNA"]] <- split(seurat_object[["RNA"]],
                                f = seurat_object$SampleName)

seurat_object
## An object of class Seurat 
## 29786 features across 5500 samples within 1 assay 
## Active assay: RNA (29786 features, 0 variable features)
##  11 layers present: counts.ETV6RUNX1-1, counts.ETV6RUNX1-2, counts.ETV6RUNX1-3, counts.ETV6RUNX1-4, counts.HHD-1, counts.HHD-2, counts.PRET-1, counts.PRET-2, counts.PBMMC-1, counts.PBMMC-2, counts.PBMMC-3

The option f defines the factor by which we want to split the assay. In this case we’re using the individual samples, but you could use other variables of our choice.

Note how the RNA assay now has many more layers, one for each sample.

After this, when we run either NormalizeData() or SCTransform(), it will apply the normalisation per batch.

# Run log-normalisation on the split assay
seurat_object <- NormalizeData(seurat_object)
## Normalizing layer: counts.ETV6RUNX1-1
## Normalizing layer: counts.ETV6RUNX1-2
## Normalizing layer: counts.ETV6RUNX1-3
## Normalizing layer: counts.ETV6RUNX1-4
## Normalizing layer: counts.HHD-1
## Normalizing layer: counts.HHD-2
## Normalizing layer: counts.PRET-1
## Normalizing layer: counts.PRET-2
## Normalizing layer: counts.PBMMC-1
## Normalizing layer: counts.PBMMC-2
## Normalizing layer: counts.PBMMC-3
# Run sctransform normalisation on the split assay
seurat_object <- SCTransform(
  seurat_object,
  assay = "RNA",
  vars.to.regress = "percent.mt",
  verbose = FALSE
)

An you could then continue with the downstream analysis, such as dimensionality reduction, which will be covered next.

We will discuss more about splitting and integration in the Data Integration section.


sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8   
##  [6] LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C        
## [11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: Europe/London
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] SparseArray_1.10.1 patchwork_1.3.2    glmGamPoi_1.22.0   lubridate_1.9.4    forcats_1.0.1      stringr_1.6.0     
##  [7] dplyr_1.1.4        purrr_1.2.0        readr_2.1.5        tidyr_1.3.1        tibble_3.3.0       ggplot2_4.0.1     
## [13] tidyverse_2.0.0    sctransform_0.4.2  Seurat_5.3.1       SeuratObject_5.2.0 sp_2.2-0          
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          jsonlite_2.0.0              magrittr_2.0.4              ggbeeswarm_0.7.2           
##   [5] spatstat.utils_3.2-0        farver_2.1.2                rmarkdown_2.30              vctrs_0.6.5                
##   [9] ROCR_1.0-11                 DelayedMatrixStats_1.32.0   spatstat.explore_3.5-3      S4Arrays_1.10.0            
##  [13] htmltools_0.5.8.1           sass_0.4.10                 parallelly_1.45.1           KernSmooth_2.23-26         
##  [17] bslib_0.9.0                 htmlwidgets_1.6.4           ica_1.0-3                   plyr_1.8.9                 
##  [21] plotly_4.11.0               zoo_1.8-14                  cachem_1.1.0                igraph_2.2.1               
##  [25] mime_0.13                   lifecycle_1.0.4             pkgconfig_2.0.3             Matrix_1.7-4               
##  [29] R6_2.6.1                    fastmap_1.2.0               MatrixGenerics_1.22.0       fitdistrplus_1.2-4         
##  [33] future_1.67.0               shiny_1.11.1                digest_0.6.38               S4Vectors_0.48.0           
##  [37] rprojroot_2.1.1             tensor_1.5.1                RSpectra_0.16-2             irlba_2.3.5.1              
##  [41] GenomicRanges_1.62.0        beachmat_2.26.0             labeling_0.4.3              progressr_0.18.0           
##  [45] spatstat.sparse_3.1-0       timechange_0.3.0            httr_1.4.7                  polyclip_1.10-7            
##  [49] abind_1.4-8                 compiler_4.5.1              here_1.0.2                  bit64_4.6.0-1              
##  [53] withr_3.0.2                 S7_0.2.1                    fastDummies_1.7.5           R.utils_2.13.0             
##  [57] MASS_7.3-65                 DelayedArray_0.36.0         tools_4.5.1                 vipor_0.4.7                
##  [61] lmtest_0.9-40               otel_0.2.0                  beeswarm_0.4.0              httpuv_1.6.16              
##  [65] future.apply_1.20.0         goftest_1.2-3               R.oo_1.27.1                 glue_1.8.0                 
##  [69] nlme_3.1-168                promises_1.5.0              grid_4.5.1                  Rtsne_0.17                 
##  [73] cluster_2.1.8.1             reshape2_1.4.5              generics_0.1.4              gtable_0.3.6               
##  [77] spatstat.data_3.1-9         tzdb_0.5.0                  R.methodsS3_1.8.2           data.table_1.17.8          
##  [81] hms_1.1.4                   XVector_0.50.0              BiocGenerics_0.56.0         spatstat.geom_3.6-0        
##  [85] RcppAnnoy_0.0.22            ggrepel_0.9.6               RANN_2.6.2                  pillar_1.11.1              
##  [89] vroom_1.6.6                 spam_2.11-1                 RcppHNSW_0.6.0              later_1.4.4                
##  [93] splines_4.5.1               lattice_0.22-7              bit_4.6.0                   survival_3.8-3             
##  [97] deldir_2.0-4                tidyselect_1.2.1            miniUI_0.1.2                pbapply_1.7-4              
## [101] knitr_1.50                  gridExtra_2.3               Seqinfo_1.0.0               IRanges_2.44.0             
## [105] SummarizedExperiment_1.40.0 scattermore_1.2             stats4_4.5.1                xfun_0.54                  
## [109] Biobase_2.70.0              matrixStats_1.5.0           stringi_1.8.7               lazyeval_0.2.2             
## [113] yaml_2.3.10                 evaluate_1.0.5              codetools_0.2-20            cli_3.6.5                  
## [117] uwot_0.2.4                  xtable_1.8-4                reticulate_1.44.0           jquerylib_0.1.4            
## [121] dichromat_2.0-0.1           Rcpp_1.1.0                  globals_0.18.0              spatstat.random_3.4-2      
## [125] png_0.1-8                   ggrastr_1.0.2               spatstat.univar_3.1-4       parallel_4.5.1             
## [129] dotCall64_1.2               sparseMatrixStats_1.22.0    listenv_0.10.0              viridisLite_0.4.2          
## [133] scales_1.4.0                ggridges_0.5.7              crayon_1.5.3                rlang_1.1.6                
## [137] cowplot_1.2.0