4 Quality Control and Filtering

4.1 Libraries, parallelisation and seed setting

We start our analysis by loading the libraries we will need for this practical. We will be using Seurat for our analysis, sctransform for normalisation (coming later) and tidyverse for data manipulation and plotting.

# Load the libraries we will need for this practical
library(Seurat)
library(sctransform)
library(tidyverse)

4.1.1 Load Single Sample

Reading data into Seurat can be done in two steps:

First, we read in the data using Read10X(), which reads in the count matrix and returns a list of matrices (one for each assay).
Then we create a Seurat object using CreateSeuratObject(), which takes the count matrix and creates a Seurat object that contains the data and metadata.

# Read 10x data matrices
etv6_runx1_1_data <- Read10X(
  data.dir = "Data/CellRanger_Outputs/SRR9264343/outs/filtered_feature_bc_matrix/"
)

# Create Seurat object from the loaded matrices
etv6_runx1_1 <- CreateSeuratObject(
  counts = etv6_runx1_1_data,
  project = "ETV6_RUNX1_1"
)

4.1.2 Seurat Object

The Seurat Object is a specific data structure that is designed to hold single-cell data and associated metadata all in one place. It is essentially a list of lists with different slots for different types of data.

Seurat Object This website has a more detailed explanantion: https://biostatsquid.com/seurat-objects-explained/

The Seurat documentation is on the Satija Lab website and contains a number of useful vignettes for running different types of experiments. The essential commands page is a good reference:

https://satijalab.org/seurat/articles/essential_commands

The Seurat documentation is more focused on the practicalities of how to use the functions and is not always fully comprehensive on the theory, reasoning and workings of the individual functions. The OSCA book created for use with the Bioconductor single cell analysis packages can a good resource for these details as the workflow is essentially the same. It is a good practice to read through the documentation for each function you use in order to understand what it does and how it works.

# Look at the structure of the Seurat object
etv6_runx1_1

## An object of class Seurat 
## 38606 features across 3153 samples within 1 assay 
## Active assay: RNA (38606 features, 0 variable features)
##  1 layer present: counts

# The metadata can be accessed using [[]]
head(etv6_runx1_1[[]])

##                      orig.ident nCount_RNA nFeature_RNA
## AAACCTGAGACTTTCG-1 ETV6_RUNX1_1       8354         2935
## AAACCTGGTCTTCAAG-1 ETV6_RUNX1_1      14974         4341
## AAACCTGGTGCAACTT-1 ETV6_RUNX1_1       1566          931
## AAACCTGGTGTTGAGG-1 ETV6_RUNX1_1      10468         3636
## AAACCTGTCCCAAGTA-1 ETV6_RUNX1_1      10437         3340
## AAACCTGTCGAATGCT-1 ETV6_RUNX1_1       2453         1392

Being aware of the active assay is important when doing different types of analysis because tools will try to use the active assay by default if they can. Note that normally raw counts (the RNA assay) are used for differential expression e.g. calling markers. Conversely, normalised data is used to identify cell types e.g. in drawing UMAP plots. Here, we start by defining our default assay to the raw counts, using the DefaultAssay() function:

# Set the default assay to RNA (raw counts)
DefaultAssay(etv6_runx1_1) <- "RNA"

# Check the active assay was set correctly
DefaultAssay(etv6_runx1_1)

## [1] "RNA"

For convenience, we pull the raw counts matrix for the RNA assay out of the Seurat object and into a separate variable for exploratory analysis and QC. This is not strictly necessary but it can be easier to work with the raw counts matrix directly for some of these steps.

# Pull out the raw counts matrix for the RNA assay for exploratory analysis and QC
raw_counts <- etv6_runx1_1[["RNA"]]$counts

4.2 Properties of Single cell data

4.2.1 Number of genes detected per cell

# Calculate the number of genes detected per cell
genes_per_cell <- colSums(raw_counts > 0)

# Plot the distribution of genes detected per cell
plot(density(genes_per_cell),
     main = "",
     xlab = "Genes per cell")

4.2.2 Total UMI for a gene versus the number of times detected

If we compare the number of UMI’s assigned to an individual gene to the number of cells in which that gene is detected, we can see that highly expressed genes tend to be detected in a higher proportion of cells than lowly expressed genes.

# Calculate the total UMIs per gene
total_umis_per_gene <- rowSums(raw_counts)

# Calculate the proportion of cells expressing each gene
proportion_cells_expressing <- rowMeans(raw_counts > 0)

# Plot the relationship
plot(
  x = total_umis_per_gene,
  y = proportion_cells_expressing,
  log = "x",
  xlab = "Total UMIs per gene",
  ylab = "Proportion of cells expressing the gene"
)

## Warning in xy.coords(x, y, xlabel, ylabel, log): 11823 x values <= 0 omitted from logarithmic plot

4.2.3 Distribution of counts for a gene across cells

We could also look at the distribution of counts for individual genes across all cells. The plot below shows this distribution for the top 20 genes detected.

# Calculate the relative expression of each gene in each cell
rel_expression <- t(t(raw_counts) / colSums(raw_counts)) * 100

# Get the 20 most expressed genes
most_expressed <- sort(rowSums(rel_expression), decreasing = TRUE)[20:1]

# Plot the distribution of relative expression for the top 20 genes
plot_data <- as.matrix(t(rel_expression[names(most_expressed), ]))

boxplot(plot_data, cex = 0.1, las = 1,
        xlab = "% total count per cell",
        horizontal = TRUE)

4.3 Quality Control

The cell calling performed by CellRanger does not always retain only droplets containing cells. Poor-quality cells, or rather droplets, may be caused by cell damage during dissociation or failed library preparation. They usually have low UMI counts, few genes detected and/or high mitochondrial content. The presence of these droplets in the data set may affect normalisation, assessment of cell population heterogeneity, and clustering:

Normalisation: Contaminating genes, ‘the ambient RNA’, are detected at low levels in all libraries. In low quality libraries with low RNA content, scaling will increase counts for these genes more than for better-quality cells, resulting in their apparent upregulation in these cells and increased variance overall.
Cell population heterogeneity: variance estimation and dimensionality reduction with PCA where the first principal component will be correlated with library size, rather than biology.
Clustering: higher mitochondrial and/or nuclear RNA content may cause low-quality cells to cluster separately or form states or trajectories between distinct cell types.

In order to remove or reduce the impact of poor-quality droplets on our downstream analysis we will attempt to filter them out using some QC metrics. The three principle means of doing this are to apply thresholds for inclusion on three characteristics:

The library size defined as the total sum of UMI counts across all genes; cells with small library sizes are considered to be of low quality as the RNA has not been efficiently captured, i.e. converted into cDNA and amplified, during library preparation.
The number of expressed genes in each cell defined as the number of genes with non-zero counts for that cell; any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured.
The proportion of UMIs mapped to genes in the mitochondrial genome; high proportions are indicative of poor-quality cells, possibly because of loss of cytoplasmic RNA from perforated cells (the reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane).

4.3.1 Detected Genes

Another simple metric that we can quickly calculate is the number of genes detected in the whole dataset (i.e. those that are true for this statement):

# How many genes are detected in the whole dataset?
table(rowSums(raw_counts) > 0)

## 
## FALSE  TRUE 
## 11823 26783

4.3.2 Mitochondrial Gene Content

To calculate the percentage of mitochondrial genes, we first need to identify which genes are mitochondrial. To get these numbers we can take advantage of the fact that (in human) all mitochondrial gene names start with the ‘MT-’ prefix. This is probably not the case for your non-human species of interest, although for mouse it is usually ‘Mt’. The [[ operator can add columns to object metadata. This is a good place to store QC information.

# Identify mitochondrial genes
mito_genes <- grep(pattern = "^MT-",
                   x = rownames(raw_counts), value = TRUE)
length(mito_genes)

## [1] 13

# Calculate the percentage of UMIs mapped to mitochondrial genes and add this as a column to the metadata
etv6_runx1_1[["percent.mt"]] <- PercentageFeatureSet(etv6_runx1_1,
                                                     pattern = "^MT-")

# Check the metadata to see the new column
head(etv6_runx1_1[[]])

##                      orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACCTGAGACTTTCG-1 ETV6_RUNX1_1       8354         2935   3.507302
## AAACCTGGTCTTCAAG-1 ETV6_RUNX1_1      14974         4341   3.806598
## AAACCTGGTGCAACTT-1 ETV6_RUNX1_1       1566          931  27.522350
## AAACCTGGTGTTGAGG-1 ETV6_RUNX1_1      10468         3636   4.088651
## AAACCTGTCCCAAGTA-1 ETV6_RUNX1_1      10437         3340   5.049344
## AAACCTGTCGAATGCT-1 ETV6_RUNX1_1       2453         1392   3.668977

4.3.3 Visualize QC Metrics

We can visualize some of these QC metrics using VlnPlot(), which produces violin plots. Here we will look at the number of features (genes), number of counts (UMIs) and percentage of mitochondrial genes.

# Distribution of number of genes detected per cell
VlnPlot(etv6_runx1_1, features = c("nFeature_RNA"), layer = "counts")

# Distribution of total UMI counts per cell
VlnPlot(etv6_runx1_1, features = c("nCount_RNA"), layer = "counts")

# Distribution of percentage of mitochondrial genes per cell
VlnPlot(etv6_runx1_1, features = c("percent.mt"), layer = "counts")

# Distribution of all three QC metrics together
VlnPlot(etv6_runx1_1,
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
        ncol = 3, layer = "counts")

4.4 Loading Multiple Samples

If you have multiple samples to load, the Read10X function can take a named list of directories as input, and it will read in each sample and combine them into a single expression matrix with sample-specific cell barcodes. Each cell barcode is modified with the sample name to ensure uniqueness in your dataset.

The function will make a column in the metadata called ‘orig.ident’ which contains the sample name for each cell. However it obtains this programmatically from the named vector. The default behaviour assumes names are sample_otherinfo. The separator can be set with names.delim. Warning: If your names do not follow this convention you may need to set this argument or you will end up with incomplete or nonsense samples in your object.

# Read our sample information sheet
sampleinfo <- read_tsv("Data/sample_sheet.tsv")

## Rows: 11 Columns: 3
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): Sample, SampleName, SampleGroup
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Create a named vector of directories for each sample
# We work with a subset of samples for demonstration purposes
samples <- sampleinfo$Sample[c(1, 5, 7, 9)]
list_of_files <- file.path("Data/CellRanger_Outputs/",
                           samples,
                           "/outs/filtered_feature_bc_matrix")
names(list_of_files) <- sampleinfo$SampleName[c(1, 5, 7, 9)]

# Read in the cellranger matrices for all samples
expression_matrix <- Read10X(data.dir = list_of_files)

# Create a new seurat object with the combined expression matrix
multi_seurat_object <- CreateSeuratObject(counts = expression_matrix)
multi_seurat_object

## An object of class Seurat 
## 38606 features across 14826 samples within 1 assay 
## Active assay: RNA (38606 features, 0 variable features)
##  1 layer present: counts

4.4.1 Adding Sample Metadata

Now we have our combined Seurat object we should add in some helpful metadata like which sample group or condition these samples belong to. Since we need the cell level information to do this the easiest way is to use the orig.ident column of the metadata which is where the sample name was stored when the object was made and then pass this back to the Seurat object metadata. Our sample names are structured in a way that makes this easy.

# Pull out the metadata and add sample group and sample name information
temp_metadata <- multi_seurat_object[[]] %>%
  # Add the cell barcodes as a column so we can pull them back in at the end
  rownames_to_column("Cell") %>%
  # Extract the sample group information from the origin identifier
  # The pattern "-.*" matches the first dash and everything after it
  mutate(SampleGroup = str_remove(orig.ident, "-.*")) %>% 
  # Add the sample name as a column
  mutate(SampleName = orig.ident) %>%
  # Put the cell barcodes back as rownames
  column_to_rownames("Cell")

# Add the modified metadata back to the Seurat object
multi_seurat_object[[]] <- temp_metadata

4.4.2 Remove undetected genes

The first thing we should do is filter out genes that haven’t been detected in our experiment. This will reduce the size of the Seurat Object and make processing a little faster. We can do this using the rowSums() function to calculate the total counts for each gene across all cells and all samples, and then filter out genes with zero counts.

# Get the raw counts to work with
multi_raw_counts <- multi_seurat_object[["RNA"]]$counts

# How many genes are detected in the whole dataset?
table(rowSums(multi_raw_counts) > 0)

## 
## FALSE  TRUE 
##  6853 31753

# Filter out genes that are not detected in any cell
filtered_multi_seurat_object <- subset(
  multi_seurat_object,
  features = rownames(multi_seurat_object)[rowSums(multi_raw_counts) > 0]
)

# Check how many genes are detected after filtering
filtered_multi_seurat_object

## An object of class Seurat 
## 31753 features across 14826 samples within 1 assay 
## Active assay: RNA (31753 features, 0 variable features)
##  1 layer present: counts

# Confirm all of these now have at least 1 count across all cells
filt_raw_counts <- filtered_multi_seurat_object[["RNA"]]$counts
table(rowSums(filt_raw_counts) > 0)

## 
##  TRUE 
## 31753

We don’t actually need to do this explicitly. CreateSeuratObject has an argument called min.cells. The documentation states: “Include features detected in at least this many cells”. so by changing this to 1 (default is 0) when we create the object at the start we don’t need to worry about this step.

4.4.3 Visualize QC Metrics for Multiple Samples

We can visualize QC metrics for the merged samples using VlnPlot() again:

# Add the percentage of mitochondrial genes to the metadata
filtered_multi_seurat_object[["percent.mt"]] <-
  PercentageFeatureSet(filtered_multi_seurat_object, pattern = "^MT-")

# Plot the distribution of QC metrics for the merged samples
VlnPlot(filtered_multi_seurat_object,
  features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
  ncol = 3,
  layer = "counts"
)

The default plotting sytle of Seurat overlays the points on the violin. Sometimes this can obscure the distribution. To turn this off, we can set the pt.size parameter to 0:

VlnPlot(
  filtered_multi_seurat_object,
  features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
  ncol = 3,
  layer = "counts",
  pt.size = 0
)

4.4.4 Filtering out low quality droplets

One could use a hard threshold for the library size, number of genes detected and mitochondrial content based on the distributions seen above. These would need vary across runs and the decision making process is somewhat arbitrary. For example, imagine if we decided all cells should have at least 500 genes detected. This example also shows how you can use the Seurat ggplot nature to add a line to indicate our thresholds.

# Plot where we would set a threshold
VlnPlot(
  filtered_multi_seurat_object,
  features = "nFeature_RNA",
  group.by = "SampleGroup",
  layer = "counts",
  pt.size = 0
) +
  geom_hline(yintercept = 500, color = "red")

It may therefore be preferable to rely on outlier detection to identify cells that markedly differ from most cells.

We saw above that the distribution of the QC metrics is close to Normal. Hence, we can detect outliers using the median and the median absolute deviation (MAD) from the median (not the mean and the standard deviation which both are sensitive to outliers).

For a given metric, an outlier value is one that lies over some number of MADs away from the median. A cell will be excluded if it is an outlier in the part of the range to avoid, for example low gene counts, or high mitochondrial content. For a normal distribution, a threshold defined with a distance of 3 MADs from the median retains about 99% of values. For most experiments 2 or 3 MADs is a reasonable starting point.

The samples analysed here may have been processed in different batches leading to differences in the overall distribution of UMI counts, numbers of genes detected and mitochondrial content. Such differences would affect the adaptive thresholds discussed above - that is, as the distributions of the metrics differ, so we need to decide if threshold should be per batch, per sample or across all samples.

Because of the different layers, we will need to break down the object to apply different thresholds by group or sample. The easiest way to do this is to work with the metadata and use the cell barcodes for filtering with Seurat::subset().

Before generalising this filtering across all samples, let’s look at a step-by-step example for one sample.

# Pull out the metadata for easier manipulation
metadata <- filtered_multi_seurat_object[[]]

#### Example of how to filter for one sample ####

# Subset the metadata for one of the samples
e_cells <- metadata %>%
  filter(SampleName == "ETV6RUNX1-1")

# calculate the median and MAD for the number of genes detected in this sample
feature_mad <- mad(e_cells$nFeature_RNA)
feature_median <- median(e_cells$nFeature_RNA)

# set a threshold for the number of genes detected as 2 MADs below the median
feature_threshold <- feature_median - 2 * feature_mad

# How many cells would this remove?
length(e_cells$nFeature_RNA) # total cells

## [1] 3153

sum(e_cells$nFeature_RNA > feature_threshold) # cells above threshold

## [1] 2940

# The same process can be repeated for the other metrics
umi_threshold <- median(e_cells$nCount_RNA) - 2 * mad(e_cells$nCount_RNA)
sum(e_cells$nCount_RNA > umi_threshold)

## [1] 3153

# for mitochondrial content we want to set a threshold above the median
mt_threshold <- median(e_cells$percent.mt) + 2 * mad(e_cells$percent.mt)
sum(e_cells$percent.mt < mt_threshold)

## [1] 2777

# visualise the thresholds on the plot
ggplot(e_cells,
       aes(x = nFeature_RNA,
           y = nCount_RNA,
           color = percent.mt)) +
  geom_point() +
  geom_vline(xintercept = feature_threshold, color = "red") +
  geom_hline(yintercept = umi_threshold, color = "blue") +
  scale_color_viridis_c() +
  theme_minimal() +
  labs(
    title = "QC Metrics for ETV6-RUNX1_1",
    x = "Number of Features (Genes) Detected",
    y = "Number of Counts (UMIs)",
    color = "Percent Mitochondrial"
  )

# We cound apply all 3 filters together to get a vector of cell
# barcodes to keep for this sample
# using standard dplyr filtering and pulling out the cell barcodes
e_keep_cells <- metadata %>%
  rownames_to_column("Cell") %>%
  filter(SampleName == "ETV6RUNX1-1") %>%
  filter(nFeature_RNA > feature_threshold,
         nCount_RNA > umi_threshold,
         percent.mt < mt_threshold) %>%
  pull(Cell)

# check how many cells would be left for this sample
length(e_keep_cells)

## [1] 2715

So, this would leave us with 2715 cells for this sample.

We can generalise this process by using the group_by() function from dplyr, which will allow us to define median/MAD thresholds per sample. Here is the combined code to do this for all samples, which you can use as a template for your own filtering:

#### Apply filtering across all samples ####

all_keep_cells <- metadata %>%
  # Add the cell barcodes as a column so we can pull them back in at the end
  rownames_to_column("cell") %>%
  # Group by sample name to calculate thresholds per sample
  group_by(SampleName) %>%
  # Apply the filtering for each sample
  filter(nFeature_RNA > (median(nFeature_RNA) - 2 * mad(nFeature_RNA)),
         nCount_RNA > (median(nCount_RNA) - 2 * mad(nCount_RNA)),
         percent.mt < (median(percent.mt) + 2 * mad(percent.mt))) %>%
  # Pull out the cell barcodes for the cells that pass the filters
  pull(cell)

# How many cells would be left after filtering?
length(all_keep_cells)

## [1] 11832

# How many cells per sample
table(metadata[all_keep_cells, "SampleName"])

## 
## ETV6RUNX1-1       HHD-1     PBMMC-1      PRET-1 
##        2715        4746         835        3536

We then use this vector of cell barcodes to subset our Seurat object to retain only the high-quality cells.

qc_seurat_object <- subset(
  filtered_multi_seurat_object,
  cells = all_keep_cells
)

VlnPlot(qc_seurat_object,
  features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
  ncol = 3,
  layer = "counts",
  pt.size = 0
)

4.5 Exercises

In the demonstration above, we performed QC and filtering of 4 of our samples, one from each of our sample groups. For this practical we would like you to perform QC and filtering on the full dataset.

4.5.1 Load the data

In order to load the CellRanger data for all of the samples, you will first need to create a named vector of the paths to the filtered count matrix folders called list_of_files and then use this in the Read10X() function.

# Read in the sample information sheet
sampleinfo <- read_tsv("Data/sample_sheet.tsv")

# Create a named vector of directories for each sample
# YOUR CODE HERE

# Read in the cellranger matrices for all samples
# YOUR CODE HERE

# Create the seurat object
# YOUR CODE HERE

Hint

The paths to filtered_feature_bc_matrix directories for each sample can be constructed using the Sample column as:

Data/CellRanger_Outputs/Sample/outs/filtered_feature_bc_matrix

You will need to use a function such as str_c(), paste() or file.path() to construct the paths.

The names of the vector will determine the identifer used in the cell barcodes, this should be the sample name as in the SampleName column of the sample sheet.

As Seurat V5 no longer supports parallelisation with future this step may take a couple of minutes.

Answer

# Create a named vector of directories for each sample
list_of_files <- file.path("Data/CellRanger_Outputs/",
                           sampleinfo$Sample,
                           "outs/filtered_feature_bc_matrix")
names(list_of_files) <- sampleinfo$SampleName
list_of_files

##                                                           ETV6RUNX1-1 
## "Data/CellRanger_Outputs//SRR9264343/outs/filtered_feature_bc_matrix" 
##                                                           ETV6RUNX1-2 
## "Data/CellRanger_Outputs//SRR9264344/outs/filtered_feature_bc_matrix" 
##                                                           ETV6RUNX1-3 
## "Data/CellRanger_Outputs//SRR9264345/outs/filtered_feature_bc_matrix" 
##                                                           ETV6RUNX1-4 
## "Data/CellRanger_Outputs//SRR9264346/outs/filtered_feature_bc_matrix" 
##                                                                 HHD-1 
## "Data/CellRanger_Outputs//SRR9264347/outs/filtered_feature_bc_matrix" 
##                                                                 HHD-2 
## "Data/CellRanger_Outputs//SRR9264348/outs/filtered_feature_bc_matrix" 
##                                                                PRET-1 
## "Data/CellRanger_Outputs//SRR9264349/outs/filtered_feature_bc_matrix" 
##                                                                PRET-2 
## "Data/CellRanger_Outputs//SRR9264350/outs/filtered_feature_bc_matrix" 
##                                                               PBMMC-1 
## "Data/CellRanger_Outputs//SRR9264351/outs/filtered_feature_bc_matrix" 
##                                                               PBMMC-2 
## "Data/CellRanger_Outputs//SRR9264353/outs/filtered_feature_bc_matrix" 
##                                                               PBMMC-3 
## "Data/CellRanger_Outputs//SRR9264354/outs/filtered_feature_bc_matrix"

# Read in the cellranger matrices for all samples
expression_matrix <- Read10X(data.dir = list_of_files)

# Create the seurat object
seurat_object <- CreateSeuratObject(counts = expression_matrix, min.cells = 1)
seurat_object

## An object of class Seurat 
## 33869 features across 53476 samples within 1 assay 
## Active assay: RNA (33869 features, 0 variable features)
##  1 layer present: counts

Once you import the objects, make sure to adjust the metadata as we did in the demonstration, to include the sample name and sample group obtained from the orig.ident column.

# Sample group and sample name information to the metadata
temp_metadata <- seurat_object[[]] %>%
  rownames_to_column("Cell") %>%
  mutate(SampleGroup = str_remove(orig.ident, "-.*")) %>% 
  mutate(SampleName = orig.ident) %>%
  column_to_rownames("Cell")

seurat_object[[]] <- temp_metadata
head(seurat_object[[]])

##                                 orig.ident nCount_RNA nFeature_RNA SampleGroup  SampleName
## ETV6RUNX1-1_AAACCTGAGACTTTCG-1 ETV6RUNX1-1       8354         2935   ETV6RUNX1 ETV6RUNX1-1
## ETV6RUNX1-1_AAACCTGGTCTTCAAG-1 ETV6RUNX1-1      14974         4341   ETV6RUNX1 ETV6RUNX1-1
## ETV6RUNX1-1_AAACCTGGTGCAACTT-1 ETV6RUNX1-1       1566          931   ETV6RUNX1 ETV6RUNX1-1
## ETV6RUNX1-1_AAACCTGGTGTTGAGG-1 ETV6RUNX1-1      10468         3636   ETV6RUNX1 ETV6RUNX1-1
## ETV6RUNX1-1_AAACCTGTCCCAAGTA-1 ETV6RUNX1-1      10437         3340   ETV6RUNX1 ETV6RUNX1-1
## ETV6RUNX1-1_AAACCTGTCGAATGCT-1 ETV6RUNX1-1       2453         1392   ETV6RUNX1 ETV6RUNX1-1

Also add the percentage of UMIs mapped to mitochondrial genes to the metadata:

# Add percentage of UMIs mapped to mitochondrial genes to the metadata
seurat_object[["percent.mt"]] <- PercentageFeatureSet(seurat_object,
                                                      pattern = "^MT-")

4.5.2 Explore QC metric distribution

Before moving on to do the actual cell filtering, it is always a good idea to explore the distribution of the metrics across the droplets.

Use VlnPlot() to generate plots showing the distributions of the total number of UMIs, the number of genes detected and percentage of UMIs aligned to mitochondrial genes across all cells for each sample.

This time we will use the cols argument and a vector of colours to make it easier to tell each sample group apart.

The code for plotting the total number of UMIs is shown below. You will also need to plot the the number of genes detected and percentage of UMIs aligned to mitochondrial.

# Create a vector of colours of each sample group
colours <- c("cyan3", "cyan3", "cyan3", "cyan3",
             "darkgoldenrod1", "darkgoldenrod1",
             "blue", "blue", "blue",
             "lightgreen", "lightgreen")

# Plot the distribution of total UMI counts per cell
VlnPlot(seurat_object, 
        features = c("nCount_RNA"),
        cols = colours,
        layer = "counts",
        pt.size = 0) +
  ggtitle("Total count of UMIs")

# Plot the distribution of number of genes detected per cell
# YOUR CODE HERE

# Plot the distribution of percentage of UMIs aligned to mitochondrial genes
# YOUR CODE HERE

Answer

# Plot the distribution of number of genes detected per cell
VlnPlot(seurat_object, 
        features = c("nFeature_RNA"),
        cols = colours,
        layer = "counts",
        pt.size = 0) +
  ggtitle("Detected Features")

# Plot the distribution of percentage of UMIs aligned to mitochondrial genes per cell
VlnPlot(seurat_object, 
        features = c("percent.mt"), 
        cols = colours,
        layer = "counts", 
        pt.size = 0) +
  ggtitle("Mitochondrial Percentage")

4.5.3 Identification of low-quality cells with adaptive thresholds

With these samples we have three possible “batch” levels at which we could run adaptive filtering. We could apply the filtering across all samples together (i.e. no batch), we could apply it by sample group (ETV6-RUNX1, HHD, PBMMC, PRE-T), or we could apply it per sample.

How many cells will be removed from the data set if we used sample group?
And do you think this is a reasonable method to choose?
How would you decide?

# Extract the metadata for filtering
metadata <- seurat_object[[]]

# Apply the filtering at the SampleGroup level
# YOUR CODE HERE

# Number of cells that would be removed
# YOUR CODE HERE

Hint

Use the code from the demonstration to set thresholds at a SampleGroup level rather than for every sample. Change the filtering of the metadata to get only the cells you want.

Answer

# Extract the metadata for filtering
metadata <- seurat_object[[]]

# Apply the filtering at the SampleGroup level
sample_group_keep_cells <- metadata %>%
  rownames_to_column("Cell") %>%
  group_by(SampleGroup) %>%
  filter(nFeature_RNA > (median(nFeature_RNA) - 2 * mad(nFeature_RNA)),
         nCount_RNA > (median(nCount_RNA) - 2 * mad(nCount_RNA)),
         percent.mt < (median(percent.mt) + 2 * mad(percent.mt))) %>%
  ungroup() %>%
  pull(Cell)

# Number of cells that would be removed
nrow(metadata) - length(sample_group_keep_cells)

## [1] 6013

In total 6013 cells will be removed from the dataset.

From our violin plots, it is clear that there are different distribution profiles between samples in the same group. Noticeably the distribution profiles for the two HDD samples are quite different. For this reason it may be prudent to apply the adaptive filtering to each sample independently.

We could do so, by creating a vector of cells to keep for each sample instead. The only change in the code is how we group the data frame before applying the filter, we group by SampleName instead of SampleGroup:

# Apply the filtering at the SampleName level
sample_name_keep_cells <- metadata %>%
  rownames_to_column("Cell") %>%
  group_by(SampleName) %>%
  filter(nFeature_RNA > (median(nFeature_RNA) - 2 * mad(nFeature_RNA)),
         nCount_RNA > (median(nCount_RNA) - 2 * mad(nCount_RNA)),
         percent.mt < (median(percent.mt) + 2 * mad(percent.mt))) %>%
  ungroup() %>%
  pull(Cell)

# Number of cells that would be removed
nrow(metadata) - length(sample_name_keep_cells)

## [1] 6302

Which of these filters should we choose?

Making some visualisations of our data may help us decide. First, let’s add this information back to the metadata so we can use it for plotting.

new_metadata <- metadata %>%
  rownames_to_column("Cell") %>%
  mutate(sample_name_keep = Cell %in% sample_name_keep_cells,
         sample_group_keep = Cell %in% sample_group_keep_cells) %>%
  column_to_rownames("Cell")

# We can optionally add this metadata to our object
seurat_object[[]] <- new_metadata

We can make a quick tally of how many cells are kept by each method for each sample:

# Tally of how many cells are kept by each method for each sample
table(new_metadata$sample_name_keep, new_metadata$sample_group_keep)

##        
##         FALSE  TRUE
##   FALSE  4986  1316
##   TRUE   1027 46147

We can see that in general most cells would be kept by either filter. However, there are some differences on cells that would pass one way of filtering but not the other.

We can make different visualisations of our QC filters using the new metadata. In these examples we use standard ggplot code, with the ggbeeswarm::geom_quasirandom() function to make violin-like plots that display individual data points. This is so we can see more clearly which cells are being filtered out by each method.

# We sort the table by the filter so that the cells that are filtered out appear on top
new_metadata %>%
  arrange(desc(sample_group_keep)) %>%
  ggplot(aes(x = SampleName, y = nCount_RNA, 
             colour = sample_group_keep)) +
  ggbeeswarm::geom_quasirandom(size = 0.5) +
  ggtitle("Total count of UMIs")

new_metadata %>%
  arrange(desc(sample_name_keep)) %>%
  ggplot(aes(x = SampleName, y = nCount_RNA, 
             colour = sample_name_keep)) +
  ggbeeswarm::geom_quasirandom(size = 0.5) +
  ggtitle("Total count of UMIs")

You could also do the same for other metrics.

Once we are happy with our filtering, we can remove the cells from the Seurat object.

filtered_seurat_object <- subset(seurat_object, 
                                 cells = sample_name_keep_cells)

sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8   
##  [6] LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C        
## [11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: Europe/London
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.4    forcats_1.0.1      stringr_1.6.0      dplyr_1.1.4        purrr_1.2.0        readr_2.1.5       
##  [7] tidyr_1.3.1        tibble_3.3.0       ggplot2_4.0.1      tidyverse_2.0.0    sctransform_0.4.2  Seurat_5.3.1      
## [13] SeuratObject_5.2.0 sp_2.2-0          
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3     jsonlite_2.0.0         magrittr_2.0.4         ggbeeswarm_0.7.2       spatstat.utils_3.2-0  
##   [6] farver_2.1.2           rmarkdown_2.30         vctrs_0.6.5            ROCR_1.0-11            spatstat.explore_3.5-3
##  [11] htmltools_0.5.8.1      sass_0.4.10            parallelly_1.45.1      KernSmooth_2.23-26     bslib_0.9.0           
##  [16] htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9             plotly_4.11.0          zoo_1.8-14            
##  [21] cachem_1.1.0           igraph_2.2.1           mime_0.13              lifecycle_1.0.4        pkgconfig_2.0.3       
##  [26] Matrix_1.7-4           R6_2.6.1               fastmap_1.2.0          fitdistrplus_1.2-4     future_1.67.0         
##  [31] shiny_1.11.1           digest_0.6.38          patchwork_1.3.2        rprojroot_2.1.1        tensor_1.5.1          
##  [36] RSpectra_0.16-2        irlba_2.3.5.1          labeling_0.4.3         progressr_0.18.0       spatstat.sparse_3.1-0 
##  [41] timechange_0.3.0       httr_1.4.7             polyclip_1.10-7        abind_1.4-8            compiler_4.5.1        
##  [46] here_1.0.2             bit64_4.6.0-1          withr_3.0.2            S7_0.2.1               fastDummies_1.7.5     
##  [51] R.utils_2.13.0         MASS_7.3-65            tools_4.5.1            vipor_0.4.7            lmtest_0.9-40         
##  [56] otel_0.2.0             beeswarm_0.4.0         httpuv_1.6.16          future.apply_1.20.0    goftest_1.2-3         
##  [61] R.oo_1.27.1            glue_1.8.0             nlme_3.1-168           promises_1.5.0         grid_4.5.1            
##  [66] Rtsne_0.17             cluster_2.1.8.1        reshape2_1.4.5         generics_0.1.4         gtable_0.3.6          
##  [71] spatstat.data_3.1-9    tzdb_0.5.0             R.methodsS3_1.8.2      data.table_1.17.8      hms_1.1.4             
##  [76] spatstat.geom_3.6-0    RcppAnnoy_0.0.22       ggrepel_0.9.6          RANN_2.6.2             pillar_1.11.1         
##  [81] vroom_1.6.6            spam_2.11-1            RcppHNSW_0.6.0         later_1.4.4            splines_4.5.1         
##  [86] lattice_0.22-7         bit_4.6.0              survival_3.8-3         deldir_2.0-4           tidyselect_1.2.1      
##  [91] miniUI_0.1.2           pbapply_1.7-4          knitr_1.50             gridExtra_2.3          scattermore_1.2       
##  [96] xfun_0.54              matrixStats_1.5.0      stringi_1.8.7          lazyeval_0.2.2         yaml_2.3.10           
## [101] evaluate_1.0.5         codetools_0.2-20       cli_3.6.5              uwot_0.2.4             xtable_1.8-4          
## [106] reticulate_1.44.0      jquerylib_0.1.4        dichromat_2.0-0.1      Rcpp_1.1.0             globals_0.18.0        
## [111] spatstat.random_3.4-2  png_0.1-8              ggrastr_1.0.2          spatstat.univar_3.1-4  parallel_4.5.1        
## [116] dotCall64_1.2          listenv_0.10.0         viridisLite_0.4.2      scales_1.4.0           ggridges_0.5.7        
## [121] crayon_1.5.3           rlang_1.1.6            cowplot_1.2.0

Introduction to single-cell RNA-seq analysis

4 Quality Control and Filtering

4.1 Libraries, parallelisation and seed setting

4.1.1 Load Single Sample

4.1.2 Seurat Object

4.2 Properties of Single cell data

4.2.1 Number of genes detected per cell

4.2.2 Total UMI for a gene versus the number of times detected

4.2.3 Distribution of counts for a gene across cells

4.3 Quality Control

4.3.1 Detected Genes

4.3.2 Mitochondrial Gene Content

4.3.3 Visualize QC Metrics

4.4 Loading Multiple Samples

4.4.1 Adding Sample Metadata

4.4.2 Remove undetected genes

4.4.3 Visualize QC Metrics for Multiple Samples

4.4.4 Filtering out low quality droplets

4.5 Exercises

4.5.1 Load the data

4.5.2 Explore QC metric distribution

4.5.3 Identification of low-quality cells with adaptive thresholds