Now that the read alignment and quantification have been carried out using cellranger (v7.0.0), we will begin by assessing the data quality and applying further filtering of cells to remove poor quality data.
library(DropletUtils)
library(scater)
library(ensembldb)
library(AnnotationHub)
library(BiocParallel)
library(tidyverse)
library(patchwork)
library(ggvenn)
We have already prepared a sample meta data table that relates the SRA sample ID to the sample group. The table below shows details of the twelve samples that we will be working with.
samplesheet <- read_tsv("Data/sample_sheet.tsv")
scRNAseq data sets tend to be very large and processing them requires a large amount of computing power and can take time. Many of the commands we will use have the option to be run in parallel across multiple processors. By default they will only use a single processor, but parallelisation will greatly speed up the analysis.
We need to first set up some parallel parameters using the package BiocParallel
.
bp.params <- MulticoreParam(workers = 7)
Here were are selecting to use forked processes with
MulticoreParam
and instructing the function to use 7 cores
(our machines have 8, this leaves 1 to run the desktop etc.). Note that
on Windows MulticoreParam
does not work and it is necessary
to use SnowParam
- please refer to the
BiocParallel
vignettes for further information.
We will load the data for the SRR9264343. To load
the data from the CellRanger outputs, we need to use the function
read10xCounts
from the DropletUtils
package.
We pass the function the location of the directory containing the counts
matrix, cell barcodes and features (genes).
We could load the raw data, which includes counts for all
cell barcodes detected in the sample, and use the
emptyDrops
function in DropletUtils to call cells, however,
CellRanger has already called cells and so we are going to work with the
filtered matrix, which only contains droplets called as cells
by CellRanger.
sample.path <- "Data/CellRanger_Outputs/SRR9264343/outs/filtered_feature_bc_matrix/"
sce.sing <- read10xCounts(sample.path, col.names=TRUE, BPPARAM = bp.params)
sce.sing
## class: SingleCellExperiment
## dim: 37764 3153
## metadata(1): Samples
## assays(1): counts
## rownames(37764): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
## ENSG00000277196
## rowData names(3): ID Symbol Type
## colnames(3153): AAACCTGAGACTTTCG-1 AAACCTGGTCTTCAAG-1 ...
## TTTGTCACAGGCTCAC-1 TTTGTCAGTTCGGCAC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
SingleCellExperiment
objectThe data have been loaded as a SingleCellExperiment object. The details of the structure of the object are described here. In summary, it stores various data types in a single object. Currently it will contain:
Later we will also add the outcomes of downstream analysis such as normalisation and dimensionality reduction.
Compared to bulk RNA-seq, Single-cell RNA-seq data is sparse, i.e. there many missing values or zeroes. This is particularly true with droplet-based methods such as 10X, mostly because:
We can access the counts matrix with counts
. Given the
large number of droplets in a sample, count matrices can be large.
dim(counts(sce.sing))
## [1] 37764 3153
They are however very sparse, that is, most of the entries are 0’s.
To save memory the counts can be stored in a ‘sparse matrix’ that only
stores non-zero values, in this case as a dgCMatrix
object.
counts(sce.sing)[1:10, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##
## ENSG00000243485 . . . . . . . . . .
## ENSG00000237613 . . . . . . . . . .
## ENSG00000186092 . . . . . . . . . .
## ENSG00000238009 . . . . . . . . . .
## ENSG00000239945 . . . . . . . . . .
## ENSG00000239906 . . . . . . . . . .
## ENSG00000241860 . . . . . . . . . .
## ENSG00000241599 . . . . . . . . . .
## ENSG00000286448 . . . . . . . . . .
## ENSG00000236601 . . . . . . . . . .
Details about the “features” (in this case genes) can by accessed
using the rowData
function. Currently it contains the
ensembl gene ID and the gene symbol, which have been derived from the
10x reference used by CellRanger. It also contains a “Type” column,
which tells us what sort of data we are looking at; in this case it is
“Gene Expression”. If we wish to, we can add further annotation to the
features by adding extra columns to this data frame.
rowData(sce.sing)
## DataFrame with 37764 rows and 3 columns
## ID Symbol Type
## <character> <character> <character>
## ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression
## ENSG00000237613 ENSG00000237613 FAM138A Gene Expression
## ENSG00000186092 ENSG00000186092 OR4F5 Gene Expression
## ENSG00000238009 ENSG00000238009 ENSG00000238009 Gene Expression
## ENSG00000239945 ENSG00000239945 ENSG00000239945 Gene Expression
## ... ... ... ...
## ENSG00000277836 ENSG00000277836 ENSG00000277836 Gene Expression
## ENSG00000278633 ENSG00000278633 ENSG00000278633 Gene Expression
## ENSG00000276017 ENSG00000276017 ENSG00000276017 Gene Expression
## ENSG00000278817 ENSG00000278817 ENSG00000278817 Gene Expression
## ENSG00000277196 ENSG00000277196 ENSG00000277196 Gene Expression
The rows of this table correspond to the rows of the count matrix; the row names of this table will match the row names of the counts matrix - currently these are the Ensembl IDS:
rownames(counts(sce.sing))[1:6]
## [1] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009"
## [5] "ENSG00000239945" "ENSG00000239906"
Details about the droplets can be accessed using
colData
. Currently it contains the sample names and droplet
Barcodes. As with the feature data, we can add additional information
about each droplet, e.g. counts of genes or the percentage of
mitochondrial genes, to this data frame. The rows of this table
correspond to the data in the columns of the count matrix; the row names
of this table will match the column names of the counts matrix -
currently these are the droplet barcodes.
colData(sce.sing)
## DataFrame with 3153 rows and 2 columns
## Sample Barcode
## <character> <character>
## AAACCTGAGACTTTCG-1 Data/CellRanger_Outp.. AAACCTGAGACTTTCG-1
## AAACCTGGTCTTCAAG-1 Data/CellRanger_Outp.. AAACCTGGTCTTCAAG-1
## AAACCTGGTGCAACTT-1 Data/CellRanger_Outp.. AAACCTGGTGCAACTT-1
## AAACCTGGTGTTGAGG-1 Data/CellRanger_Outp.. AAACCTGGTGTTGAGG-1
## AAACCTGTCCCAAGTA-1 Data/CellRanger_Outp.. AAACCTGTCCCAAGTA-1
## ... ... ...
## TTTGGTTTCTTTAGGG-1 Data/CellRanger_Outp.. TTTGGTTTCTTTAGGG-1
## TTTGTCAAGAAACGAG-1 Data/CellRanger_Outp.. TTTGTCAAGAAACGAG-1
## TTTGTCAAGGACGAAA-1 Data/CellRanger_Outp.. TTTGTCAAGGACGAAA-1
## TTTGTCACAGGCTCAC-1 Data/CellRanger_Outp.. TTTGTCACAGGCTCAC-1
## TTTGTCAGTTCGGCAC-1 Data/CellRanger_Outp.. TTTGTCAGTTCGGCAC-1
The rows of this table correspond to the data in the columns of the count matrix; the row names of this table will match the column names of the counts matrix - currently these are the droplet barcodes:
colnames(counts(sce.sing))[1:6]
## [1] "AAACCTGAGACTTTCG-1" "AAACCTGGTCTTCAAG-1" "AAACCTGGTGCAACTT-1"
## [4] "AAACCTGGTGTTGAGG-1" "AAACCTGTCCCAAGTA-1" "AAACCTGTCGAATGCT-1"
The number and identity of genes detected in a cell varies greatly across cells: the total number of genes detected across all cells is far larger than the number of genes detected in each cell.
For the current set of samples the total number of genes detected across cells was 26468 out of 37764 gene in the reference, but if we look at the number of genes detected in each cell, we can see that this ranges from 40 to 5547, with a median of 2562.
genesPerCell <- colSums(counts(sce.sing) > 0)
plot(density(genesPerCell), main="", xlab="Genes per cell")
If we compare the number of UMI’s assigned to an individual gene to the number of cells in which that gene is detected, we can see that highly expressed genes tend to be detected in a higher proportion of cells than lowly expressed genes.
plot(rowSums(counts(sce.sing)) / rowSums(counts(sce.sing) > 0),
rowMeans(counts(sce.sing) > 0),
log = "x",
xlab="Mean UMIs per cell",
ylab="proportion of cells expressing the gene"
)
We could also look at the distribution of counts for individual genes across all cells. The plot below shows this distribution for the top 20 genes detected.
rel_expression <- t( t(counts(sce.sing)) / colSums(counts(sce.sing))) * 100
rownames(rel_expression) <- rowData(sce.sing)$Symbol
most_expressed <- sort(rowSums( rel_expression ), decreasing = T)[20:1]
plot_data <- as.matrix(t(rel_expression[names(most_expressed),]))
boxplot(plot_data, cex=0.1, las=1, xlab="% total count per cell", horizontal=TRUE)
The cell calling performed by CellRanger does not always retain only droplets containing cells. Poor-quality cells, or rather droplets, may be caused by cell damage during dissociation or failed library preparation. They usually have low UMI counts, few genes detected and/or high mitochondrial content. The presence of these droplets in the data set may affect normalisation, assessment of cell population heterogeneity, and clustering:
In order to remove or reduce the impact of poor-quality droplets on our downstream analysis we will attempt to filter them out using some QC metrics. The three principle means of doing this are to apply thresholds for inclusion on three characteristics:
The library size defined as the total sum of UMI counts across all genes; cells with small library sizes are considered to be of low quality as the RNA has not been efficiently captured, i.e. converted into cDNA and amplified, during library preparation.
The number of expressed genes in each cell defined as the number of genes with non-zero counts for that cell; any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured.
The proportion of UMIs mapped to genes in the mitochondrial genome; high proportions are indicative of poor-quality cells, possibly because of loss of cytoplasmic RNA from perforated cells (the reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane).
The scater
function addPerCellQC()
will compute various per droplet QC
metrics and will add this information as new columns in the droplet
annotation (colData
) of the single cell object.
We can load multiple samples at the same time using the
read10xCounts
command. This will create a single object
containing the data for multiple samples. We can then QC and filter the
samples in conjunction. As we will see later, this is not always optimal
when samples have been processed in multiple batches.
As an example we will one sample from each sample group. Again we
will start with the filtered counts matrix, which only contains cells
called by CellRanger. We pass the read10xCounts
a named
vector containing the paths to the filtered counts matrices that we wish
to load; the names of the vector will be used as the sample names in the
Single Cell Experiment object.
samples <- samplesheet$Sample[c(1,5,7,9)]
list_of_files <- str_c("Data/CellRanger_Outputs/",
samples,
"/outs/filtered_feature_bc_matrix")
names(list_of_files) <- samples
list_of_files
## SRR9264343
## "Data/CellRanger_Outputs/SRR9264343/outs/filtered_feature_bc_matrix"
## SRR9264347
## "Data/CellRanger_Outputs/SRR9264347/outs/filtered_feature_bc_matrix"
## SRR9264349
## "Data/CellRanger_Outputs/SRR9264349/outs/filtered_feature_bc_matrix"
## SRR9264351
## "Data/CellRanger_Outputs/SRR9264351/outs/filtered_feature_bc_matrix"
sce <- read10xCounts(list_of_files, col.names=TRUE, BPPARAM = bp.params)
sce
## class: SingleCellExperiment
## dim: 37764 14809
## metadata(1): Samples
## assays(1): counts
## rownames(37764): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
## ENSG00000277196
## rowData names(3): ID Symbol Type
## colnames(14809): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
## 4_TTTGGTTAGGTGCTAG-1 4_TTTGGTTGTGCATCTA-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Currently, the droplet annotation in colData
slot of the
sce
object has two columns: “Sample” and “Barcode”. The
“Sample” is the name of the sample as we provided it to
read10xCounts
, the “Barcode” is the barcode for the droplet
(cell).
colData(sce)
## DataFrame with 14809 rows and 2 columns
## Sample Barcode
## <character> <character>
## 1_AAACCTGAGACTTTCG-1 SRR9264343 AAACCTGAGACTTTCG-1
## 1_AAACCTGGTCTTCAAG-1 SRR9264343 AAACCTGGTCTTCAAG-1
## 1_AAACCTGGTGCAACTT-1 SRR9264343 AAACCTGGTGCAACTT-1
## 1_AAACCTGGTGTTGAGG-1 SRR9264343 AAACCTGGTGTTGAGG-1
## 1_AAACCTGTCCCAAGTA-1 SRR9264343 AAACCTGTCCCAAGTA-1
## ... ... ...
## 4_TTTCCTCGTCGTCTTC-1 SRR9264351 TTTCCTCGTCGTCTTC-1
## 4_TTTGCGCTCTTGTTTG-1 SRR9264351 TTTGCGCTCTTGTTTG-1
## 4_TTTGGTTAGATAGTCA-1 SRR9264351 TTTGGTTAGATAGTCA-1
## 4_TTTGGTTAGGTGCTAG-1 SRR9264351 TTTGGTTAGGTGCTAG-1
## 4_TTTGGTTGTGCATCTA-1 SRR9264351 TTTGGTTGTGCATCTA-1
The “Barcode” column contains the cell/droplet barcode and comprises the actual sequence and a ‘group ID’, e.g. AAACCTGAGAAACCAT-1. The ‘group ID’ helps distinguish cells from different samples that have identical barcode sequences, however, as each sample was processed separately with CellRanger, the group ID is set to 1 in all data sets. In the rownames DropUtils has helpfully add a prefix to each barcode to distinguish between samples. We will replace the “Barcode” column with the these.
We will also add information from the sample metadata table to the
colData
object. We will be using the merge
function to do this. Unfortunately, this function removes the rownames
from the DFrame, so we will need to replace them.
sce$Barcode <- rownames(colData(sce))
colData(sce) <- merge(colData(sce), samplesheet, by="Sample", sort=FALSE)
rownames(colData(sce)) <- sce$Barcode
Although the count matrix has 37764 genes, many of these will not have been detected in any droplet.
detected_genes <- rowSums(counts(sce)) > 0
table(detected_genes)
## detected_genes
## FALSE TRUE
## 6591 31173
About a fifth of the genes have not been detected in any droplet. We can remove these before proceeding in order to reduce the size of the single cell experiment object.
sce <- sce[detected_genes,]
In order to assess the percentage of mitochondrial UMIs, we will need to be able to identify mitochondrial genes. The simplest way to do this is to annotate the genes with their chromosome of origin.
There are many ways we could annotate our genes in R. We will use
AnnotationHub
. AnnotationHub has access to a large number
of annotation databases. Our genes are currently annotated with Ensembl
IDs, so we will use Ensembl human database. We will also specify that we
want the database corresponding to Ensembl release 107 as this the
release from which the CellRanger gene annotation was generated.
ah <- AnnotationHub()
ens.hs.107<- query(ah, c("Homo sapiens", "EnsDb", 107))[[1]]
genes <- rowData(sce)$ID
gene_annot <- AnnotationDbi::select(ens.hs.107,
keys = genes,
keytype = "GENEID",
columns = c("GENEID", "SEQNAME")) %>%
set_names(c("ID", "Chromosome"))
rowData(sce) <- merge(rowData(sce), gene_annot, by = "ID", sort=FALSE)
rownames(rowData(sce)) <- rowData(sce)$ID
rowData(sce)
## DataFrame with 31173 rows and 4 columns
## ID Symbol Type Chromosome
## <character> <character> <character> <character>
## ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression 1
## ENSG00000238009 ENSG00000238009 ENSG00000238009 Gene Expression 1
## ENSG00000241860 ENSG00000241860 ENSG00000241860 Gene Expression 1
## ENSG00000235146 ENSG00000235146 ENSG00000235146 Gene Expression 1
## ENSG00000229905 ENSG00000229905 ENSG00000229905 Gene Expression 1
## ... ... ... ... ...
## ENSG00000276345 ENSG00000276345 ENSG00000276345 Gene Expression KI270721.1
## ENSG00000277856 ENSG00000277856 ENSG00000277856 Gene Expression KI270726.1
## ENSG00000275063 ENSG00000275063 ENSG00000275063 Gene Expression KI270726.1
## ENSG00000276017 ENSG00000276017 ENSG00000276017 Gene Expression KI270734.1
## ENSG00000278817 ENSG00000278817 ENSG00000278817 Gene Expression KI270734.1
We can now add per cell QC metrics to the droplet annotation using
the function addPerCellQC
. In order to get the metrics for
the subset of mitochondrial genes, we need to pass the function a vector
indicating which genes are mitochondrial.
is.mito <- which(rowData(sce)$Chromosome=="MT")
sce <- addPerCellQC(sce, subsets=list(Mito=is.mito), BPPARAM = bp.params)
The function has added six columns to the droplet annotation:
We will use sum, detected, and subsets_Mito_percent to further filter the cells.
colData(sce)
## DataFrame with 14809 rows and 10 columns
## Sample Barcode SampleName SampleGroup
## <character> <character> <character> <character>
## 1_AAACCTGAGACTTTCG-1 SRR9264343 1_AAACCTGAGACTTTCG-1 ETV6-RUNX1_1 ETV6-RUNX1
## 1_AAACCTGGTCTTCAAG-1 SRR9264343 1_AAACCTGGTCTTCAAG-1 ETV6-RUNX1_1 ETV6-RUNX1
## 1_AAACCTGGTGCAACTT-1 SRR9264343 1_AAACCTGGTGCAACTT-1 ETV6-RUNX1_1 ETV6-RUNX1
## 1_AAACCTGGTGTTGAGG-1 SRR9264343 1_AAACCTGGTGTTGAGG-1 ETV6-RUNX1_1 ETV6-RUNX1
## 1_AAACCTGTCCCAAGTA-1 SRR9264343 1_AAACCTGTCCCAAGTA-1 ETV6-RUNX1_1 ETV6-RUNX1
## ... ... ... ... ...
## 4_TTTCCTCGTCGTCTTC-1 SRR9264351 4_TTTCCTCGTCGTCTTC-1 PBMMC_1 PBMMC
## 4_TTTGCGCTCTTGTTTG-1 SRR9264351 4_TTTGCGCTCTTGTTTG-1 PBMMC_1 PBMMC
## 4_TTTGGTTAGATAGTCA-1 SRR9264351 4_TTTGGTTAGATAGTCA-1 PBMMC_1 PBMMC
## 4_TTTGGTTAGGTGCTAG-1 SRR9264351 4_TTTGGTTAGGTGCTAG-1 PBMMC_1 PBMMC
## 4_TTTGGTTGTGCATCTA-1 SRR9264351 4_TTTGGTTGTGCATCTA-1 PBMMC_1 PBMMC
## sum detected subsets_Mito_sum subsets_Mito_detected
## <numeric> <integer> <numeric> <integer>
## 1_AAACCTGAGACTTTCG-1 8315 2920 293 13
## 1_AAACCTGGTCTTCAAG-1 14916 4318 570 12
## 1_AAACCTGGTGCAACTT-1 1567 931 431 11
## 1_AAACCTGGTGTTGAGG-1 10433 3628 427 12
## 1_AAACCTGTCCCAAGTA-1 10410 3320 528 13
## ... ... ... ... ...
## 4_TTTCCTCGTCGTCTTC-1 6652 2329 777 12
## 4_TTTGCGCTCTTGTTTG-1 13944 3728 330 11
## 4_TTTGGTTAGATAGTCA-1 3974 1613 143 10
## 4_TTTGGTTAGGTGCTAG-1 6916 2258 287 12
## 4_TTTGGTTGTGCATCTA-1 2301 827 43 9
## subsets_Mito_percent total
## <numeric> <numeric>
## 1_AAACCTGAGACTTTCG-1 3.52375 8315
## 1_AAACCTGGTCTTCAAG-1 3.82140 14916
## 1_AAACCTGGTGCAACTT-1 27.50479 1567
## 1_AAACCTGGTGTTGAGG-1 4.09278 10433
## 1_AAACCTGTCCCAAGTA-1 5.07205 10410
## ... ... ...
## 4_TTTCCTCGTCGTCTTC-1 11.68070 6652
## 4_TTTGCGCTCTTGTTTG-1 2.36661 13944
## 4_TTTGGTTAGATAGTCA-1 3.59839 3974
## 4_TTTGGTTAGGTGCTAG-1 4.14980 6916
## 4_TTTGGTTGTGCATCTA-1 1.86875 2301
Before moving on to do the actual cell filtering, it is always a good idea to explore the distribution of the metrics across the droplets.
We can use the scater
function plotColData
to generate plots that provide a look at these distributions on a per
sample basis.
plotColData(sce, x="SampleName", y="sum") +
scale_y_log10() +
ggtitle("Total count")
plotColData(sce, x="SampleName", y="detected") +
scale_y_log10() +
ggtitle("Detected features")
plotColData(sce, x="SampleName", y="subsets_Mito_percent") +
ggtitle("Mito percent")
A scatter plot shows the extent to which library size and numbers of genes detected are correlated.
colData(sce) %>%
as.data.frame() %>%
arrange(subsets_Mito_percent) %>%
ggplot(aes(x = sum, y = detected)) +
geom_point(aes(colour = subsets_Mito_percent > 10)) +
facet_wrap(vars(SampleGroup))
One could use hard threshold for the library size, number of genes detected and mitochondrial content based on the distributions seen above. These would need vary across runs and the decision making process is somewhat arbitrary. It may therefore be preferable to rely on outlier detection to identify cells that markedly differ from most cells.
We saw above that the distribution of the QC metrics is close to Normal. Hence, we can detect outliers using the median and the median absolute deviation (MAD) from the median (not the mean and the standard deviation which both are sensitive to outliers).
For a given metric, an outlier value is one that lies over some number of MADs away from the median. A cell will be excluded if it is an outlier in the part of the range to avoid, for example low gene counts, or high mitochondrial content. For a normal distribution, a threshold defined with a distance of 3 MADs from the median retains about 99% of values.
The scater
function isOutlier
can be used
to detect outlier cells based on any metric in the colData
table. It returns a boolean vector that identifies outliers. By default
it will mark any cell that is 3 MADS in either direction from the median
as an outlier.
With library size we wish to identify outliers that have very low library sizes, this indicates that the droplets either contain poor quality cells, perhaps damaged or dying, or do not contain a cell at all.
The library size distribution tends to have a long tail to the right (small numbers of cells with very high UMI counts). We therefore log transform the library size in order to the make the distribution closer to normal. This also improves the resolution of the smaller library sizes and ensures that we do not end up with negative threshold.
low_lib_size <- isOutlier(sce$sum, log=TRUE, type="lower")
table(low_lib_size)
## low_lib_size
## FALSE
## 14809
This has excluded 0 cells. We can view the threshold values to check that they seem reasonable.
attr(low_lib_size, "thresholds")
## lower higher
## 373.1028 Inf
We can view the effect of the filtering using
plotColData
.
colData(sce)$low_lib_size <- low_lib_size
plotColData(sce, x="SampleName", y="sum", colour_by = "low_lib_size") +
scale_y_log10() +
labs(y = "Total count", title = "Total count") +
guides(colour=guide_legend(title="Discarded"))
As with the library size, we will log tranform the number of genes detected prior to filtering using the median absolute deviation.
low_n_features <- isOutlier(sce$detected, log=TRUE, type="lower")
table(low_n_features)
## low_n_features
## FALSE TRUE
## 14636 173
This has excluded out 173 cells. The threshold value was:
attr(low_n_features, "thresholds")[1]
## lower
## 387.8842
We can view the effect of the filtering using
plotColData
.
colData(sce)$low_n_features <- low_n_features
plotColData(sce, x="SampleName", y="detected", colour_by = "low_n_features") +
scale_y_log10() +
labs(y = "Genes detected", title = "Genes detected") +
guides(colour=guide_legend(title="Discarded"))
For the mitochondrial content the exclusion zone is in the higher part of the distribution. For this reason we do not need to worry about log transforming the data as want to remove the long right hand tail anyway.
high_Mito_percent <- isOutlier(sce$subsets_Mito_percent, type="higher")
table(high_Mito_percent)
## high_Mito_percent
## FALSE TRUE
## 13089 1720
This has removed 1720 cells in total. The upper threshold value:
attr(high_Mito_percent, "thresholds")[2]
## higher
## 11.14249
We can view the effect of the filtering using
plotColData
.
colData(sce)$high_Mito_percent <- high_Mito_percent
plotColData(sce,
x="SampleName",
y="subsets_Mito_percent",
colour_by = "high_Mito_percent") +
labs(y = "Percentage mitochondrial UMIs",
title = "Mitochondrial UMIs") +
guides(colour=guide_legend(title="Discarded"))
Having applied each of the three thresholds separately, we can now combine them to see how many droplets in total we will be excluding.
tibble(low_lib_size, low_n_features, high_Mito_percent) %>%
mutate(discard = low_lib_size | low_n_features | high_Mito_percent) %>%
mutate(SampleName=colData(sce)$SampleName) %>%
group_by(SampleName) %>%
summarise(across(where(is.logical), sum))
## # A tibble: 4 × 5
## SampleName low_lib_size low_n_features high_Mito_percent discard
## <chr> <int> <int> <int> <int>
## 1 ETV6-RUNX1_1 0 13 242 250
## 2 HHD_1 0 56 1132 1142
## 3 PBMMC_1 0 19 24 35
## 4 PRE-T_1 0 85 322 355
The three steps above may be run in one go using the
quickPerCellQC
function. This creates a DataFrame with 4
columns containing TRUE/FALSE - one for each filter metric and one
called “discard” that combined the three logicals.
cell_qc_results <- quickPerCellQC(colData(sce), sub.fields = TRUE)
cell_qc_results %>%
as.data.frame() %>%
mutate(SampleName=colData(sce)$SampleName) %>%
group_by(SampleName) %>%
summarise(across(where(is.logical), sum))
## # A tibble: 4 × 5
## SampleName low_lib_size low_n_features high_subsets_Mito_percent discard
## <chr> <int> <int> <int> <int>
## 1 ETV6-RUNX1_1 0 13 242 250
## 2 HHD_1 0 56 1132 1142
## 3 PBMMC_1 0 19 24 35
## 4 PRE-T_1 0 85 322 355
Data quality depends on the tissue analysed, some being difficult to dissociate, e.g. brain, so that one level of QC stringency will not fit all data sets.
Filtering based on QC metrics as done here assumes that these QC metrics are not correlated with biology. This may not necessarily be true in highly heterogenous data sets where some cell types represented by good-quality cells may have low RNA content or high mitochondrial content.
The samples analysed here may have been processed in different
batches leading to differences in the overall distribution of UMI
counts, numbers of genes detected and mitochondrial content. Such
differences would affect the adaptive thesholds discussed above - that
is, as the distributions of the metrics differ, perhaps we should really
apply the adaptive thresholding for each batch rather than universally
across all samples. The quickPerCellQC
has a “batch”
argument that allows us to specify with samples belong to which batches.
The batches are then filtered independently.
batch.cell_qc_results <- quickPerCellQC(colData(sce),
sub.fields = TRUE,
batch=sce$Sample)
batch.cell_qc_results %>%
as.data.frame() %>%
mutate(SampleName=colData(sce)$SampleName) %>%
group_by(SampleName) %>%
summarise(across(where(is.logical), sum))
## # A tibble: 4 × 5
## SampleName low_lib_size low_n_features high_subsets_Mito_percent discard
## <chr> <int> <int> <int> <int>
## 1 ETV6-RUNX1_1 168 231 299 380
## 2 HHD_1 0 18 992 998
## 3 PBMMC_1 6 29 54 70
## 4 PRE-T_1 0 74 837 868
The table below shows how the thresholds for each metric differ between the batch-wise analysis and the analysis using all samples.
all.thresholds <- tibble(`SampleName`="All",
`Library Size`=attr(cell_qc_results$low_lib_size, "thresholds")[1],
`Genes detected`=attr(cell_qc_results$low_n_features, "thresholds")[1],
`Mitochondrial UMIs`=attr(cell_qc_results$high_subsets_Mito_percent, "thresholds")[2])
tibble(`Sample`=names(attr(batch.cell_qc_results$low_lib_size, "thresholds")[1,]),
`Library Size`=attr(batch.cell_qc_results$low_lib_size, "thresholds")[1,],
`Genes detected`=attr(batch.cell_qc_results$low_n_features, "thresholds")[1,],
`Mitochondrial UMIs`=attr(batch.cell_qc_results$high_subsets_Mito_percent, "thresholds")[2,]) %>%
left_join(samplesheet) %>%
select(SampleName, `Library Size`, `Genes detected`, `Mitochondrial UMIs`) %>%
bind_rows(all.thresholds) %>%
mutate(across(where(is.numeric), round, digits=2)) %>%
datatable(rownames = FALSE, options = list(dom="t"))
Let’s replace the columns in the droplet annotation with these new filters.
sce$low_lib_size <- batch.cell_qc_results$low_lib_size
sce$low_n_features <- batch.cell_qc_results$low_n_features
sce$high_Mito_percent <- batch.cell_qc_results$high_subsets_Mito_percent
sce$discard <- batch.cell_qc_results$discard
We can visualise how the new filters look using violin plots.
plotColData(sce, x="SampleName", y="sum", colour_by = "low_lib_size") +
scale_y_log10() +
labs(y = "Total count", title = "Total count") +
guides(colour=guide_legend(title="Discarded"))
plotColData(sce, x="SampleName", y="detected", colour_by = "low_n_features") +
scale_y_log10() +
labs(y = "Genes detected", title = "Genes detected") +
guides(colour=guide_legend(title="Discarded"))
plotColData(sce,
x="Sample",
y="subsets_Mito_percent",
colour_by = "high_Mito_percent") +
labs(y = "Percentage mitochondrial UMIs",
title = "Mitochondrial UMIs") +
guides(colour=guide_legend(title="Discarded"))
There are some distinct differences, most noticeable is that some cells are now being filtered based on library size for both ETV6-RUNX1_1 and PBMMC_1a. The venn diagrams below show how the number of discarded droplets in have changed for each filter in comparison to when the MAD filtering was applied across all samples.
pc1 <- tibble(`All together`=cell_qc_results$low_lib_size,
`By batch`=batch.cell_qc_results$low_lib_size) %>%
ggvenn(show_percentage = FALSE) +
labs(title="Library Size")
pc2 <- tibble(`All together`=cell_qc_results$low_n_features,
`By batch`=batch.cell_qc_results$low_n_features) %>%
ggvenn(show_percentage = FALSE) +
labs(title="Genes detected")
pc3 <- tibble(`All together`=cell_qc_results$high_subsets_Mito_percent,
`By batch`=batch.cell_qc_results$high_subsets_Mito_percent) %>%
ggvenn(show_percentage = FALSE) +
labs(title="Mitochondrial UMIs")
pc1 + pc2 + pc3
The most striking difference is in the filtering by library size. As we can see from the violin plots ETV6-RUNX1_1 has a markedly different library size distribution to the other three samples. When we applied the adaptive filters across all samples, the lower distributions of the other three samples caused the MADs to be distorted and resulted in a threshold that was inappropriately low for the ETV6-RUNX1_1 samples.
Now that we have identified poor quality cells we can filter them out before proceeding to do any further analysis.
sce.filtered <- sce[, !sce$discard]
In the previous approach we used the three metrics in isolation to filter droplets. Another approach is to combine the three (or more) metrics in a single filtering step by looking for outliers in the multi-dimensional space defined by the metrics.
As with the adaptive thresholds above, this method should not be applied across batches or samples with differing distributions in the metrics or it will exclude many good quality cells. To demonstrate these methods, we’ll just extract one sample from our SingleCellExperiment object.
sce.E1 <- sce[ , sce$SampleName == "ETV6-RUNX1_1"]
Essentially we need to reduce our 3 metrics to a single metric, we
can then use isOutlier
to select outliers based on this
metric. One way to do this is to use the function
adjOutlyingness
from the robustbase
package.
This function computes the “outlyingness” for each droplet.
Here we will use the same three metrics as before: library size, the
number of genes detected and the mitochondrial content. Remember that
for “sum” (total UMIs) and “detected” (number of genes detected), we
want to use the log10
value.
library(robustbase)
stats <- cbind(log10(sce.E1$sum),
log10(sce.E1$detected),
sce.E1$subsets_Mito_percent)
outlying <- adjOutlyingness(stats, only.outlyingness = TRUE)
multi.outlier <- isOutlier(outlying, type = "higher")
summary(multi.outlier)
## Mode FALSE TRUE
## logical 3007 146
Another approach is to perform a principal component analysis (PCA)
on the table of metrics, apply adjOutlyingness
to the
metrics table and use this to detect outliers. The scater
function runColDataPCA
can be used to perform the PCA and
detect outliers. We’ll need to add a couple of columns to the
colData
for the log10 metrics first.
sce.E1$log10sum <- log10(sce.E1$sum)
sce.E1$log10detected <- log10(sce.E1$detected)
sce.E1 <- runColDataPCA(sce.E1,
variables=list("log10sum",
"log10detected",
"subsets_Mito_percent"),
outliers=TRUE,
BPPARAM = bp.params)
This has added the results of the principal component analysis into a
new slot in the SingleCellExperiment object specifically for holding the
results of dimension reduction transformations such as PCA, t-SNE and
UMAP. The results can be accessed using the reducedDim
function.
head(reducedDim(sce.E1))
## PC1 PC2
## 1_AAACCTGAGACTTTCG-1 -0.8654193 0.03100502
## 1_AAACCTGGTCTTCAAG-1 -1.8810440 0.68452367
## 1_AAACCTGGTGCAACTT-1 4.7281423 2.80380801
## 1_AAACCTGGTGTTGAGG-1 -1.2935022 0.41762932
## 1_AAACCTGTCCCAAGTA-1 -1.0744934 0.53511048
## 1_AAACCTGTCGAATGCT-1 1.2569088 -1.13761549
It has also added a column “outlier” to the colData
,
which specifies the droplets that have been identified as outliers.
summary(sce.E1$outlier)
## Mode FALSE TRUE
## logical 3013 140
These types of approach can provide more power for detecting outliers as they are looking at patterns across multiple metrics, however, it can be difficult to interpret the reason why any particular droplet has been excluded.
A useful diagnostic plot for assessing the impact of the filtering is to do a scatter plot of the mitochondrial content against the library size. We can overlay our final filter metric using the point colour.
plotColData(sce,
x="sum",
y="subsets_Mito_percent",
other_fields="SampleName",
colour_by="discard") +
facet_wrap(~SampleName, ncol=5, scale="free_x")
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] robustbase_0.95-0 ggvenn_0.1.9
## [3] patchwork_1.1.2 forcats_0.5.2
## [5] stringr_1.4.1 dplyr_1.0.10
## [7] purrr_0.3.4 readr_2.1.2
## [9] tidyr_1.2.0 tibble_3.1.8
## [11] tidyverse_1.3.2 BiocParallel_1.30.3
## [13] AnnotationHub_3.4.0 BiocFileCache_2.4.0
## [15] dbplyr_2.2.1 ensembldb_2.20.2
## [17] AnnotationFilter_1.20.0 GenomicFeatures_1.48.3
## [19] AnnotationDbi_1.58.0 scater_1.24.0
## [21] ggplot2_3.3.6 scuttle_1.6.3
## [23] DropletUtils_1.16.0 SingleCellExperiment_1.18.0
## [25] SummarizedExperiment_1.26.1 Biobase_2.56.0
## [27] GenomicRanges_1.48.0 GenomeInfoDb_1.32.3
## [29] IRanges_2.30.1 S4Vectors_0.34.0
## [31] BiocGenerics_0.42.0 MatrixGenerics_1.8.1
## [33] matrixStats_0.62.0 knitr_1.40
## [35] DT_0.24
##
## loaded via a namespace (and not attached):
## [1] readxl_1.4.1 backports_1.4.1
## [3] lazyeval_0.2.2 crosstalk_1.2.0
## [5] digest_0.6.29 htmltools_0.5.3
## [7] viridis_0.6.2 fansi_1.0.3
## [9] magrittr_2.0.3 memoise_2.0.1
## [11] ScaledMatrix_1.4.0 googlesheets4_1.0.1
## [13] tzdb_0.3.0 limma_3.52.2
## [15] Biostrings_2.64.1 modelr_0.1.9
## [17] vroom_1.5.7 R.utils_2.12.0
## [19] prettyunits_1.1.1 colorspace_2.0-3
## [21] rvest_1.0.3 blob_1.2.3
## [23] rappdirs_0.3.3 ggrepel_0.9.1
## [25] haven_2.5.1 xfun_0.32
## [27] crayon_1.5.1 RCurl_1.98-1.8
## [29] jsonlite_1.8.0 glue_1.6.2
## [31] gargle_1.2.0 gtable_0.3.1
## [33] zlibbioc_1.42.0 XVector_0.36.0
## [35] DelayedArray_0.22.0 BiocSingular_1.12.0
## [37] Rhdf5lib_1.18.2 DEoptimR_1.0-11
## [39] HDF5Array_1.24.2 scales_1.2.1
## [41] DBI_1.1.3 edgeR_3.38.4
## [43] Rcpp_1.0.9 viridisLite_0.4.1
## [45] xtable_1.8-4 progress_1.2.2
## [47] dqrng_0.3.0 bit_4.0.4
## [49] rsvd_1.0.5 htmlwidgets_1.5.4
## [51] httr_1.4.4 ellipsis_0.3.2
## [53] farver_2.1.1 pkgconfig_2.0.3
## [55] XML_3.99-0.10 R.methodsS3_1.8.2
## [57] sass_0.4.2 locfit_1.5-9.6
## [59] utf8_1.2.2 labeling_0.4.2
## [61] tidyselect_1.1.2 rlang_1.0.5
## [63] later_1.3.0 cellranger_1.1.0
## [65] munsell_0.5.0 BiocVersion_3.15.2
## [67] tools_4.2.0 cachem_1.0.6
## [69] cli_3.3.0 generics_0.1.3
## [71] RSQLite_2.2.16 broom_1.0.1
## [73] evaluate_0.16 fastmap_1.1.0
## [75] yaml_2.3.5 fs_1.5.2
## [77] bit64_4.0.5 KEGGREST_1.36.3
## [79] sparseMatrixStats_1.8.0 mime_0.12
## [81] R.oo_1.25.0 xml2_1.3.3
## [83] biomaRt_2.52.0 compiler_4.2.0
## [85] rstudioapi_0.14 beeswarm_0.4.0
## [87] filelock_1.0.2 curl_4.3.2
## [89] png_0.1-7 interactiveDisplayBase_1.34.0
## [91] reprex_2.0.2 bslib_0.4.0
## [93] stringi_1.7.8 highr_0.9
## [95] lattice_0.20-45 ProtGenerics_1.28.0
## [97] Matrix_1.4-1 vctrs_0.4.1
## [99] pillar_1.8.1 lifecycle_1.0.1
## [101] rhdf5filters_1.8.0 BiocManager_1.30.18
## [103] jquerylib_0.1.4 BiocNeighbors_1.14.0
## [105] cowplot_1.1.1 bitops_1.0-7
## [107] irlba_2.3.5 httpuv_1.6.5
## [109] rtracklayer_1.56.1 R6_2.5.1
## [111] BiocIO_1.6.0 promises_1.2.0.1
## [113] gridExtra_2.3 vipor_0.4.5
## [115] codetools_0.2-18 assertthat_0.2.1
## [117] rhdf5_2.40.0 rjson_0.2.21
## [119] withr_2.5.0 GenomicAlignments_1.32.1
## [121] Rsamtools_2.12.0 GenomeInfoDbData_1.2.8
## [123] parallel_4.2.0 hms_1.1.2
## [125] beachmat_2.12.0 rmarkdown_2.16
## [127] DelayedMatrixStats_1.18.0 googledrive_2.0.0
## [129] lubridate_1.8.0 shiny_1.7.2
## [131] ggbeeswarm_0.6.0 restfulr_0.0.15