We will use two sets of Bone Marrow Mononuclear Cells (BMMC):
Fastq files were retrieved from publicly available archive (SRA and HCA).
Sequencing quality was assessed and visualised using fastQC and MultiQC.
Reads were aligned against GRCh38 and features counted using cellranger (v3.1.0).
We will now check the quality of the data further:
We will then:
library(DropletUtils)
library(scater)
library(ensembldb)
library(AnnotationHub)
library(BiocParallel)
library(tidyverse)
library(patchwork)
library(ggvenn)
We will load both the Caron and HCA data sets. We have already prepared a sample meta data table that relates the sample/run ID to the sample group.
samplesheet <- read_tsv("Data/sample_sheet.tsv")
We will load the data for the SRR9264343. To load the data from the CellRanger outputs, we need to use the function read10xCounts
from the DropletUtils
package.
scRNAseq data sets tend to be very large and processing them requires a large amount of computing power and can take time. Many of the commands we will use have the option to be run in parallels across multiple processors. By default they will only use a single processor, but parallelisation will greatly speed up the analysis.
We need to first set up some parallel parameters using the package BiocParallel
.
bp.params <- MulticoreParam(workers = 7)
Here were are selecting to use forked processes with MulticoreParam
and instructing the function to use 7 cores (our machines have 8, this leaves 1 to run the desktop etc.) Note that on Windows MulticoreParam
does not work and it is necessary to use SnowParam
- please refer to the BiocParallel
vignettes for further information.
We pass the function the location of the directory containing the counts matrix, cell barcodes and features (genes). In this case we will load the raw matrix.
sample.path <- "CellRanger_Outputs/SRR9264343/outs/raw_feature_bc_matrix/"
sce.raw <- read10xCounts(sample.path, col.names=TRUE, BPPARAM = bp.params)
sce.raw
## class: SingleCellExperiment
## dim: 36601 255440
## metadata(1): Samples
## assays(1): counts
## rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
## ENSG00000277196
## rowData names(3): ID Symbol Type
## colnames(255440): AAACCTGAGAAACCTA-1 AAACCTGAGAAACGAG-1 ...
## TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
SingleCellExperiment
objectThe data has been loaded as a SingleCellExperiment object. The details of the structure of the object are described here. In summary, it stores various data types in a single object. Currently it will contain:
Later we will also add the outcomes of downstream analysis such as dimensionality reduction.
Compared to bulk RNA-seq, Single-cell RNA-seq data is sparse, i.e. there many missing values or zeroes. This is particularly true with droplet-based methods such as 10X, mostly because:
We can access the counts matrix with counts
. Given the large number of droplets in a sample, count matrices can be large.
dim(counts(sce.raw))
## [1] 36601 255440
They are however very sparse, that is, most of the entries are 0’s. To save memory the counts can be stored in a ‘sparse matrix’ that only stores non-zero values, for example a dgCMatrix
object.
counts(sce.raw)[1:10, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##
## ENSG00000243485 . . . . . . . . . .
## ENSG00000237613 . . . . . . . . . .
## ENSG00000186092 . . . . . . . . . .
## ENSG00000238009 . . . . . . . . . .
## ENSG00000239945 . . . . . . . . . .
## ENSG00000239906 . . . . . . . . . .
## ENSG00000241860 . . . . . . . . . .
## ENSG00000241599 . . . . . . . . . .
## ENSG00000286448 . . . . . . . . . .
## ENSG00000236601 . . . . . . . . . .
Details about the “features” (in this case genes) can by accessed using the rowData
function. Currently it contains the ensembl gene ID and the gene symbol, which have been derived from the 10x reference used by CellRanger. It also contains a “Type” column, which tells us what sort of data we are looking at; in this case it is “Expression” for gene expression. If we wish to, we can add further annotation to the features by adding extra columns to this data frame.
head(rowData(sce.raw))
## DataFrame with 6 rows and 3 columns
## ID Symbol Type
## <character> <character> <character>
## ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression
## ENSG00000237613 ENSG00000237613 FAM138A Gene Expression
## ENSG00000186092 ENSG00000186092 OR4F5 Gene Expression
## ENSG00000238009 ENSG00000238009 AL627309.1 Gene Expression
## ENSG00000239945 ENSG00000239945 AL627309.3 Gene Expression
## ENSG00000239906 ENSG00000239906 AL627309.2 Gene Expression
Details about the droplets can be accessed using colData
. Currently it contains the sample names and droplet Barcodes. As with the feature data, we can add additional information about each droplet, e.g. counts of genes or the percentage of mitochondrial genes, to this data frame. The rows of this table correspond to the data in the columns of the count matrix; the row names of this table will match the column names of the counts matrix - currently these are the droplet barcodes.
head(colData(sce.raw))
## DataFrame with 6 rows and 2 columns
## Sample Barcode
## <character> <character>
## AAACCTGAGAAACCTA-1 CellRanger_Outputs/S.. AAACCTGAGAAACCTA-1
## AAACCTGAGAAACGAG-1 CellRanger_Outputs/S.. AAACCTGAGAAACGAG-1
## AAACCTGAGAAAGTGG-1 CellRanger_Outputs/S.. AAACCTGAGAAAGTGG-1
## AAACCTGAGAACAACT-1 CellRanger_Outputs/S.. AAACCTGAGAACAACT-1
## AAACCTGAGAACAATC-1 CellRanger_Outputs/S.. AAACCTGAGAACAATC-1
## AAACCTGAGAAGCCCA-1 CellRanger_Outputs/S.. AAACCTGAGAAGCCCA-1
colnames(as.matrix(counts(sce.raw)[1:2,1:6]))
## [1] "AAACCTGAGAAACCTA-1" "AAACCTGAGAAACGAG-1" "AAACCTGAGAAAGTGG-1"
## [4] "AAACCTGAGAACAACT-1" "AAACCTGAGAACAATC-1" "AAACCTGAGAAGCCCA-1"
In the previous example we loaded the raw matrix, in fact we are going to work with the filtered matrix, which only contains droplets called as cells by CellRanger. After this we will apply some additional QC steps to these droplets in order to remove any that we are not confident contain a real cell or those that we think contain multiple cells.
We can load multiple samples at the same time using the read10xCounts
command. This will create a single object containing the data for multiple samples. We can then QC and filter the samples in conjunction. As we will see later, this is not always optimal when samples have been processed in multiple batches.
As an example, we will begin by loading 2 samples from each sample group. This time we will start with the filtered counts matrix, which only contains cells called by CellRanger. We pass the read10xCounts
a named vector containing the paths to the filtered counts matrices that we wish to load; the names of the vector will be used as the sample names in the Single Cell Experiment object.
Note that for PBMMC_1 we have two sequencing runs, for now, we will only use one of them.
list_of_files <- samplesheet %>%
distinct(SampleName, .keep_all = TRUE) %>%
group_by(SampleGroup) %>%
slice(1:2) %>%
mutate(File=str_c("CellRanger_Outputs/",
SampleId,
"/outs/filtered_feature_bc_matrix")) %>%
pull(File, SampleId)
list_of_files
## MantonBM1
## "CellRanger_Outputs/MantonBM1/outs/filtered_feature_bc_matrix"
## MantonBM2
## "CellRanger_Outputs/MantonBM2/outs/filtered_feature_bc_matrix"
## SRR9264343
## "CellRanger_Outputs/SRR9264343/outs/filtered_feature_bc_matrix"
## SRR9264344
## "CellRanger_Outputs/SRR9264344/outs/filtered_feature_bc_matrix"
## SRR9264347
## "CellRanger_Outputs/SRR9264347/outs/filtered_feature_bc_matrix"
## SRR9264348
## "CellRanger_Outputs/SRR9264348/outs/filtered_feature_bc_matrix"
## SRR9264351
## "CellRanger_Outputs/SRR9264351/outs/filtered_feature_bc_matrix"
## SRR9264353
## "CellRanger_Outputs/SRR9264353/outs/filtered_feature_bc_matrix"
## SRR9264349
## "CellRanger_Outputs/SRR9264349/outs/filtered_feature_bc_matrix"
## SRR9264350
## "CellRanger_Outputs/SRR9264350/outs/filtered_feature_bc_matrix"
sce <- read10xCounts(list_of_files, col.names=TRUE, BPPARAM = bp.params)
sce
## class: SingleCellExperiment
## dim: 36601 82475
## metadata(1): Samples
## assays(1): counts
## rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
## ENSG00000277196
## rowData names(3): ID Symbol Type
## colnames(82475): 1_AAACCTGAGAAACCTA-1 1_AAACCTGAGACAAGCC-1 ...
## 10_TTTGTCATCTACCAGA-1 10_TTTGTCATCTGCAGTA-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
If you compare the dimensions of this object to those of the raw object above, you will see that whereas the raw object that 255440 columns, the filtered object only has **** columns.
Currently, the droplet annotation in colData
slot of the sce
object has two columns: “Sample” and “Barcode”. The “Sample” is the name of the sample as we provided it to read10xCounts
, the “Barcode” is the barcode for the droplet (cell).
colData(sce)
## DataFrame with 82475 rows and 2 columns
## Sample Barcode
## <character> <character>
## 1_AAACCTGAGAAACCTA-1 MantonBM1 AAACCTGAGAAACCTA-1
## 1_AAACCTGAGACAAGCC-1 MantonBM1 AAACCTGAGACAAGCC-1
## 1_AAACCTGAGAGACTTA-1 MantonBM1 AAACCTGAGAGACTTA-1
## 1_AAACCTGAGCAACGGT-1 MantonBM1 AAACCTGAGCAACGGT-1
## 1_AAACCTGAGCAGCGTA-1 MantonBM1 AAACCTGAGCAGCGTA-1
## ... ... ...
## 10_TTTGTCAGTGGACGAT-1 SRR9264350 TTTGTCAGTGGACGAT-1
## 10_TTTGTCATCAACGAAA-1 SRR9264350 TTTGTCATCAACGAAA-1
## 10_TTTGTCATCTAACTGG-1 SRR9264350 TTTGTCATCTAACTGG-1
## 10_TTTGTCATCTACCAGA-1 SRR9264350 TTTGTCATCTACCAGA-1
## 10_TTTGTCATCTGCAGTA-1 SRR9264350 TTTGTCATCTGCAGTA-1
The “Barcode” column contains the cell/droplet barcode and comprises the actual sequence and a ‘group ID’, e.g. AAACCTGAGAAACCAT-1. The ‘group ID’ helps distinguish cells from different samples that have identical barcode sequences, however, as each sample was processed separately with CellRanger, the group ID is set to 1 in all data sets. Some of our downstream tools will use this column to identify different droplets, so we will need modify these to be unique.
In order to distinguish droplets that originate from different samples but have the same barcode, read10XCounts
has added the “index” of the sample in the list_of_files
object (1-10) to the beginning of the cell barcode in the row names of the droplet annotation table and the column names of the count matrix. We will use this number to modify the “Barcode” column.
We will also switch the “Sample” column to be the sample name and add information from the sample sheet to the droplet annotation.
colData(sce) <- colData(sce) %>%
as.data.frame() %>%
rownames_to_column("RowName") %>%
mutate(SampleNum = str_extract(RowName, "^[0-9]+")) %>%
mutate(Barcode = str_replace(Barcode, "1$", SampleNum)) %>%
left_join(samplesheet, by=c(Sample="SampleId")) %>%
rename(SampleId=Sample) %>%
rename(Sample=SampleName) %>%
column_to_rownames("RowName") %>%
select(Sample, Barcode, SampleId, SampleGroup, DatasetName) %>%
DataFrame()
colData(sce)
## DataFrame with 82475 rows and 5 columns
## Sample Barcode SampleId SampleGroup
## <character> <character> <character> <character>
## 1_AAACCTGAGAAACCTA-1 ABMMC_1 AAACCTGAGAAACCTA-1 MantonBM1 ABMMC
## 1_AAACCTGAGACAAGCC-1 ABMMC_1 AAACCTGAGACAAGCC-1 MantonBM1 ABMMC
## 1_AAACCTGAGAGACTTA-1 ABMMC_1 AAACCTGAGAGACTTA-1 MantonBM1 ABMMC
## 1_AAACCTGAGCAACGGT-1 ABMMC_1 AAACCTGAGCAACGGT-1 MantonBM1 ABMMC
## 1_AAACCTGAGCAGCGTA-1 ABMMC_1 AAACCTGAGCAGCGTA-1 MantonBM1 ABMMC
## ... ... ... ... ...
## 10_TTTGTCAGTGGACGAT-1 PRE-T_2 TTTGTCAGTGGACGAT-10 SRR9264350 PRE-T
## 10_TTTGTCATCAACGAAA-1 PRE-T_2 TTTGTCATCAACGAAA-10 SRR9264350 PRE-T
## 10_TTTGTCATCTAACTGG-1 PRE-T_2 TTTGTCATCTAACTGG-10 SRR9264350 PRE-T
## 10_TTTGTCATCTACCAGA-1 PRE-T_2 TTTGTCATCTACCAGA-10 SRR9264350 PRE-T
## 10_TTTGTCATCTGCAGTA-1 PRE-T_2 TTTGTCATCTGCAGTA-10 SRR9264350 PRE-T
## DatasetName
## <character>
## 1_AAACCTGAGAAACCTA-1 HCA
## 1_AAACCTGAGACAAGCC-1 HCA
## 1_AAACCTGAGAGACTTA-1 HCA
## 1_AAACCTGAGCAACGGT-1 HCA
## 1_AAACCTGAGCAGCGTA-1 HCA
## ... ...
## 10_TTTGTCAGTGGACGAT-1 Caron
## 10_TTTGTCATCAACGAAA-1 Caron
## 10_TTTGTCATCTAACTGG-1 Caron
## 10_TTTGTCATCTACCAGA-1 Caron
## 10_TTTGTCATCTGCAGTA-1 Caron
The number and identity of genes detected in a cell vary greatly across cells: the total number of genes detected across all cells is far larger than the number of genes detected in each cell.
For the current set of samples the total number of genes detected across cells was 29256 out of 36601 gene in the reference, but if we look at the number of genes detected in each cell, we can see that this ranges from 27 to 8337, with a median of 998.
genesPerCell <- colSums(counts(sce) > 0)
summary(genesPerCell)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27 813 998 1304 1552 8337
plot(density(genesPerCell), main="", xlab="Genes per cell")
If we compare the number of UMI’s assigned to an individual gene to the number of cells in which that gene is detected, we can see that highly expressed genes tend to be detected in a higher proportion of cells than lowly expressed genes.
set.seed(232)
tmpCounts <- counts(sce)[,sample(ncol(sce), 1000)]
plot(rowSums(tmpCounts),
rowMeans(tmpCounts > 0),
log = "x",
xlab="total number of UMIs",
ylab="proportion of cells expressing the gene"
)
rm(tmpCounts)
We could also look at the distribution of counts for individual genes across all cells. The plot below shows this distribution for the top 20 genes detected.
rel_expression <- t( t(counts(sce)) / colSums(counts(sce))) * 100
rownames(rel_expression) <- rowData(sce)$Symbol
most_expressed <- sort(rowSums( rel_expression ),T)[20:1]
plot_data <- as.matrix(t(rel_expression[names(most_expressed),]))
boxplot(plot_data, cex=0.1, las=1, xlab="% total count per cell", horizontal=TRUE)
Although the count matrix has 36601 genes, many of these will not have been detected in any droplet.
detected_genes <- rowSums(counts(sce)) > 0
table(detected_genes)
## detected_genes
## FALSE TRUE
## 7345 29256
About a fifth of the genes have not been detected in any droplet. We can remove these before proceeding in order to reduce the size of the single cell experiment object.
sce <- sce[detected_genes,]
The cell calling performed by CellRanger does not always retain only droplets containing cells. Poor-quality cells, or rather droplets, may be caused by cell damage during dissociation or failed library preparation. They usually have low UMI counts, few genes detected and/or high mitochondrial content. The presence of these droplets in the data set may affect normalisation, assessment of cell population heterogeneity, clustering and trajectory:
In order to remove or reduce the impact of poor-quality droplets on our downstream analysis we will attempt to filter them out using some QC metrics. The three principle means of doing this are to apply thresholds for inclusion on three characteristics:
The library size defined as the total sum of UMI counts across all genes; cells with small library sizes are considered to be of low quality as the RNA has not been efficiently captured, i.e. converted into cDNA and amplified, during library preparation.
The number of expressed genes in each cell defined as the number of genes with non-zero counts for that cell; any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured.
The proportion of UMIs mapped to genes in the mitochondrial genome; high proportions are indicative of poor-quality cells, possibly because of loss of cytoplasmic RNA from perforated cells (the reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane).
The scater function addPerCellQC()
will compute various per droplet QC metrics and will add this information as new columns in the droplet annotation (colData
) of the single cell object.
In order to assess the percentage of mitochondrial UMIs, we will need to be able to identify mitochondrial genes. The simplest way to do this is to annotate the genes with their chromosome of origin.
There are many ways we could annotate our genes in R. We will use AnnotationHub
.
ah <- AnnotationHub()
ens.mm.98 <- query(ah, c("Homo sapiens", "EnsDb", 98))[[1]]
genes <- rowData(sce)$ID
gene_annot <- AnnotationDbi::select(ens.mm.98,
keys = genes,
keytype = "GENEID",
columns = c("GENEID", "SEQNAME")) %>%
set_names(c("ID", "Chromosome"))
rowData(sce) <- merge(rowData(sce), gene_annot, by = "ID", sort=FALSE)
rownames(rowData(sce)) <- rowData(sce)$ID
rowData(sce)
## DataFrame with 29256 rows and 4 columns
## ID Symbol Type Chromosome
## <character> <character> <character> <character>
## ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression 1
## ENSG00000238009 ENSG00000238009 AL627309.1 Gene Expression 1
## ENSG00000239945 ENSG00000239945 AL627309.3 Gene Expression 1
## ENSG00000241860 ENSG00000241860 AL627309.5 Gene Expression 1
## ENSG00000241599 ENSG00000241599 AL627309.4 Gene Expression 1
## ... ... ... ... ...
## ENSG00000275869 ENSG00000275869 AC136612.1 Gene Expression KI270728.1
## ENSG00000277761 ENSG00000277761 AC136616.2 Gene Expression KI270728.1
## ENSG00000277836 ENSG00000277836 AC141272.1 Gene Expression KI270728.1
## ENSG00000278633 ENSG00000278633 AC023491.2 Gene Expression KI270731.1
## ENSG00000278817 ENSG00000278817 AC007325.4 Gene Expression KI270734.1
Number of genes per chromosome, including 13 on the mitochondrial genome:
rowData(sce) %>%
as.data.frame() %>%
count(Chromosome, name = "Number of Genes") %>%
datatable(rownames = FALSE)
We can add the per cell QC metrics to the droplet annotation using the function addPerCellQC
. In order to get the percentage of mitochondrial UMIs in each cell, we need to pass the function a vector indicating which genes are mitochondrial.
is.mito <- which(rowData(sce)$Chromosome=="MT")
sce <- addPerCellQC(sce, subsets=list(Mito=is.mito), BPPARAM = bp.params)
The function has added six columns to the droplet annotation:
We will use sum, detected, and subsets_Mito_percent to further filter the cells.
head(colData(sce)) %>%
as.data.frame() %>%
datatable(rownames = FALSE)
Before moving on to do the actual cell filtering, it is always a good idea to explore the distribution of the metrics across the droplets.
First let’s look across all cells in the data set.
par(mfrow=c(1, 3))
hist(log10(sce$sum),
breaks=20,
col="grey80",
xlab="log10(Total UMI count)",
main="")
hist(sce$detected,
breaks=20,
col="grey80",
xlab="Total genes detected",
main="")
hist(sce$subsets_Mito_percent,
breaks=20,
col="grey80",
xlab="Proportion of reads in mitochondrial genes",
main="")
This gives us an overview of all the cells in the experiment, but to get a better idea we need to have a look at these distributions on a per sample basis. The scater
function plotColData
provides a simple interface for generating these plots.
p1 <- plotColData(sce, x="Sample", y="sum",other_fields="SampleGroup") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
ggtitle("Total count")
p2 <- plotColData(sce, x="Sample", y="detected", other_fields="SampleGroup") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
ggtitle("Detected features")
p3 <- plotColData(sce, x="Sample", y="subsets_Mito_percent", other_fields="SampleGroup") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
ggtitle("Mito percent")
p1 / p2 / p3
A scatter plot shows the extent to which library size and numbers of genes detected are correlated.
ggplot(data.frame(colData(sce)),
aes(x=sum,
y=detected,
col=subsets_Mito_percent)) +
geom_point() +
facet_wrap(~SampleGroup)
One can use hard threshold for the library size, number of genes detected and mitochondrial content. These will however vary across runs. It may therefore be preferable to rely on outlier detection to identify cells that markedly differ from most cells.
We saw above that the distribution of the QC metrics is close to Normal. Hence, we can detect outliers using the median and the median absolute deviation (MAD) from the median (not the mean and the standard deviation that both are sensitive to outliers).
For a given metric, an outlier value is one that lies over some number of MADs away from the median. A cell will be excluded if it is an outlier in the part of the range to avoid, for example low gene counts, or high mitochondrial content. For a normal distribution, a threshold defined with a distance of 3 MADs from the median retains about 99% of values.
The scater
function isOutlier
can be used to detect outlier cells based on any metric in the colData
table. It returns a boolean vector which identifies outliers. By default it will mark any cell that is 3 MADS in either direction from the median as an outlier.
With library size we wish to identify outliers that have very low library sizes, this indicates that the droplets either contain poor quality cells, perhaps damaged or dying, or do not contain a cell at all.
The library size distribution tends to have a long tail to the right (small numbers of cells with very high UMI counts). We therefore log transform the library size in order to the make the distribution closer to normal. This also improves the resolution of the smaller library sizes and ensures that we do not end up with negative threshold.
low_lib_size <- isOutlier(sce$sum, log=TRUE, type="lower")
table(low_lib_size)
## low_lib_size
## FALSE TRUE
## 80439 2036
This has excluded 2036 cells. We can view the threshold values to check that they seem reasonable.
attr(low_lib_size, "thresholds")[1]
## lower
## 861.9801
We can view the effect of the filtering using plotColData
.
colData(sce)$low_lib_size <- low_lib_size
plotColData(sce,
x="Sample",
y="sum",
other_fields="SampleGroup",
colour_by = "low_lib_size") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
labs(y = "Total count", title = "Total count") +
guides(colour=guide_legend(title="Discarded"))
As with the library size, we will log tranform the number of genes detected prior to filtering using the median absolute deviation.
low_n_features <- isOutlier(sce$detected, log=TRUE, type="lower")
table(low_n_features)
## low_n_features
## FALSE TRUE
## 81729 746
This has filter out 444 cells. The threshold value was:
attr(low_n_features, "thresholds")[1]
## lower
## 293.6232
We can view the effect of the filtering using plotColData
.
colData(sce)$low_n_features <- low_n_features
plotColData(sce,
x="Sample",
y="detected",
other_fields="SampleGroup",
colour_by = "low_n_features") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
labs(y = "Genes detected", title = "Genes detected") +
guides(colour=guide_legend(title="Discarded"))
For the mitochondrial content the exclusion zone is in the higher part of the distribution. For this reason we do not need to worry about log transforming the data as want to remove the long right hand tail anyway.
high_Mito_percent <- isOutlier(sce$subsets_Mito_percent, type="higher")
table(high_Mito_percent)
## high_Mito_percent
## FALSE TRUE
## 76924 5551
This has removed 5422 cells in total. The upper threshold value:
attr(high_Mito_percent, "thresholds")[2]
## higher
## 8.123782
We can view the effect of the filtering using plotColData
.
colData(sce)$high_Mito_percent <- high_Mito_percent
plotColData(sce,
x="Sample",
y="subsets_Mito_percent",
other_fields="SampleGroup",
colour_by = "high_Mito_percent") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
labs(y = "Percentage mitochondrial UMIs",
title = "Mitochondrial UMIs") +
guides(colour=guide_legend(title="Discarded"))
Having applied each of the three thresholds separately, we can now combind them to see how many droplets in total we will be excluding.
tibble(`Library Size`=sum(low_lib_size),
`Genes detected`=sum(low_n_features),
`Mitochondrial UMIs`=sum(high_Mito_percent),
Total=sum(low_lib_size | low_n_features | high_Mito_percent)) %>%
datatable(rownames = FALSE, options = list(dom="t"))
Clearly there is some overlap, as we would expect. We can visualise this with a Venn diagram.
colData(sce) %>%
as.data.frame() %>%
dplyr::select(`Library Size`=low_lib_size,
`Genes detected`=low_n_features,
`Mitochondrial UMIs`=high_Mito_percent) %>%
ggvenn(show_percentage = FALSE)
The three steps above may be run in one go using the quickPerCellQC
function.
reasons <- quickPerCellQC(colData(sce),
percent_subsets=c("subsets_Mito_percent"))
as.data.frame(reasons) %>%
summarise(across(everything(), sum)) %>%
dplyr::select(`Library Size`=low_lib_size,
`Genes detected`=low_n_features,
`Mitochondrial UMIs`=high_subsets_Mito_percent,
Total=discard) %>%
datatable(rownames = FALSE, options = list(dom="t"))
Data quality depends on the tissue analysed, some being difficult to dissociate, e.g. brain, so that one level of QC stringency will not fit all data sets.
Filtering based on QC metrics as done here assumes that these QC metrics are not correlated with biology. This may not necessarily be true in highly heterogenous data sets where some cell types represented by good-quality cells may have low RNA content or high mitochondrial content.
The two data sets analysed here may have been obtained in experiments with different settings, such as cell preparation or sequencing depth. Such differences between these two batches would affect the adaptive thesholds discussed above - that is, the distributions of the metrics may be different in each batch and so we should really apply the adaptive thresholding separately for each batch. The quickPerCellQC
has a “batch” argument that allows us to specify with samples belong to which batches. The batches are then filtered independently.
We will first need to add a “batch” column to the droplet annotation that identifies samples from the Caron paper and samples from the HCA.
batch.reasons <- quickPerCellQC(colData(sce),
percent_subsets=c("subsets_Mito_percent"),
batch=sce$DatasetName)
as.data.frame(batch.reasons) %>%
summarise(across(everything(), sum)) %>%
dplyr::select(`Library Size`=low_lib_size,
`Genes detected`=low_n_features,
`Mitochondrial UMIs`=high_subsets_Mito_percent,
Total=discard) %>%
datatable(rownames = FALSE, options = list(dom="t"))
The table below shows how the thresholds for each metric differ between the batch-wise analysis and the analysis using all samples.
all.thresholds <- tibble(`Batch`="All",
`Library Size`=attr(reasons$low_lib_size, "thresholds")[1],
`Genes detected`=attr(reasons$low_n_features, "thresholds")[1],
`Mitochondrial UMIs`=attr(reasons$high_subsets_Mito_percent, "thresholds")[2])
tibble(`Batch`=names(attr(batch.reasons$low_lib_size, "thresholds")[1,]),
`Library Size`=attr(batch.reasons$low_lib_size, "thresholds")[1,],
`Genes detected`=attr(batch.reasons$low_n_features, "thresholds")[1,],
`Mitochondrial UMIs`=attr(batch.reasons$high_subsets_Mito_percent, "thresholds")[2,]) %>%
bind_rows(all.thresholds) %>%
mutate(across(where(is.numeric), round, digits=2)) %>%
datatable(rownames = FALSE, options = list(dom="t"))
Let’s replace the columns in the droplet annotation with these new filters.
sce$low_lib_size <- batch.reasons$low_lib_size
sce$low_n_features <- batch.reasons$low_n_features
sce$high_Mito_percent <- batch.reasons$high_subsets_Mito_percent
sce$discard <- batch.reasons$discard
We can visualise how the new filters look using violin plots.
p.lib <- plotColData(sce,
x="Sample",
y="sum",
other_fields="SampleGroup",
colour_by = "low_lib_size") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
labs(y = "Total count", title = "Total count") +
guides(colour=guide_legend(title="Discarded"))
p.ng <- plotColData(sce,
x="Sample",
y="detected",
other_fields="SampleGroup",
colour_by = "low_n_features") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
labs(y = "Genes detected", title = "Genes detected") +
guides(colour=guide_legend(title="Discarded"))
p.mt <- plotColData(sce,
x="Sample",
y="subsets_Mito_percent",
other_fields="SampleGroup",
colour_by = "high_Mito_percent") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
labs(y = "Percentage mitochondrial UMIs",
title = "Mitochondrial UMIs") +
guides(colour=guide_legend(title="Discarded"))
p.lib / p.ng / p.mt
There are some distinct differences, most noticeable is that there is now no filtering using library size. The venn diagrams below show how the number of discarded droplets in HCA and Caron have changed for each filter in comparison to when the MAD filtering was applied across all samples.
libDat <- tibble(`All together`=reasons$low_lib_size,
`By batch`=batch.reasons$low_lib_size,
Batch=sce$DatasetName)
ph1 <- libDat %>%
dplyr::filter(Batch=="HCA") %>%
ggvenn(show_percentage = FALSE) +
labs(title="Library Size - HCA")
pc1 <- libDat %>%
dplyr::filter(Batch=="Caron") %>%
ggvenn(show_percentage = FALSE) +
labs(title="Library Size - Caron")
nGenDat <- tibble(`All together`=reasons$low_n_features,
`By batch`=batch.reasons$low_n_features,
Batch=sce$DatasetName)
ph2 <- nGenDat %>%
dplyr::filter(Batch=="HCA") %>%
ggvenn(show_percentage = FALSE) +
labs(title="Genes detected - HCA")
pc2 <- nGenDat %>%
dplyr::filter(Batch=="Caron") %>%
ggvenn(show_percentage = FALSE) +
labs(title="Genes detected - Caron")
mitDat <- tibble(`All together`=reasons$high_subsets_Mito_percent,
`By batch`=batch.reasons$high_subsets_Mito_percent,
Batch=sce$DatasetName)
ph3 <- mitDat %>%
dplyr::filter(Batch=="HCA") %>%
ggvenn(show_percentage = FALSE) +
labs(title="Mitochondrial UMIs - HCA")
pc3 <- mitDat %>%
dplyr::filter(Batch=="Caron") %>%
ggvenn(show_percentage = FALSE) +
labs(title="Mitochondrial UMIs - Caron")
(pc1 + pc2 + pc3) / (ph1 + ph2 + ph3)
The most striking difference is in the filtering of the Caron data by library size. As we can see from the violin plots the ABMMC samples from the HCA have a radically different library size distribution to the Caron samples, with all cells having > 1000 UMIs. When we applied the adaptive filters across all samples, these two samples caused the MADs to be distorted and resulted in a threshold that was inappropriately high for the Caron samples.
In the previous approach we used the three metrics in isolation to filter droplets. Another approach is to combine the three (or more) metrics in a single filtering step by looking for outliers in the multi-dimensional space defined by the metrics.
As with the adaptive thresholds above, this method should not be applied across batches or samples with differing distributions in the metrics or it will exclude many good quality cells. To demonstrate these methods, we’ll just extract one sample from our SingleCellExperiment object.
sce.BM1 <- sce[ , sce$Sample == "ABMMC_1"]
Essentially we need to reduce our 3 metrics to a single metric, we can then use isOutlier
to select outliers based on this metric. One way to do this is to use the function adjOutlyingness
from the robustbase
package. This function computes the “outlyingness” for each droplet.
Here we will use the same three metrics as before: library size, the number of genes detected and the mitochondrial content. Remember that for “sum” (total UMIs) and “detected” (number of genes detected), we want to use the log10
value.
library(robustbase)
stats <- cbind(log10(sce.BM1$sum),
log10(sce.BM1$detected),
sce.BM1$subsets_Mito_percent)
outlying <- adjOutlyingness(stats, only.outlyingness = TRUE)
multi.outlier <- isOutlier(outlying, type = "higher")
summary(multi.outlier)
## Mode FALSE TRUE
## logical 21972 754
Another approach is to perform a principal component analysis (PCA) on the table of metrics, apply adjOutlyingness
to the metrics table and use this to detect outliers. The scater
function runColDataPCA
can be used to perform the PCA and detect outliers. We’ll need to add a couple of columns to the colData
for the log10 metrics first.
sce.BM1$log10sum <- log10(sce.BM1$sum)
sce.BM1$log10detected <- log10(sce.BM1$detected)
sce.BM1 <- runColDataPCA(sce.BM1,
variables=list("log10sum", "log10detected", "subsets_Mito_percent"),
outliers=TRUE,
BPPARAM = bp.params)
This has added the results of the principal component analysis into a new slot in the SingleCellExperiment object specifically for holding the results of dimension reduction transformations such as PCA, t-SNE and UMAP. The results can be accessed using the reducedDim
function.
head(reducedDim(sce.BM1))
## PC1 PC2
## 1_AAACCTGAGAAACCTA-1 0.1691270 0.25778928
## 1_AAACCTGAGACAAGCC-1 0.9679269 0.56071696
## 1_AAACCTGAGAGACTTA-1 -4.2614190 0.09282786
## 1_AAACCTGAGCAACGGT-1 -3.7302602 0.21828016
## 1_AAACCTGAGCAGCGTA-1 0.7472166 -0.50376807
## 1_AAACCTGAGCAGGTCA-1 0.5764579 -0.04517415
It has also added a column “outlier” to the colData
, which specifies the droplets that have been identified as outliers.
summary(sce.BM1$outlier)
## Mode FALSE TRUE
## logical 22425 301
These types of approach can provide more power for detecting outliers as they are looking at patterns across multiple metrics, however, it can be difficult to interpret the reason why any particular droplet has been excluded.
A useful diagnostic plot for assessing the impact of the filtering is to do a scatter plot of the mitochondrial content against the library size. We can overlay our final filter metric using the point colour.
plotColData(sce,
x="sum",
y="subsets_Mito_percent",
other_fields="Sample",
colour_by="discard") +
facet_wrap(~Sample, ncol=5, scale="free_x")
We will now exclude poor-quality cells.
sce.Filtered <- sce[,!sce$discard]
sce.Filtered
## class: SingleCellExperiment
## dim: 29256 77092
## metadata(1): Samples
## assays(1): counts
## rownames(29256): ENSG00000243485 ENSG00000238009 ... ENSG00000278633
## ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(77092): 1_AAACCTGAGAAACCTA-1 1_AAACCTGAGACAAGCC-1 ...
## 10_TTTGTCATCTACCAGA-1 10_TTTGTCATCTGCAGTA-1
## colData names(15): Sample Barcode ... high_Mito_percent discard
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
We also write the R object to file to use later if need be.
saveRDS(sce.Filtered, "results/sce_filtered.rds")
The approach above identified poor-quality using thresholds on the number of genes detected and mitochondrial content. We will here specifically look at the sparsity of the data, both at the gene and cell levels.
To illustrate this we will just consider the Caron samples.
sce.caron <- sce[,sce$DatasetName=="Caron"]
We will compute:
To help calculate the gene sparsity we can generate QC metrics for genes with addPerFeatureQC
. This adds two columns to the gene annotation (rowData
):
sce.caron <- addPerFeatureQC(sce.caron, BPPARAM = bp.params)
head(rowData(sce.caron)) %>%
as.data.frame() %>%
datatable(rownames = FALSE)
Now we can calculate sparsity using the “detected” columns in the colData
and the rowData
.
colData(sce.caron)$cell_sparsity <- 1 - (colData(sce.caron)$detected / nrow(sce.caron))
rowData(sce.caron)$gene_sparsity <- (100 - rowData(sce.caron)$detected) / 100
We now plot the distribution of these two metrics.
The cell sparsity plot shows that most cells have between 85% and 99% 0’s, which is typical.
hist(sce.caron$cell_sparsity, breaks=50, col="grey80", xlab="Cell sparsity", main="")
The gene sparsity plot shows that a large number of genes are almost never detected, which is also regularly observed.
hist(rowData(sce.caron)$gene_sparsity, breaks=50, col="grey80", xlab="Gene sparsity", main="")
We will remove cells with sparsity higher than 0.99, and/or mitochondrial content higher than 10%.
Genes detected in a few cells only are unlikely to be informative and would hinder normalisation. We will remove genes that are expressed in fewer than 20 cells.
sparse.cells <- sce.caron$cell_sparsity > 0.99
mito.cells <- sce.caron$subsets_Mito_percent > 10
min.cells <- 1 - (20 / ncol(sce.caron))
sparse.genes <- rowData(sce.caron)$gene_sparsity > min.cells
Number of genes removed:
table(sparse.genes)
## sparse.genes
## FALSE TRUE
## 19294 9962
Number of cells removed:
table(sparse.cells, mito.cells)
## mito.cells
## sparse.cells FALSE TRUE
## FALSE 30647 3434
## TRUE 367 249
sce.caron <- sce.caron[!sparse.genes, !(sparse.cells|mito.cells)]
sce.caron
## class: SingleCellExperiment
## dim: 19294 30647
## metadata(1): Samples
## assays(1): counts
## rownames(19294): ENSG00000238009 ENSG00000241860 ... ENSG00000275063
## ENSG00000278817
## rowData names(7): ID Symbol ... detected gene_sparsity
## colnames(30647): 3_AAACCTGAGACTTTCG-1 3_AAACCTGGTCTTCAAG-1 ...
## 10_TTTGTCATCTACCAGA-1 10_TTTGTCATCTGCAGTA-1
## colData names(16): Sample Barcode ... discard cell_sparsity
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Note: An important thing to note is that, now that we have filtered this object, some of the QC metrics that we calculated across all genes (for colData
) and across all cells (for rowData
) are no longer correct for the filtered data set. We need to remove them, and if necessary recalculate them.
colData(sce.caron) <- colData(sce.caron)[,1:3]
sce.caron <- addPerCellQC(sce.caron, BPPARAM = bp.params)
rowData(sce.caron) <- rowData(sce.caron)[,1:4]
sce.caron <- addPerFeatureQC(sce.caron, BPPARAM = bp.params)
The number of genes per UMI for each cell informs on the level of sequencing saturation achieved ( hbctraining). For a given cell, as sequencing depth increases each extra UMI is less likely to correspond to a gene not already detected in that cell. Cells with small library size tend to have higher overall ‘novelty’, i.e. they have not reached saturation for any given gene.
Here we see that some cells have fewer genes detected per UMI than in others. This may suggest dying cell, which should be filtered out, or these might represent a cell type with a less complex suite of genes being expressed, e.g. red blood cells, which you may or may not wish to filter out.
colData(sce) %>%
as.data.frame() %>%
arrange(subsets_Mito_percent) %>%
ggplot(aes(x=sum, y=detected)) +
geom_point(aes(color=subsets_Mito_percent)) +
stat_smooth(method=lm) +
scale_x_log10() +
scale_y_log10() +
facet_wrap(vars(SampleGroup, Sample), ncol=5, dir="v")
We can calculate a “novelty” metrics that related the number of genes detected to the total number of UMIs for each cell:
\(\frac{\log_{10}(Number\ of\ Genes)}{\log_{10}(Number\ of\ UMIs)}\)
colData(sce)$log10GenesPerUMI <- log10(colData(sce)$detected) / log10(colData(sce)$sum)
The expected novelty is 0.8 or higher.
colData(sce) %>%
as.data.frame() %>%
ggplot(aes(x=log10GenesPerUMI)) +
geom_density(aes(color = Sample, fill = Sample), alpha=0.6) +
facet_wrap(vars(SampleGroup), ncol=10)
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] robustbase_0.93-8 ggvenn_0.1.9
## [3] patchwork_1.1.1 forcats_0.5.1
## [5] stringr_1.4.0 dplyr_1.0.6
## [7] purrr_0.3.4 readr_1.4.0
## [9] tidyr_1.1.3 tibble_3.1.2
## [11] tidyverse_1.3.1 BiocParallel_1.26.0
## [13] AnnotationHub_3.0.0 BiocFileCache_2.0.0
## [15] dbplyr_2.1.1 ensembldb_2.16.0
## [17] AnnotationFilter_1.16.0 GenomicFeatures_1.44.0
## [19] AnnotationDbi_1.54.1 scater_1.20.0
## [21] ggplot2_3.3.3 scuttle_1.2.0
## [23] DropletUtils_1.12.1 SingleCellExperiment_1.14.1
## [25] SummarizedExperiment_1.22.0 Biobase_2.52.0
## [27] GenomicRanges_1.44.0 GenomeInfoDb_1.28.0
## [29] IRanges_2.26.0 S4Vectors_0.30.0
## [31] BiocGenerics_0.38.0 MatrixGenerics_1.4.0
## [33] matrixStats_0.59.0 knitr_1.33
## [35] DT_0.18
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.2.1
## [3] lazyeval_0.2.2 splines_4.1.0
## [5] crosstalk_1.1.1 digest_0.6.27
## [7] htmltools_0.5.1.1 viridis_0.6.1
## [9] fansi_0.5.0 magrittr_2.0.1
## [11] memoise_2.0.0 ScaledMatrix_1.0.0
## [13] limma_3.48.0 Biostrings_2.60.1
## [15] modelr_0.1.8 R.utils_2.10.1
## [17] prettyunits_1.1.1 colorspace_2.0-1
## [19] rvest_1.0.0 blob_1.2.1
## [21] rappdirs_0.3.3 haven_2.4.1
## [23] xfun_0.23 crayon_1.4.1
## [25] RCurl_1.98-1.3 jsonlite_1.7.2
## [27] glue_1.4.2 gtable_0.3.0
## [29] zlibbioc_1.38.0 XVector_0.32.0
## [31] DelayedArray_0.18.0 BiocSingular_1.8.1
## [33] Rhdf5lib_1.14.1 DEoptimR_1.0-9
## [35] HDF5Array_1.20.0 scales_1.1.1
## [37] DBI_1.1.1 edgeR_3.34.0
## [39] Rcpp_1.0.6 viridisLite_0.4.0
## [41] xtable_1.8-4 progress_1.2.2
## [43] dqrng_0.3.0 bit_4.0.4
## [45] rsvd_1.0.5 htmlwidgets_1.5.3
## [47] httr_1.4.2 ellipsis_0.3.2
## [49] farver_2.1.0 pkgconfig_2.0.3
## [51] XML_3.99-0.6 R.methodsS3_1.8.1
## [53] sass_0.4.0 locfit_1.5-9.4
## [55] utf8_1.2.1 labeling_0.4.2
## [57] tidyselect_1.1.1 rlang_0.4.11
## [59] later_1.2.0 cellranger_1.1.0
## [61] munsell_0.5.0 BiocVersion_3.13.1
## [63] tools_4.1.0 cachem_1.0.5
## [65] cli_2.5.0 generics_0.1.0
## [67] RSQLite_2.2.7 broom_0.7.7
## [69] evaluate_0.14 fastmap_1.1.0
## [71] yaml_2.2.1 fs_1.5.0
## [73] bit64_4.0.5 KEGGREST_1.32.0
## [75] nlme_3.1-152 sparseMatrixStats_1.4.0
## [77] mime_0.10 R.oo_1.24.0
## [79] xml2_1.3.2 biomaRt_2.48.1
## [81] rstudioapi_0.13 compiler_4.1.0
## [83] beeswarm_0.4.0 filelock_1.0.2
## [85] curl_4.3.1 png_0.1-7
## [87] interactiveDisplayBase_1.30.0 reprex_2.0.0
## [89] bslib_0.2.5.1 stringi_1.6.2
## [91] highr_0.9 lattice_0.20-44
## [93] ProtGenerics_1.24.0 Matrix_1.3-4
## [95] vctrs_0.3.8 pillar_1.6.1
## [97] lifecycle_1.0.0 rhdf5filters_1.4.0
## [99] BiocManager_1.30.15 jquerylib_0.1.4
## [101] BiocNeighbors_1.10.0 cowplot_1.1.1
## [103] bitops_1.0-7 irlba_2.3.3
## [105] httpuv_1.6.1 rtracklayer_1.52.0
## [107] R6_2.5.0 BiocIO_1.2.0
## [109] promises_1.2.0.1 gridExtra_2.3
## [111] vipor_0.4.5 codetools_0.2-18
## [113] assertthat_0.2.1 rhdf5_2.36.0
## [115] rjson_0.2.20 withr_2.4.2
## [117] GenomicAlignments_1.28.0 Rsamtools_2.8.0
## [119] GenomeInfoDbData_1.2.6 mgcv_1.8-36
## [121] hms_1.1.0 beachmat_2.8.0
## [123] rmarkdown_2.8 DelayedMatrixStats_1.14.0
## [125] Cairo_1.5-12.2 lubridate_1.7.10
## [127] shiny_1.6.0 ggbeeswarm_0.6.0
## [129] restfulr_0.0.13