In the course materials we performed QC and filtering of 2 samples from each of the sample groups. For this challenge we would like you to perform QC and filtering on all of the samples from the Caron data set.
We have prepared a SingleCellExperiment object with the filtered count data from CellRanger already loaded. We have also already added the sample meta data and annotated the genes with their chromosome. Essentially, we have run all the steps up to and including section 6.0.1 in the course materials book.
While running the QC and particularly the filtering, you should consider the experimental set up and will need to decide on whether to apply the filters across all samples at once or whether you should separate the samples into batches for the adapative filters, and at what level the “batch” should be defined.
library(DropletUtils)
library(scater)
library(BiocParallel)
library(tidyverse)
library(patchwork)
library(ggvenn)
library(DT)
We will load both the Caron and HCA data sets. We have already prepared a sample meta data table that relates the sample/run ID to the sample group. This is the metadata for all samples from both data sets.
samplesheet <- read_tsv("../Data/sample_sheet.tsv")
samplesheet %>%
as.data.frame() %>%
datatable(rownames = FALSE, options = list(dom="tpl", nrows=20))
We first need to first set up some parallel parameters using the package BiocParallel
.
bp.params <- MulticoreParam(workers = 7)
sce <- readRDS("../Robjects/Caron_data.sce.annot.rds")
Use the colData
function to check that the samples you have in the sce
object are the ones you are expecting (all of the Caron data set).
colData(sce) %>%
as.data.frame() %>%
select(Sample, SampleId, SampleGroup, DatasetName) %>%
distinct()
## Sample SampleId SampleGroup DatasetName
## 1_AAACCTGAGACTTTCG-1 ETV6-RUNX1_1 SRR9264343 ETV6-RUNX1 Caron
## 2_AAACCTGAGCAGACTG-1 ETV6-RUNX1_2 SRR9264344 ETV6-RUNX1 Caron
## 3_AAACCTGAGGTGATTA-1 ETV6-RUNX1_3 SRR9264345 ETV6-RUNX1 Caron
## 4_AAACCTGAGAGTGACC-1 ETV6-RUNX1_4 SRR9264346 ETV6-RUNX1 Caron
## 5_AAACCTGAGCCAGAAC-1 HHD_1 SRR9264347 HHD Caron
## 6_AAACCTGAGCTAGCCC-1 HHD_2 SRR9264348 HHD Caron
## 7_AAACCTGAGAGCTGCA-1 PRE-T_1 SRR9264349 PRE-T Caron
## 8_AAACCTGAGAACAACT-1 PRE-T_2 SRR9264350 PRE-T Caron
## 9_AAACCTGCACTTCGAA-1 PBMMC_1a SRR9264351 PBMMC Caron
## 10_AAACCTGAGATGAGAG-1 PBMMC_1b SRR9264352 PBMMC Caron
## 11_AAACCTGAGAAACGAG-1 PBMMC_2 SRR9264353 PBMMC Caron
## 12_AAACCTGAGGACAGCT-1 PBMMC_3 SRR9264354 PBMMC Caron
Although the count matrix has genes, many of these will not have been detected in any droplet.
What fraction of the genes are not detected?
detected_genes <- rowSums(counts(sce)) > 0
table(detected_genes)
## detected_genes
## FALSE TRUE
## 8224 28377
Remove these before proceeding in order to reduce the size of the single cell experiment object.
sce <- sce[detected_genes,]
We will look at three QC metrics:
Add the per cell QC metrics to the droplet annotation using the function addPerCellQC
. In order to get the percentage of mitochondrial UMIs in each cell, your will need to pass the function a vector indicating which genes are mitochondrial.
is.mito <- which(rowData(sce)$Chromosome=="MT")
sce <- addPerCellQC(sce, subsets=list(Mito=is.mito), BPPARAM = bp.params)
Check that the function has added six columns to the droplet annotation:
head(colData(sce)) %>%
as.data.frame() %>%
datatable(rownames = FALSE)
Plot the distributions of the three metrics of interest for each sample. We have given you the code for the library size, you will need to modify this for the other two metrics. Check the course materials if you need to figure out which column to use.
plotColData(sce, x="Sample", y="sum",other_fields="SampleGroup") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
ggtitle("Total count")
plotColData(sce, x="Sample", y="detected", other_fields="SampleGroup") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
ggtitle("Detected features")
plotColData(sce, x="Sample", y="subsets_Mito_percent", other_fields="SampleGroup") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
ggtitle("Mito percent")
First we will apply the adaptive threshold filtering across all samples together. We have given you the code for filtering the by library size, you will need to add the commands for filtering on the number of genes detected and the mitochondrial content (make sure you filter the correct tail of the distribution).
low_lib_size <- isOutlier(sce$sum, log=TRUE, type="lower")
sce$low_lib_size <- low_lib_size
low_n_features <- isOutlier(sce$detected, log=TRUE, type="lower")
sce$low_n_features <- low_n_features
high_Mito_percent <- isOutlier(sce$subsets_Mito_percent, type="higher")
sce$high_Mito_percent <- high_Mito_percent
Let’s have a look at how these filters would affect the data set.
plotColData(sce,
x="Sample",
y="sum",
other_fields="SampleGroup",
colour_by = "low_lib_size") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
labs(y = "Total count", title = "Total count") +
guides(colour=guide_legend(title="Discarded"))
plotColData(sce,
x="Sample",
y="detected",
other_fields="SampleGroup",
colour_by = "low_n_features") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
labs(y = "Genes detected", title = "Genes detected") +
guides(colour=guide_legend(title="Discarded"))
plotColData(sce,
x="Sample",
y="subsets_Mito_percent",
other_fields="SampleGroup",
colour_by = "high_Mito_percent") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
labs(y = "Percentage mitochondrial UMIs",
title = "Mitochondrial UMIs") +
guides(colour=guide_legend(title="Discarded"))
Based on the experimental design and the plots above, d you think applying the adaptive filtering across all the samples is appropriate?
Perhaps a batched approach is more suitable. This time use the quickPerCellQC
function to perform all three filters and add a parameter for the batch (this could be any column in the colData
or you could add a new one).
batch.reasons <- quickPerCellQC(colData(sce),
percent_subsets=c("subsets_Mito_percent"),
batch=sce$Sample)
The table below shows how the thresholds for each metric differ between the batch-wise analysis and the analysis using all samples.
all.thresholds <- tibble(`Batch`="All",
`Library Size`=attr(low_lib_size, "thresholds")[1],
`Genes detected`=attr(low_n_features, "thresholds")[1],
`Mitochondrial UMIs`=attr(high_Mito_percent, "thresholds")[2])
tibble(`Batch`=names(attr(batch.reasons$low_lib_size, "thresholds")[1,]),
`Library Size`=attr(batch.reasons$low_lib_size, "thresholds")[1,],
`Genes detected`=attr(batch.reasons$low_n_features, "thresholds")[1,],
`Mitochondrial UMIs`=attr(batch.reasons$high_subsets_Mito_percent, "thresholds")[2,]) %>%
bind_rows(all.thresholds) %>%
mutate(across(where(is.numeric), round, digits=2)) %>%
datatable(rownames = FALSE)
Once you have decided how best to filter the data you should remove the droplets that do not pass your criteria from the data set.
Exclude poor-quality cells from the SingleCellExperiment object and save the final object to the results
directory for later use.
sce.Filtered <- sce[,!batch.reasons$discard]
saveRDS(sce.Filtered, "results/Caron_filtered.rds")
You may additionally wish to filter the genes further to remove very sparse genes (see section 12). Remember that you will need to recalculate the colData metrics, as some of them were calculated across all genes. Save this final object as results/Caron_filtered_genes.rds.
sce <- sce.Filtered
sce <- addPerFeatureQC(sce, BPPARAM = bp.params)
rowData(sce)$gene_sparsity <- (100 - rowData(sce)$detected) / 100
min.cells <- 1 - (20 / ncol(sce))
sparse.genes <- rowData(sce)$gene_sparsity > min.cells
Generate a table showing the number of genes removed
table(sparse.genes)
## sparse.genes
## FALSE TRUE
## 19967 8410
Filter the genes.
sce <- sce[!sparse.genes, ]
Replace the Cell metrics
colData(sce) <- colData(sce)[,1:3]
sce <- addPerCellQC(sce, BPPARAM = bp.params)
Remove the suffix letters so that the droplets are assigned to the same sample - note that we can still distinguish the libraries based on the “SampleId” column.
colData(sce) <- colData(sce) %>%
as.data.frame() %>%
mutate(across(Sample, str_remove, "[ab]$")) %>%
DataFrame()
Write the object out to a file in the results
directory for later use.
saveRDS(sce, "results/Caron_filtered_genes.rds")
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] DT_0.18 ggvenn_0.1.9
## [3] patchwork_1.1.1 forcats_0.5.1
## [5] stringr_1.4.0 dplyr_1.0.6
## [7] purrr_0.3.4 readr_1.4.0
## [9] tidyr_1.1.3 tibble_3.1.2
## [11] tidyverse_1.3.1 BiocParallel_1.26.0
## [13] scater_1.20.0 ggplot2_3.3.3
## [15] scuttle_1.2.0 DropletUtils_1.12.1
## [17] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
## [19] Biobase_2.52.0 GenomicRanges_1.44.0
## [21] GenomeInfoDb_1.28.0 IRanges_2.26.0
## [23] S4Vectors_0.30.0 BiocGenerics_0.38.0
## [25] MatrixGenerics_1.4.0 matrixStats_0.59.0
##
## loaded via a namespace (and not attached):
## [1] fs_1.5.0 bitops_1.0-7
## [3] lubridate_1.7.10 httr_1.4.2
## [5] tools_4.1.0 backports_1.2.1
## [7] bslib_0.2.5.1 utf8_1.2.1
## [9] R6_2.5.0 irlba_2.3.3
## [11] HDF5Array_1.20.0 vipor_0.4.5
## [13] DBI_1.1.1 colorspace_2.0-1
## [15] rhdf5filters_1.4.0 withr_2.4.2
## [17] tidyselect_1.1.1 gridExtra_2.3
## [19] compiler_4.1.0 cli_2.5.0
## [21] rvest_1.0.0 BiocNeighbors_1.10.0
## [23] xml2_1.3.2 DelayedArray_0.18.0
## [25] sass_0.4.0 scales_1.1.1
## [27] digest_0.6.27 rmarkdown_2.8
## [29] R.utils_2.10.1 XVector_0.32.0
## [31] pkgconfig_2.0.3 htmltools_0.5.1.1
## [33] sparseMatrixStats_1.4.0 dbplyr_2.1.1
## [35] limma_3.48.0 htmlwidgets_1.5.3
## [37] readxl_1.3.1 rlang_0.4.11
## [39] rstudioapi_0.13 DelayedMatrixStats_1.14.0
## [41] jquerylib_0.1.4 generics_0.1.0
## [43] jsonlite_1.7.2 R.oo_1.24.0
## [45] RCurl_1.98-1.3 magrittr_2.0.1
## [47] BiocSingular_1.8.1 GenomeInfoDbData_1.2.6
## [49] Matrix_1.3-4 Rcpp_1.0.6
## [51] ggbeeswarm_0.6.0 munsell_0.5.0
## [53] Rhdf5lib_1.14.1 fansi_0.5.0
## [55] viridis_0.6.1 lifecycle_1.0.0
## [57] R.methodsS3_1.8.1 stringi_1.6.2
## [59] yaml_2.2.1 edgeR_3.34.0
## [61] zlibbioc_1.38.0 rhdf5_2.36.0
## [63] dqrng_0.3.0 crayon_1.4.1
## [65] lattice_0.20-44 haven_2.4.1
## [67] beachmat_2.8.0 hms_1.1.0
## [69] locfit_1.5-9.4 knitr_1.33
## [71] pillar_1.6.1 codetools_0.2-18
## [73] ScaledMatrix_1.0.0 reprex_2.0.0
## [75] glue_1.4.2 evaluate_0.14
## [77] modelr_0.1.8 vctrs_0.3.8
## [79] cellranger_1.1.0 gtable_0.3.0
## [81] assertthat_0.2.1 xfun_0.23
## [83] rsvd_1.0.5 broom_0.7.7
## [85] viridisLite_0.4.0 beeswarm_0.4.0
## [87] ellipsis_0.3.2