In the course materials we performed QC and filtering of 2 samples from each of the sample groups. For this challenge we would like you to perform QC and filtering on all of the samples from the Caron data set.
library(DropletUtils)
library(scater)
library(ensembldb)
library(AnnotationHub)
library(BiocParallel)
library(tidyverse)
library(patchwork)
library(ggvenn)
We will load both the Caron and HCA data sets. We have already prepared a sample meta data table that relates the sample/run ID to the sample group.
samplesheet <- read_tsv("Data/sample_sheet.tsv")
NOTE: There are two PBMMC_1 samples. These are two libraries from the same sample material. Later on, we will combine these to form a single sample, but for now we will need to rename them so that they are QC’d separately.
We first need to first set up some parallel parameters using the package BiocParallel
.
bp.params <- MulticoreParam(workers = 7)
In order to load the CellRanger data for all of the Caron samples, you will first need to create a named vector of the paths to the filtered count matrix folders called list_of_files
and then use this in the read10xCounts
command.
The paths to filtered_feature_bc_matrix
directories for each sample can be constructed using the SampleId as:
CellRanger_Outputs/SampleId/outs/filtered_feature_bc_matrix
You will need to use a function such as str_c
or paste
samples_list <- samplesheet %>%
filter(DatasetName=="Caron") %>%
pull(SampleId)
list_of_files <- str_c("CellRanger_Outputs/",
samples_list,
"/outs/filtered_feature_bc_matrix")
names(list_of_files) <- samples_list
sce <- read10xCounts(list_of_files, col.names=TRUE, BPPARAM = bp.params)
A good sanity check is to look at the colData to ensure that we have all the samples we are expecting and check the overall size of the new object.
colData(sce) %>%
as.data.frame() %>%
select(Sample) %>%
distinct()
sce
## class: SingleCellExperiment
## dim: 36601 51286
## metadata(1): Samples
## assays(1): counts
## rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817 ENSG00000277196
## rowData names(3): ID Symbol Type
## colnames(51286): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ... 12_TTTGTCATCAGTTGAC-1
## 12_TTTGTCATCTCGTTTA-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
We need to modify the cell barcodes so that they unique for each sample. We should also modify the names of the two PBMMC_1 samples, so that we can distinguish them from one another.
We will also switch the “Sample” column to be the sample name and add information from the sample sheet to the droplet annotation.
In the code below, replace the “XXXXXXXX” to rename the PBMMC_1 samples appropriately.
colData(sce) <- colData(sce) %>%
as.data.frame() %>%
rownames_to_column("RowName") %>%
mutate(SampleNum = str_extract(RowName, "^[0-9]+")) %>%
mutate(Barcode = str_replace(Barcode, "1$", SampleNum)) %>%
left_join(samplesheet, by=c(Sample="SampleId")) %>%
rename(SampleId=Sample) %>%
rename(Sample=SampleName) %>%
mutate(Sample = case_when(
SampleId == XXXXXXXX ~ str_c(Sample, "a"),
SampleId == XXXXXXXX ~ str_c(Sample, "b"),
TRUE ~ Sample)) %>%
column_to_rownames("RowName") %>%
select(Sample, Barcode, SampleId, SampleGroup, DatasetName) %>%
DataFrame()
SampleId
columns contains the “SRR” numbers, as does the SampleId in the sample sheet. Replace the XXXXXXXXX with the appropriate “SRR” number by checking for the PBMMC_1 samples in the samplesheet.
colData(sce) <- colData(sce) %>%
as.data.frame() %>%
rownames_to_column("RowName") %>%
mutate(SampleNum = str_extract(RowName, "^[0-9]+")) %>%
mutate(Barcode = str_replace(Barcode, "1$", SampleNum)) %>%
left_join(samplesheet, by=c(Sample="SampleId")) %>%
rename(SampleId=Sample) %>%
rename(Sample=SampleName) %>%
mutate(Sample = case_when(
SampleId == "SRR9264351" ~ str_c(Sample, "a"),
SampleId == "SRR9264352" ~ str_c(Sample, "b"),
TRUE ~ Sample)) %>%
column_to_rownames("RowName") %>%
select(Sample, Barcode, SampleId, SampleGroup, DatasetName) %>%
DataFrame()
Before moving on, check that the columns are as we expect them to be and that the PBMMC_1 samples have been renamed:
colData(sce)
## DataFrame with 51286 rows and 5 columns
## Sample Barcode SampleId SampleGroup DatasetName
## <character> <character> <character> <character> <character>
## 1_AAACCTGAGACTTTCG-1 ETV6-RUNX1_1 AAACCTGAGACTTTCG-1 SRR9264343 ETV6-RUNX1 Caron
## 1_AAACCTGGTCTTCAAG-1 ETV6-RUNX1_1 AAACCTGGTCTTCAAG-1 SRR9264343 ETV6-RUNX1 Caron
## 1_AAACCTGGTGCAACTT-1 ETV6-RUNX1_1 AAACCTGGTGCAACTT-1 SRR9264343 ETV6-RUNX1 Caron
## 1_AAACCTGGTGTTGAGG-1 ETV6-RUNX1_1 AAACCTGGTGTTGAGG-1 SRR9264343 ETV6-RUNX1 Caron
## 1_AAACCTGTCCCAAGTA-1 ETV6-RUNX1_1 AAACCTGTCCCAAGTA-1 SRR9264343 ETV6-RUNX1 Caron
## ... ... ... ... ... ...
## 12_TTTGTCAGTGTGAATA-1 PBMMC_3 TTTGTCAGTGTGAATA-12 SRR9264354 PBMMC Caron
## 12_TTTGTCAGTTCCACGG-1 PBMMC_3 TTTGTCAGTTCCACGG-12 SRR9264354 PBMMC Caron
## 12_TTTGTCAGTTCTCATT-1 PBMMC_3 TTTGTCAGTTCTCATT-12 SRR9264354 PBMMC Caron
## 12_TTTGTCATCAGTTGAC-1 PBMMC_3 TTTGTCATCAGTTGAC-12 SRR9264354 PBMMC Caron
## 12_TTTGTCATCTCGTTTA-1 PBMMC_3 TTTGTCATCTCGTTTA-12 SRR9264354 PBMMC Caron
unique(sce$Sample)
## [1] "ETV6-RUNX1_1" "ETV6-RUNX1_2" "ETV6-RUNX1_3" "ETV6-RUNX1_4" "HHD_1" "HHD_2"
## [7] "PRE-T_1" "PRE-T_2" "PBMMC_1a" "PBMMC_1b" "PBMMC_2" "PBMMC_3"
Remove undetected genes.
detected_genes <- rowSums(counts(sce)) > 0
sce <- sce[detected_genes,]
What proportion of genes have been detected
detected <- sum(detected_genes)/length(detected_genes)
detected
## [1] 0.7753067
Approximately 77.5 percent of genes were detected in at least one sample.
In order to assess the percentage of mitochondrial UMIs, we will need to be able to identify mitochondrial genes. The simplest way to do this is to annotate the genes with their chromosome of origin.
ah <- AnnotationHub()
ens.mm.98 <- query(ah, c("Homo sapiens", "EnsDb", 98))[[1]]
genes <- rowData(sce)$ID
gene_annot <- AnnotationDbi::select(ens.mm.98,
keys = genes,
keytype = "GENEID",
columns = c("GENEID", "SEQNAME")) %>%
set_names(c("ID", "Chromosome"))
rowData(sce) <- merge(rowData(sce), gene_annot, by = "ID", sort=FALSE)
rownames(rowData(sce)) <- rowData(sce)$ID
rowData(sce)
## DataFrame with 28377 rows and 4 columns
## ID Symbol Type Chromosome
## <character> <character> <character> <character>
## ENSG00000238009 ENSG00000238009 AL627309.1 Gene Expression 1
## ENSG00000239945 ENSG00000239945 AL627309.3 Gene Expression 1
## ENSG00000241860 ENSG00000241860 AL627309.5 Gene Expression 1
## ENSG00000241599 ENSG00000241599 AL627309.4 Gene Expression 1
## ENSG00000286448 ENSG00000286448 AP006222.2 Gene Expression 1
## ... ... ... ... ...
## ENSG00000274792 ENSG00000274792 AC171558.1 Gene Expression KI270727.1
## ENSG00000275869 ENSG00000275869 AC136612.1 Gene Expression KI270728.1
## ENSG00000277836 ENSG00000277836 AC141272.1 Gene Expression KI270728.1
## ENSG00000278633 ENSG00000278633 AC023491.2 Gene Expression KI270731.1
## ENSG00000278817 ENSG00000278817 AC007325.4 Gene Expression KI270734.1
Now add the per cell QC metrics to the droplet annotation using the function addPerCellQC
.
is.mito <- which(rowData(sce)$Chromosome=="MT")
sce <- addPerCellQC(sce, subsets=list(Mito=is.mito), BPPARAM = bp.params)
Before moving on to do the actual cell filtering, it is always a good idea to explore the distribution of the metrics across the droplets.
Use the scater
function plotColData
to generate plots showing the distributions of the total number of UMIs, the number of genes detected and percentage of UMIs aligned to mitochondrial genes across all cells for each sample.
plotColData(sce, x="Sample", y="sum",other_fields="SampleGroup") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
ggtitle("Total count")
plotColData(sce, x="Sample", y="detected", other_fields="SampleGroup") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
ggtitle("Detected features")
plotColData(sce, x="Sample", y="subsets_Mito_percent", other_fields="SampleGroup") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
ggtitle("Mito percent")
Use the scater function quickPerCellQC
to assess cell quality based on the three metrics. Name the object generated cell_qc_results
.
When running the command, consider the distribution plots above and decide whether to use the batch
option and if so, at what level it should be applied.
How many cells will be removed from the data set?
batch
), we could apply it by sample group (ETV6-RUNX1, HHD, PBMMC, PRE-T), or we could apply it per sample.
It is clear that there are significantly different distribution profiles between samples in the same group. Noticeably the distribution profiles for the two PBMMC_1 samples are quite different - it would appear that PBMMC_1b has a lower overal UMI count indicating less sequencing. For this reason it may be prudent to apply the adaptive filtering to each sample independently.
cell_qc_results <- quickPerCellQC(colData(sce),
percent_subsets=c("subsets_Mito_percent"),
batch=sce$Sample)
as.data.frame(cell_qc_results) %>% summarise(across(everything(), sum))
In total 0 cells will be removed from the dataset.
Let’s replace the columns in the droplet annotation with these new filters.
sce$low_lib_size <- cell_qc_results$low_lib_size
sce$low_n_features <- cell_qc_results$low_n_features
sce$high_Mito_percent <- cell_qc_results$high_subsets_Mito_percent
sce$discard <- cell_qc_results$discard
We can visualise how the new filters look using violin plots.
plotColData(sce,
x="Sample",
y="sum",
other_fields="SampleGroup",
colour_by = "low_lib_size") +
facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") +
scale_y_log10() +
labs(y = "Total count", title = "Total count") +
guides(colour=guide_legend(title="Discarded"))
plotColData(sce,
x="Sample",
y="detected",
other_fields="SampleGroup",
colour_by = "low_n_features") +
facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") +
scale_y_log10() +
labs(y = "Genes detected", title = "Genes detected") +
guides(colour=guide_legend(title="Discarded"))
plotColData(sce,
x="Sample",
y="subsets_Mito_percent",
other_fields="SampleGroup",
colour_by = "high_Mito_percent") +
facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") +
labs(y = "Percentage mitochondrial UMIs",
title = "Mitochondrial UMIs") +
guides(colour=guide_legend(title="Discarded"))
Filter out the poor quality cells, recalculate the cell QC metrics and save the filtered object.
sce <- sce[, !sce$discard]
colData(sce) <- colData(sce)[,1:3]
sce <- addPerCellQC(sce, BPPARAM = bp.params)
saveRDS(sce, "../results/Caron_filtered_genes.rds")
So far we have only eliminated genes that are undetected across all samples. In reality there will be a large number of genes that are only detected at low levels in a small number of cells - much of this will just be due to technical noise. We could further filter the genes using “sparsity”.
Look at section 8 of the documention. Filter the dataset to remove genes that have only been detected in fewer than 20 cells.
How many additional genes are filtered out?
sce <- addPerFeatureQC(sce, BPPARAM = bp.params)
rowData(sce)$gene_sparsity <- (100 - rowData(sce)$detected) / 100
max.sparsity <- 1 - (20 / ncol(sce))
sparse.genes <- rowData(sce)$gene_sparsity > max.sparsity
sce <- sce[!sparse.genes, ]
sum(sparse.genes)
## [1] 8410
An additional 8410 have been filtered out.
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
## [4] LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rmarkdown_2.10 ggvenn_0.1.9 patchwork_1.1.1
## [4] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7
## [7] purrr_0.3.4 readr_2.0.1 tidyr_1.1.3
## [10] tibble_3.1.4 tidyverse_1.3.1 BiocParallel_1.26.2
## [13] AnnotationHub_3.0.1 BiocFileCache_2.0.0 dbplyr_2.1.1
## [16] ensembldb_2.16.4 AnnotationFilter_1.16.0 GenomicFeatures_1.44.2
## [19] AnnotationDbi_1.54.1 scater_1.20.1 ggplot2_3.3.5
## [22] scuttle_1.2.1 DropletUtils_1.12.2 SingleCellExperiment_1.14.1
## [25] SummarizedExperiment_1.22.0 Biobase_2.52.0 GenomicRanges_1.44.0
## [28] GenomeInfoDb_1.28.2 IRanges_2.26.0 S4Vectors_0.30.0
## [31] BiocGenerics_0.38.0 MatrixGenerics_1.4.3 matrixStats_0.60.1
## [34] nvimcom_0.9-82
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.2.1 lazyeval_0.2.2
## [4] digest_0.6.27 htmltools_0.5.2 viridis_0.6.1
## [7] fansi_0.5.0 magrittr_2.0.1 memoise_2.0.0
## [10] ScaledMatrix_1.0.0 tzdb_0.1.2 limma_3.48.3
## [13] Biostrings_2.60.2 modelr_0.1.8 vroom_1.5.4
## [16] R.utils_2.10.1 prettyunits_1.1.1 colorspace_2.0-2
## [19] rvest_1.0.1 blob_1.2.2 rappdirs_0.3.3
## [22] haven_2.4.3 xfun_0.25 crayon_1.4.1
## [25] RCurl_1.98-1.4 jsonlite_1.7.2 glue_1.4.2
## [28] gtable_0.3.0 zlibbioc_1.38.0 XVector_0.32.0
## [31] DelayedArray_0.18.0 BiocSingular_1.8.1 Rhdf5lib_1.14.2
## [34] HDF5Array_1.20.0 scales_1.1.1 DBI_1.1.1
## [37] edgeR_3.34.0 Rcpp_1.0.7 viridisLite_0.4.0
## [40] xtable_1.8-4 progress_1.2.2 dqrng_0.3.0
## [43] bit_4.0.4 rsvd_1.0.5 httr_1.4.2
## [46] ellipsis_0.3.2 farver_2.1.0 pkgconfig_2.0.3
## [49] XML_3.99-0.7 R.methodsS3_1.8.1 sass_0.4.0
## [52] locfit_1.5-9.4 utf8_1.2.2 labeling_0.4.2
## [55] tidyselect_1.1.1 rlang_0.4.11 later_1.3.0
## [58] cellranger_1.1.0 munsell_0.5.0 BiocVersion_3.13.1
## [61] tools_4.1.2 cachem_1.0.6 cli_3.0.1
## [64] generics_0.1.0 RSQLite_2.2.8 broom_0.7.9
## [67] evaluate_0.14 fastmap_1.1.0 yaml_2.2.1
## [70] fs_1.5.0 knitr_1.33 bit64_4.0.5
## [73] KEGGREST_1.32.0 sparseMatrixStats_1.4.2 mime_0.11
## [76] R.oo_1.24.0 xml2_1.3.2 biomaRt_2.48.3
## [79] compiler_4.1.2 rstudioapi_0.13 beeswarm_0.4.0
## [82] filelock_1.0.2 curl_4.3.2 png_0.1-7
## [85] interactiveDisplayBase_1.30.0 reprex_2.0.1 bslib_0.2.5.1
## [88] stringi_1.7.4 highr_0.9 lattice_0.20-45
## [91] ProtGenerics_1.24.0 Matrix_1.3-4 vctrs_0.3.8
## [94] pillar_1.6.2 lifecycle_1.0.0 rhdf5filters_1.4.0
## [97] BiocManager_1.30.16 jquerylib_0.1.4 BiocNeighbors_1.10.0
## [100] cowplot_1.1.1 bitops_1.0-7 irlba_2.3.3
## [103] httpuv_1.6.2 rtracklayer_1.52.1 R6_2.5.1
## [106] BiocIO_1.2.0 promises_1.2.0.1 gridExtra_2.3
## [109] vipor_0.4.5 codetools_0.2-18 assertthat_0.2.1
## [112] rhdf5_2.36.0 rjson_0.2.20 withr_2.4.2
## [115] GenomicAlignments_1.28.0 Rsamtools_2.8.0 GenomeInfoDbData_1.2.6
## [118] hms_1.1.0 beachmat_2.8.1 DelayedMatrixStats_1.14.3
## [121] Cairo_1.5-12.2 shiny_1.6.0 lubridate_1.7.10
## [124] ggbeeswarm_0.6.0 restfulr_0.0.13