In the course materials we performed QC and filtering of 2 samples from each of the sample groups. For this challenge we would like you to perform QC and filtering on all of the samples from the Caron data set.
library(DropletUtils)
library(scater)
library(ensembldb)
library(AnnotationHub)
library(BiocParallel)
library(tidyverse)
We will load both the Caron and HCA data sets. We have already prepared a sample meta data table that relates the sample/run ID to the sample group.
samplesheet <- read_tsv("Data/sample_sheet.tsv")
NOTE: There are two PBMMC_1 samples. These are two libraries from the same sample material. Later on, we will combine these to form a single sample, but for now we will need to rename them so that they are QC’d separately.
We first need to first set up some parallel parameters using the
package BiocParallel
.
bp.params <- MulticoreParam(workers = 7)
In order to load the CellRanger data for all of the samples, you will
first need to create a named vector of the paths to the filtered count
matrix folders called list_of_files
and then use this in
the read10xCounts
command.
The paths to filtered_feature_bc_matrix
directories for
each sample can be constructed using the Sample column
as:
Data/CellRanger_Outputs/Sample/outs/filtered_feature_bc_matrix
You will need to use a function such as str_c
or
paste
samples_list <- samplesheet$Sample
list_of_files <- str_c("Data/CellRanger_Outputs/",
samples_list,
"/outs/filtered_feature_bc_matrix")
names(list_of_files) <- samples_list
sce <- read10xCounts(list_of_files, col.names=TRUE, BPPARAM = bp.params)
A good sanity check is to look at the colData to ensure that we have all the samples we are expecting and check the overall size of the new object.
colData(sce) %>%
as.data.frame() %>%
select(Sample) %>%
distinct()
## Sample
## 1_AAACCTGAGACTTTCG-1 SRR9264343
## 2_AAACCTGAGATAGTCA-1 SRR9264344
## 3_AAACCTGAGGTGATTA-1 SRR9264345
## 4_AAACCTGAGAGTGACC-1 SRR9264346
## 5_AAACCTGAGCCAGAAC-1 SRR9264347
## 6_AAACCTGAGCTAGCCC-1 SRR9264348
## 7_AAACCTGAGAGCTGCA-1 SRR9264349
## 8_AAACCTGAGAACAACT-1 SRR9264350
## 9_AAACCTGCACTTCGAA-1 SRR9264351
## 10_AAACCTGAGAAACGAG-1 SRR9264353
## 11_AAACCTGAGGACAGCT-1 SRR9264354
sce
## class: SingleCellExperiment
## dim: 37764 53450
## metadata(1): Samples
## assays(1): counts
## rownames(37764): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
## ENSG00000277196
## rowData names(3): ID Symbol Type
## colnames(53450): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
## 11_TTTGTCATCAGTTGAC-1 11_TTTGTCATCTCGTTTA-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
We need to modify the cell barcodes so that they unique for each sample.
We will also add information from the sample sheet to the droplet annotation.
sce$Barcode <- rownames(colData(sce))
colData(sce) <- merge(colData(sce), samplesheet, by="Sample", sort=FALSE)
rownames(colData(sce)) <- sce$Barcode
Remove undetected genes.
detected_genes <- rowSums(counts(sce)) > 0
sce <- sce[detected_genes,]
What proportion of genes have been detected
mean(detected_genes)
## [1] 0.877979
Approximately 87.8 percent of genes were detected in at least one
sample.
In order to assess the percentage of mitochondrial UMIs, we will need to be able to identify mitochondrial genes. The simplest way to do this is to annotate the genes with their chromosome of origin.
ah <- AnnotationHub()
ens.hs.107<- query(ah, c("Homo sapiens", "EnsDb", 107))[[1]]
genes <- rowData(sce)$ID
gene_annot <- AnnotationDbi::select(ens.hs.107,
keys = genes,
keytype = "GENEID",
columns = c("GENEID", "SEQNAME")) %>%
set_names(c("ID", "Chromosome"))
rowData(sce) <- merge(rowData(sce), gene_annot, by = "ID", sort=FALSE)
rownames(rowData(sce)) <- rowData(sce)$ID
rowData(sce)
## DataFrame with 33156 rows and 4 columns
## ID Symbol Type Chromosome
## <character> <character> <character> <character>
## ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression 1
## ENSG00000238009 ENSG00000238009 ENSG00000238009 Gene Expression 1
## ENSG00000239945 ENSG00000239945 ENSG00000239945 Gene Expression 1
## ENSG00000241860 ENSG00000241860 ENSG00000241860 Gene Expression 1
## ENSG00000241599 ENSG00000241599 ENSG00000241599 Gene Expression 1
## ... ... ... ... ...
## ENSG00000277856 ENSG00000277856 ENSG00000277856 Gene Expression KI270726.1
## ENSG00000275063 ENSG00000275063 ENSG00000275063 Gene Expression KI270726.1
## ENSG00000275869 ENSG00000275869 ENSG00000275869 Gene Expression KI270728.1
## ENSG00000276017 ENSG00000276017 ENSG00000276017 Gene Expression KI270734.1
## ENSG00000278817 ENSG00000278817 ENSG00000278817 Gene Expression KI270734.1
Now add the per cell QC metrics to the droplet annotation using the
function addPerCellQC
.
is.mito <- which(rowData(sce)$Chromosome=="MT")
sce <- addPerCellQC(sce, subsets=list(Mito=is.mito), BPPARAM = bp.params)
Before moving on to do the actual cell filtering, it is always a good idea to explore the distribution of the metrics across the droplets.
Use the scater
function plotColData
to
generate plots showing the distributions of the total number of UMIs,
the number of genes detected and percentage of UMIs aligned to
mitochondrial genes across all cells for each sample.
This time we will use facet_wrap
to split the plot
according to the Sample Group. We will need to include the sample group
in the plot data using the other_fields
argument in
plotColData
(see the help page for details -
?plotColData
) in order that we can use it in the
facet_wrap
command.
The code for plotting the total number of UMIs is shown below. You will also need to plot the the number of genes detected and percentage of UMIs aligned to mitochondrial.
plotColData(sce, x="SampleName", y="sum", other_fields="SampleGroup") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
ggtitle("Total count")
plotColData(sce, x="SampleName", y="detected", other_fields="SampleGroup") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
scale_y_log10() +
ggtitle("Detected features")
plotColData(sce,
x="SampleName",
y="subsets_Mito_percent",
other_fields="SampleGroup") +
facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
ggtitle("Mito percent")
Use the scater function quickPerCellQC
to assess cell
quality based on the three metrics. Name the object generated
cell_qc_results
.
When running the command, consider the distribution plots above and
decide whether to use the batch
option and if so, at what
level it should be applied.
How many cells will be removed from the data set?
batch
), we could apply it by
sample group (ETV6-RUNX1, HHD, PBMMC, PRE-T), or we could apply it per
sample.
It is clear that there are significantly different distribution profiles between samples in the same group. Noticeably the distribution profiles for the two HDD samples are quite different. For this reason it may be prudent to apply the adaptive filtering to each sample independently.
cell_qc_filters <- quickPerCellQC(colData(sce),
percent_subsets=c("subsets_Mito_percent"),
batch=sce$Sample)
as.data.frame(cell_qc_filters) %>% summarise(across(everything(), sum))
## low_lib_size low_n_features high_subsets_Mito_percent discard
## 1 608 1619 3821 5128
In total 5128 cells will be removed from the dataset.
Let’s add the columns in the droplet annotation with these new filters.
colData(sce) <- cbind(colData(sce), cell_qc_filters)
We can visualise how the new filters look using violin plots.
plotColData(sce,
x="SampleName",
y="sum",
other_fields="SampleGroup",
colour_by = "low_lib_size") +
facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") +
scale_y_log10() +
labs(y = "Total count", title = "Total count") +
guides(colour=guide_legend(title="Discarded"))
plotColData(sce,
x="SampleName",
y="detected",
other_fields="SampleGroup",
colour_by = "low_n_features") +
facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") +
scale_y_log10() +
labs(y = "Genes detected", title = "Genes detected") +
guides(colour=guide_legend(title="Discarded"))
plotColData(sce,
x="SampleName",
y="subsets_Mito_percent",
other_fields="SampleGroup",
colour_by = "high_subsets_Mito_percent") +
facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") +
labs(y = "Percentage mitochondrial UMIs",
title = "Mitochondrial UMIs") +
guides(colour=guide_legend(title="Discarded"))
Filter out the poor quality cells.
sce <- sce[, !sce$discard]
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] forcats_0.5.2 stringr_1.4.1
## [3] dplyr_1.0.10 purrr_0.3.4
## [5] readr_2.1.2 tidyr_1.2.0
## [7] tibble_3.1.8 tidyverse_1.3.2
## [9] BiocParallel_1.30.3 AnnotationHub_3.4.0
## [11] BiocFileCache_2.4.0 dbplyr_2.2.1
## [13] ensembldb_2.20.2 AnnotationFilter_1.20.0
## [15] GenomicFeatures_1.48.3 AnnotationDbi_1.58.0
## [17] scater_1.24.0 ggplot2_3.3.6
## [19] scuttle_1.6.3 DropletUtils_1.16.0
## [21] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1
## [23] Biobase_2.56.0 GenomicRanges_1.48.0
## [25] GenomeInfoDb_1.32.3 IRanges_2.30.1
## [27] S4Vectors_0.34.0 BiocGenerics_0.42.0
## [29] MatrixGenerics_1.8.1 matrixStats_0.62.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.4.1 backports_1.4.1
## [3] lazyeval_0.2.2 digest_0.6.29
## [5] htmltools_0.5.3 viridis_0.6.2
## [7] fansi_1.0.3 magrittr_2.0.3
## [9] memoise_2.0.1 ScaledMatrix_1.4.0
## [11] googlesheets4_1.0.1 tzdb_0.3.0
## [13] limma_3.52.2 Biostrings_2.64.1
## [15] modelr_0.1.9 vroom_1.5.7
## [17] R.utils_2.12.0 prettyunits_1.1.1
## [19] colorspace_2.0-3 rvest_1.0.3
## [21] blob_1.2.3 rappdirs_0.3.3
## [23] ggrepel_0.9.1 haven_2.5.1
## [25] xfun_0.32 crayon_1.5.1
## [27] RCurl_1.98-1.8 jsonlite_1.8.0
## [29] glue_1.6.2 gargle_1.2.0
## [31] gtable_0.3.1 zlibbioc_1.42.0
## [33] XVector_0.36.0 DelayedArray_0.22.0
## [35] BiocSingular_1.12.0 Rhdf5lib_1.18.2
## [37] HDF5Array_1.24.2 scales_1.2.1
## [39] DBI_1.1.3 edgeR_3.38.4
## [41] Rcpp_1.0.9 viridisLite_0.4.1
## [43] xtable_1.8-4 progress_1.2.2
## [45] dqrng_0.3.0 bit_4.0.4
## [47] rsvd_1.0.5 httr_1.4.4
## [49] ellipsis_0.3.2 farver_2.1.1
## [51] pkgconfig_2.0.3 XML_3.99-0.10
## [53] R.methodsS3_1.8.2 sass_0.4.2
## [55] locfit_1.5-9.6 utf8_1.2.2
## [57] labeling_0.4.2 tidyselect_1.1.2
## [59] rlang_1.0.5 later_1.3.0
## [61] cellranger_1.1.0 munsell_0.5.0
## [63] BiocVersion_3.15.2 tools_4.2.0
## [65] cachem_1.0.6 cli_3.3.0
## [67] generics_0.1.3 RSQLite_2.2.16
## [69] broom_1.0.1 evaluate_0.16
## [71] fastmap_1.1.0 yaml_2.3.5
## [73] fs_1.5.2 knitr_1.40
## [75] bit64_4.0.5 KEGGREST_1.36.3
## [77] sparseMatrixStats_1.8.0 mime_0.12
## [79] R.oo_1.25.0 xml2_1.3.3
## [81] biomaRt_2.52.0 compiler_4.2.0
## [83] rstudioapi_0.14 beeswarm_0.4.0
## [85] filelock_1.0.2 curl_4.3.2
## [87] png_0.1-7 interactiveDisplayBase_1.34.0
## [89] reprex_2.0.2 bslib_0.4.0
## [91] stringi_1.7.8 highr_0.9
## [93] lattice_0.20-45 ProtGenerics_1.28.0
## [95] Matrix_1.4-1 vctrs_0.4.1
## [97] pillar_1.8.1 lifecycle_1.0.1
## [99] rhdf5filters_1.8.0 BiocManager_1.30.18
## [101] jquerylib_0.1.4 BiocNeighbors_1.14.0
## [103] cowplot_1.1.1 bitops_1.0-7
## [105] irlba_2.3.5 httpuv_1.6.5
## [107] rtracklayer_1.56.1 R6_2.5.1
## [109] BiocIO_1.6.0 promises_1.2.0.1
## [111] gridExtra_2.3 vipor_0.4.5
## [113] codetools_0.2-18 assertthat_0.2.1
## [115] rhdf5_2.40.0 rjson_0.2.21
## [117] withr_2.5.0 GenomicAlignments_1.32.1
## [119] Rsamtools_2.12.0 GenomeInfoDbData_1.2.8
## [121] parallel_4.2.0 hms_1.1.2
## [123] grid_4.2.0 beachmat_2.12.0
## [125] rmarkdown_2.16 DelayedMatrixStats_1.18.0
## [127] googledrive_2.0.0 lubridate_1.8.0
## [129] shiny_1.7.2 ggbeeswarm_0.6.0
## [131] restfulr_0.0.15