1 Introduction

In the course materials we performed QC and filtering of 2 samples from each of the sample groups. For this challenge we would like you to perform QC and filtering on all of the samples from the Caron data set.

2 Load packages

library(DropletUtils)
library(scater)
library(ensembldb)
library(AnnotationHub)
library(BiocParallel)
library(tidyverse)

3 Sample meta data

We will load both the Caron and HCA data sets. We have already prepared a sample meta data table that relates the sample/run ID to the sample group.

samplesheet <- read_tsv("Data/sample_sheet.tsv")

NOTE: There are two PBMMC_1 samples. These are two libraries from the same sample material. Later on, we will combine these to form a single sample, but for now we will need to rename them so that they are QC’d separately.

4 Read CellRanger outputs into R

4.1 Parallelisation

We first need to first set up some parallel parameters using the package BiocParallel.

bp.params <- MulticoreParam(workers = 7)

4.2 Load the data

In order to load the CellRanger data for all of the samples, you will first need to create a named vector of the paths to the filtered count matrix folders called list_of_files and then use this in the read10xCounts command.

Hint

The paths to filtered_feature_bc_matrix directories for each sample can be constructed using the Sample column as:

Data/CellRanger_Outputs/Sample/outs/filtered_feature_bc_matrix

You will need to use a function such as str_c or paste

The names of the vector will determine the sample name used in the counts matrix, this should be the ‘SRR’ number as in the Sample column of the sample sheet.
Answer
samples_list <- samplesheet$Sample
list_of_files <- str_c("Data/CellRanger_Outputs/", 
                       samples_list, 
                       "/outs/filtered_feature_bc_matrix")
names(list_of_files) <- samples_list

    
sce <- read10xCounts(list_of_files, col.names=TRUE, BPPARAM = bp.params)

4.3 Check samples in the data set

A good sanity check is to look at the colData to ensure that we have all the samples we are expecting and check the overall size of the new object.

colData(sce) %>%
    as.data.frame() %>% 
    select(Sample) %>% 
    distinct()
##                           Sample
## 1_AAACCTGAGACTTTCG-1  SRR9264343
## 2_AAACCTGAGATAGTCA-1  SRR9264344
## 3_AAACCTGAGGTGATTA-1  SRR9264345
## 4_AAACCTGAGAGTGACC-1  SRR9264346
## 5_AAACCTGAGCCAGAAC-1  SRR9264347
## 6_AAACCTGAGCTAGCCC-1  SRR9264348
## 7_AAACCTGAGAGCTGCA-1  SRR9264349
## 8_AAACCTGAGAACAACT-1  SRR9264350
## 9_AAACCTGCACTTCGAA-1  SRR9264351
## 10_AAACCTGAGAAACGAG-1 SRR9264353
## 11_AAACCTGAGGACAGCT-1 SRR9264354
sce
## class: SingleCellExperiment 
## dim: 37764 53450 
## metadata(1): Samples
## assays(1): counts
## rownames(37764): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
##   ENSG00000277196
## rowData names(3): ID Symbol Type
## colnames(53450): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
##   11_TTTGTCATCAGTTGAC-1 11_TTTGTCATCTCGTTTA-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

4.4 Modify the droplet annotation

We need to modify the cell barcodes so that they unique for each sample.

We will also add information from the sample sheet to the droplet annotation.

sce$Barcode <- rownames(colData(sce))
colData(sce) <- merge(colData(sce), samplesheet, by="Sample", sort=FALSE)
rownames(colData(sce)) <- sce$Barcode

5 Undetected genes

Remove undetected genes.

detected_genes <- rowSums(counts(sce)) > 0
sce <- sce[detected_genes,]

What proportion of genes have been detected

Answer
mean(detected_genes)
## [1] 0.877979
Approximately 87.8 percent of genes were detected in at least one sample.

6 Annotate genes

In order to assess the percentage of mitochondrial UMIs, we will need to be able to identify mitochondrial genes. The simplest way to do this is to annotate the genes with their chromosome of origin.

ah <- AnnotationHub()
ens.hs.107<- query(ah, c("Homo sapiens", "EnsDb", 107))[[1]] 

genes <- rowData(sce)$ID
gene_annot <- AnnotationDbi::select(ens.hs.107, 
                                    keys = genes,
                                    keytype = "GENEID",
                                    columns = c("GENEID", "SEQNAME")) %>%
    set_names(c("ID", "Chromosome"))
rowData(sce) <- merge(rowData(sce), gene_annot, by = "ID", sort=FALSE)
rownames(rowData(sce)) <- rowData(sce)$ID

rowData(sce)
## DataFrame with 33156 rows and 4 columns
##                              ID          Symbol            Type  Chromosome
##                     <character>     <character>     <character> <character>
## ENSG00000243485 ENSG00000243485     MIR1302-2HG Gene Expression           1
## ENSG00000238009 ENSG00000238009 ENSG00000238009 Gene Expression           1
## ENSG00000239945 ENSG00000239945 ENSG00000239945 Gene Expression           1
## ENSG00000241860 ENSG00000241860 ENSG00000241860 Gene Expression           1
## ENSG00000241599 ENSG00000241599 ENSG00000241599 Gene Expression           1
## ...                         ...             ...             ...         ...
## ENSG00000277856 ENSG00000277856 ENSG00000277856 Gene Expression  KI270726.1
## ENSG00000275063 ENSG00000275063 ENSG00000275063 Gene Expression  KI270726.1
## ENSG00000275869 ENSG00000275869 ENSG00000275869 Gene Expression  KI270728.1
## ENSG00000276017 ENSG00000276017 ENSG00000276017 Gene Expression  KI270734.1
## ENSG00000278817 ENSG00000278817 ENSG00000278817 Gene Expression  KI270734.1

7 Add per cell QC metrics

Now add the per cell QC metrics to the droplet annotation using the function addPerCellQC.

is.mito <- which(rowData(sce)$Chromosome=="MT")

sce <- addPerCellQC(sce, subsets=list(Mito=is.mito), BPPARAM = bp.params)

8 Explore QC metric distribution

Before moving on to do the actual cell filtering, it is always a good idea to explore the distribution of the metrics across the droplets.

Use the scater function plotColData to generate plots showing the distributions of the total number of UMIs, the number of genes detected and percentage of UMIs aligned to mitochondrial genes across all cells for each sample.

This time we will use facet_wrap to split the plot according to the Sample Group. We will need to include the sample group in the plot data using the other_fields argument in plotColData (see the help page for details - ?plotColData) in order that we can use it in the facet_wrap command.

The code for plotting the total number of UMIs is shown below. You will also need to plot the the number of genes detected and percentage of UMIs aligned to mitochondrial.

plotColData(sce, x="SampleName", y="sum", other_fields="SampleGroup") + 
    facet_wrap(~SampleGroup, nrow=1, scales = "free_x") + 
    scale_y_log10() + 
    ggtitle("Total count")
Answer
plotColData(sce, x="SampleName", y="detected", other_fields="SampleGroup") + 
    facet_wrap(~SampleGroup, nrow=1, scales = "free_x") + 
    scale_y_log10() + 
    ggtitle("Detected features")
plotColData(sce, 
            x="SampleName", 
            y="subsets_Mito_percent", 
            other_fields="SampleGroup") + 
    facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
    ggtitle("Mito percent")

9 Identification of low-quality cells with adaptive thresholds

Use the scater function quickPerCellQC to assess cell quality based on the three metrics. Name the object generated cell_qc_results.

When running the command, consider the distribution plots above and decide whether to use the batch option and if so, at what level it should be applied.

How many cells will be removed from the data set?

Hint With these samples we have three possible “batch” levels at which we could run adaptive filtering. We could apply the filtering across all samples together (i.e. no batch), we could apply it by sample group (ETV6-RUNX1, HHD, PBMMC, PRE-T), or we could apply it per sample.
Answer

It is clear that there are significantly different distribution profiles between samples in the same group. Noticeably the distribution profiles for the two HDD samples are quite different. For this reason it may be prudent to apply the adaptive filtering to each sample independently.

cell_qc_filters <- quickPerCellQC(colData(sce),
                                percent_subsets=c("subsets_Mito_percent"),
                                batch=sce$Sample)
as.data.frame(cell_qc_filters) %>% summarise(across(everything(), sum))
##   low_lib_size low_n_features high_subsets_Mito_percent discard
## 1          608           1619                      3821    5128

In total 5128 cells will be removed from the dataset.

Let’s add the columns in the droplet annotation with these new filters.

colData(sce) <- cbind(colData(sce), cell_qc_filters)

We can visualise how the new filters look using violin plots.

plotColData(sce, 
            x="SampleName", 
            y="sum",
            other_fields="SampleGroup", 
            colour_by = "low_lib_size") + 
    facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") + 
    scale_y_log10() + 
    labs(y = "Total count", title = "Total count") +
    guides(colour=guide_legend(title="Discarded"))

plotColData(sce, 
            x="SampleName", 
            y="detected",
            other_fields="SampleGroup", 
            colour_by = "low_n_features") + 
    facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") + 
    scale_y_log10() + 
    labs(y = "Genes detected", title = "Genes detected") +
    guides(colour=guide_legend(title="Discarded"))

plotColData(sce, 
        x="SampleName", 
        y="subsets_Mito_percent",
        other_fields="SampleGroup", 
        colour_by = "high_subsets_Mito_percent") + 
    facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") + 
    labs(y = "Percentage mitochondrial UMIs",
         title = "Mitochondrial UMIs") +
    guides(colour=guide_legend(title="Discarded"))

10 Filtering out poor quality cells

Filter out the poor quality cells.

sce <- sce[, !sce$discard]

11 Session information

sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] forcats_0.5.2               stringr_1.4.1              
##  [3] dplyr_1.0.10                purrr_0.3.4                
##  [5] readr_2.1.2                 tidyr_1.2.0                
##  [7] tibble_3.1.8                tidyverse_1.3.2            
##  [9] BiocParallel_1.30.3         AnnotationHub_3.4.0        
## [11] BiocFileCache_2.4.0         dbplyr_2.2.1               
## [13] ensembldb_2.20.2            AnnotationFilter_1.20.0    
## [15] GenomicFeatures_1.48.3      AnnotationDbi_1.58.0       
## [17] scater_1.24.0               ggplot2_3.3.6              
## [19] scuttle_1.6.3               DropletUtils_1.16.0        
## [21] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1
## [23] Biobase_2.56.0              GenomicRanges_1.48.0       
## [25] GenomeInfoDb_1.32.3         IRanges_2.30.1             
## [27] S4Vectors_0.34.0            BiocGenerics_0.42.0        
## [29] MatrixGenerics_1.8.1        matrixStats_0.62.0         
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.4.1                  backports_1.4.1              
##   [3] lazyeval_0.2.2                digest_0.6.29                
##   [5] htmltools_0.5.3               viridis_0.6.2                
##   [7] fansi_1.0.3                   magrittr_2.0.3               
##   [9] memoise_2.0.1                 ScaledMatrix_1.4.0           
##  [11] googlesheets4_1.0.1           tzdb_0.3.0                   
##  [13] limma_3.52.2                  Biostrings_2.64.1            
##  [15] modelr_0.1.9                  vroom_1.5.7                  
##  [17] R.utils_2.12.0                prettyunits_1.1.1            
##  [19] colorspace_2.0-3              rvest_1.0.3                  
##  [21] blob_1.2.3                    rappdirs_0.3.3               
##  [23] ggrepel_0.9.1                 haven_2.5.1                  
##  [25] xfun_0.32                     crayon_1.5.1                 
##  [27] RCurl_1.98-1.8                jsonlite_1.8.0               
##  [29] glue_1.6.2                    gargle_1.2.0                 
##  [31] gtable_0.3.1                  zlibbioc_1.42.0              
##  [33] XVector_0.36.0                DelayedArray_0.22.0          
##  [35] BiocSingular_1.12.0           Rhdf5lib_1.18.2              
##  [37] HDF5Array_1.24.2              scales_1.2.1                 
##  [39] DBI_1.1.3                     edgeR_3.38.4                 
##  [41] Rcpp_1.0.9                    viridisLite_0.4.1            
##  [43] xtable_1.8-4                  progress_1.2.2               
##  [45] dqrng_0.3.0                   bit_4.0.4                    
##  [47] rsvd_1.0.5                    httr_1.4.4                   
##  [49] ellipsis_0.3.2                farver_2.1.1                 
##  [51] pkgconfig_2.0.3               XML_3.99-0.10                
##  [53] R.methodsS3_1.8.2             sass_0.4.2                   
##  [55] locfit_1.5-9.6                utf8_1.2.2                   
##  [57] labeling_0.4.2                tidyselect_1.1.2             
##  [59] rlang_1.0.5                   later_1.3.0                  
##  [61] cellranger_1.1.0              munsell_0.5.0                
##  [63] BiocVersion_3.15.2            tools_4.2.0                  
##  [65] cachem_1.0.6                  cli_3.3.0                    
##  [67] generics_0.1.3                RSQLite_2.2.16               
##  [69] broom_1.0.1                   evaluate_0.16                
##  [71] fastmap_1.1.0                 yaml_2.3.5                   
##  [73] fs_1.5.2                      knitr_1.40                   
##  [75] bit64_4.0.5                   KEGGREST_1.36.3              
##  [77] sparseMatrixStats_1.8.0       mime_0.12                    
##  [79] R.oo_1.25.0                   xml2_1.3.3                   
##  [81] biomaRt_2.52.0                compiler_4.2.0               
##  [83] rstudioapi_0.14               beeswarm_0.4.0               
##  [85] filelock_1.0.2                curl_4.3.2                   
##  [87] png_0.1-7                     interactiveDisplayBase_1.34.0
##  [89] reprex_2.0.2                  bslib_0.4.0                  
##  [91] stringi_1.7.8                 highr_0.9                    
##  [93] lattice_0.20-45               ProtGenerics_1.28.0          
##  [95] Matrix_1.4-1                  vctrs_0.4.1                  
##  [97] pillar_1.8.1                  lifecycle_1.0.1              
##  [99] rhdf5filters_1.8.0            BiocManager_1.30.18          
## [101] jquerylib_0.1.4               BiocNeighbors_1.14.0         
## [103] cowplot_1.1.1                 bitops_1.0-7                 
## [105] irlba_2.3.5                   httpuv_1.6.5                 
## [107] rtracklayer_1.56.1            R6_2.5.1                     
## [109] BiocIO_1.6.0                  promises_1.2.0.1             
## [111] gridExtra_2.3                 vipor_0.4.5                  
## [113] codetools_0.2-18              assertthat_0.2.1             
## [115] rhdf5_2.40.0                  rjson_0.2.21                 
## [117] withr_2.5.0                   GenomicAlignments_1.32.1     
## [119] Rsamtools_2.12.0              GenomeInfoDbData_1.2.8       
## [121] parallel_4.2.0                hms_1.1.2                    
## [123] grid_4.2.0                    beachmat_2.12.0              
## [125] rmarkdown_2.16                DelayedMatrixStats_1.18.0    
## [127] googledrive_2.0.0             lubridate_1.8.0              
## [129] shiny_1.7.2                   ggbeeswarm_0.6.0             
## [131] restfulr_0.0.15