1 Introduction

In the course materials we performed QC and filtering of 2 samples from each of the sample groups. For this challenge we would like you to perform QC and filtering on all of the samples from the Caron data set.

2 Load packages

library(DropletUtils)
library(scater)
library(ensembldb)
library(AnnotationHub)
library(BiocParallel)
library(tidyverse)
library(patchwork)
library(ggvenn)

3 Sample meta data

We will load both the Caron and HCA data sets. We have already prepared a sample meta data table that relates the sample/run ID to the sample group.

samplesheet <- read_tsv("Data/sample_sheet.tsv")

NOTE: There are two PBMMC_1 samples. These are two libraries from the same sample material. Later on, we will combine these to form a single sample, but for now we will need to rename them so that they are QC’d separately.

4 Read CellRanger outputs into R

4.1 Parallelisation

We first need to first set up some parallel parameters using the package BiocParallel.

bp.params <- MulticoreParam(workers = 7)

4.2 Load the data

In order to load the CellRanger data for all of the Caron samples, you will first need to create a named vector of the paths to the filtered count matrix folders called list_of_files and then use this in the read10xCounts command.

Hint

The paths to filtered_feature_bc_matrix directories for each sample can be constructed using the SampleId as:

CellRanger_Outputs/SampleId/outs/filtered_feature_bc_matrix

You will need to use a function such as str_c or paste

The names of the vector will determine the sample name used in the counts matrix, this should be the ‘SRR’ number as in the SampleId column of the sample sheet.
Answer
samples_list <- samplesheet %>% 
    filter(DatasetName=="Caron") %>%  
    pull(SampleId)
list_of_files <- str_c("CellRanger_Outputs/", 
                       samples_list, 
                       "/outs/filtered_feature_bc_matrix")
names(list_of_files) <- samples_list

    
sce <- read10xCounts(list_of_files, col.names=TRUE, BPPARAM = bp.params)

4.3 Check samples in the data set

A good sanity check is to look at the colData to ensure that we have all the samples we are expecting and check the overall size of the new object.

colData(sce) %>%
    as.data.frame() %>% 
    select(Sample) %>% 
    distinct()
sce
## class: SingleCellExperiment 
## dim: 36601 51286 
## metadata(1): Samples
## assays(1): counts
## rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817 ENSG00000277196
## rowData names(3): ID Symbol Type
## colnames(51286): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ... 12_TTTGTCATCAGTTGAC-1
##   12_TTTGTCATCTCGTTTA-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

4.4 Modify the droplet annotation

We need to modify the cell barcodes so that they unique for each sample. We should also modify the names of the two PBMMC_1 samples, so that we can distinguish them from one another.

We will also switch the “Sample” column to be the sample name and add information from the sample sheet to the droplet annotation.

In the code below, replace the “XXXXXXXX” to rename the PBMMC_1 samples appropriately.

colData(sce) <- colData(sce) %>% 
    as.data.frame() %>%
    rownames_to_column("RowName") %>% 
    mutate(SampleNum = str_extract(RowName, "^[0-9]+")) %>%
    mutate(Barcode = str_replace(Barcode, "1$", SampleNum)) %>%
    left_join(samplesheet, by=c(Sample="SampleId")) %>%
    rename(SampleId=Sample) %>% 
    rename(Sample=SampleName) %>%    
    mutate(Sample = case_when(
           SampleId == XXXXXXXX ~ str_c(Sample, "a"),
           SampleId == XXXXXXXX ~ str_c(Sample, "b"),
           TRUE ~ Sample)) %>% 
    column_to_rownames("RowName") %>% 
    select(Sample, Barcode, SampleId, SampleGroup, DatasetName) %>%
    DataFrame()
Hint The new SampleId columns contains the “SRR” numbers, as does the SampleId in the sample sheet. Replace the XXXXXXXXX with the appropriate “SRR” number by checking for the PBMMC_1 samples in the samplesheet.
Answer
colData(sce) <- colData(sce) %>% 
    as.data.frame() %>%
    rownames_to_column("RowName") %>% 
    mutate(SampleNum = str_extract(RowName, "^[0-9]+")) %>%
    mutate(Barcode = str_replace(Barcode, "1$", SampleNum)) %>%
    left_join(samplesheet, by=c(Sample="SampleId")) %>%
    rename(SampleId=Sample) %>% 
    rename(Sample=SampleName) %>%    
    mutate(Sample = case_when(
           SampleId == "SRR9264351" ~ str_c(Sample, "a"),
           SampleId == "SRR9264352" ~ str_c(Sample, "b"),
           TRUE ~ Sample)) %>% 
    column_to_rownames("RowName") %>% 
    select(Sample, Barcode, SampleId, SampleGroup, DatasetName) %>%
    DataFrame()

Before moving on, check that the columns are as we expect them to be and that the PBMMC_1 samples have been renamed:

colData(sce)
## DataFrame with 51286 rows and 5 columns
##                             Sample             Barcode    SampleId SampleGroup DatasetName
##                        <character>         <character> <character> <character> <character>
## 1_AAACCTGAGACTTTCG-1  ETV6-RUNX1_1  AAACCTGAGACTTTCG-1  SRR9264343  ETV6-RUNX1       Caron
## 1_AAACCTGGTCTTCAAG-1  ETV6-RUNX1_1  AAACCTGGTCTTCAAG-1  SRR9264343  ETV6-RUNX1       Caron
## 1_AAACCTGGTGCAACTT-1  ETV6-RUNX1_1  AAACCTGGTGCAACTT-1  SRR9264343  ETV6-RUNX1       Caron
## 1_AAACCTGGTGTTGAGG-1  ETV6-RUNX1_1  AAACCTGGTGTTGAGG-1  SRR9264343  ETV6-RUNX1       Caron
## 1_AAACCTGTCCCAAGTA-1  ETV6-RUNX1_1  AAACCTGTCCCAAGTA-1  SRR9264343  ETV6-RUNX1       Caron
## ...                            ...                 ...         ...         ...         ...
## 12_TTTGTCAGTGTGAATA-1      PBMMC_3 TTTGTCAGTGTGAATA-12  SRR9264354       PBMMC       Caron
## 12_TTTGTCAGTTCCACGG-1      PBMMC_3 TTTGTCAGTTCCACGG-12  SRR9264354       PBMMC       Caron
## 12_TTTGTCAGTTCTCATT-1      PBMMC_3 TTTGTCAGTTCTCATT-12  SRR9264354       PBMMC       Caron
## 12_TTTGTCATCAGTTGAC-1      PBMMC_3 TTTGTCATCAGTTGAC-12  SRR9264354       PBMMC       Caron
## 12_TTTGTCATCTCGTTTA-1      PBMMC_3 TTTGTCATCTCGTTTA-12  SRR9264354       PBMMC       Caron
unique(sce$Sample)
##  [1] "ETV6-RUNX1_1" "ETV6-RUNX1_2" "ETV6-RUNX1_3" "ETV6-RUNX1_4" "HHD_1"        "HHD_2"       
##  [7] "PRE-T_1"      "PRE-T_2"      "PBMMC_1a"     "PBMMC_1b"     "PBMMC_2"      "PBMMC_3"

5 Undetected genes

Remove undetected genes.

detected_genes <- rowSums(counts(sce)) > 0
sce <- sce[detected_genes,]

What proportion of genes have been detected

Answer
detected <- sum(detected_genes)/length(detected_genes)
detected
## [1] 0.7753067
Approximately 77.5 percent of genes were detected in at least one sample.

6 Annotate genes

In order to assess the percentage of mitochondrial UMIs, we will need to be able to identify mitochondrial genes. The simplest way to do this is to annotate the genes with their chromosome of origin.

ah <- AnnotationHub()
ens.mm.98 <- query(ah, c("Homo sapiens", "EnsDb", 98))[[1]] 

genes <- rowData(sce)$ID
gene_annot <- AnnotationDbi::select(ens.mm.98, 
                                    keys = genes,
                                    keytype = "GENEID",
                                    columns = c("GENEID", "SEQNAME")) %>%
    set_names(c("ID", "Chromosome"))
rowData(sce) <- merge(rowData(sce), gene_annot, by = "ID", sort=FALSE)
rownames(rowData(sce)) <- rowData(sce)$ID

rowData(sce)
## DataFrame with 28377 rows and 4 columns
##                              ID      Symbol            Type  Chromosome
##                     <character> <character>     <character> <character>
## ENSG00000238009 ENSG00000238009  AL627309.1 Gene Expression           1
## ENSG00000239945 ENSG00000239945  AL627309.3 Gene Expression           1
## ENSG00000241860 ENSG00000241860  AL627309.5 Gene Expression           1
## ENSG00000241599 ENSG00000241599  AL627309.4 Gene Expression           1
## ENSG00000286448 ENSG00000286448  AP006222.2 Gene Expression           1
## ...                         ...         ...             ...         ...
## ENSG00000274792 ENSG00000274792  AC171558.1 Gene Expression  KI270727.1
## ENSG00000275869 ENSG00000275869  AC136612.1 Gene Expression  KI270728.1
## ENSG00000277836 ENSG00000277836  AC141272.1 Gene Expression  KI270728.1
## ENSG00000278633 ENSG00000278633  AC023491.2 Gene Expression  KI270731.1
## ENSG00000278817 ENSG00000278817  AC007325.4 Gene Expression  KI270734.1

7 Add per cell QC metrics

Now add the per cell QC metrics to the droplet annotation using the function addPerCellQC.

is.mito <- which(rowData(sce)$Chromosome=="MT")

sce <- addPerCellQC(sce, subsets=list(Mito=is.mito), BPPARAM = bp.params)

8 Explore QC metric distribution

Before moving on to do the actual cell filtering, it is always a good idea to explore the distribution of the metrics across the droplets.

Use the scater function plotColData to generate plots showing the distributions of the total number of UMIs, the number of genes detected and percentage of UMIs aligned to mitochondrial genes across all cells for each sample.

Answer
plotColData(sce, x="Sample", y="sum",other_fields="SampleGroup") + 
    facet_wrap(~SampleGroup, nrow=1, scales = "free_x") + 
    scale_y_log10() + 
    ggtitle("Total count")
plotColData(sce, x="Sample", y="detected", other_fields="SampleGroup") + 
    facet_wrap(~SampleGroup, nrow=1, scales = "free_x") + 
    scale_y_log10() + 
    ggtitle("Detected features")
plotColData(sce, x="Sample", y="subsets_Mito_percent", other_fields="SampleGroup") + 
    facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
    ggtitle("Mito percent")

9 Identification of low-quality cells with adaptive thresholds

Use the scater function quickPerCellQC to assess cell quality based on the three metrics. Name the object generated cell_qc_results.

When running the command, consider the distribution plots above and decide whether to use the batch option and if so, at what level it should be applied.

How many cells will be removed from the data set?

Hint With these samples we have three possible “batch” levels at which we could run adaptive filtering. We could apply the filtering across all samples together (i.e. no batch), we could apply it by sample group (ETV6-RUNX1, HHD, PBMMC, PRE-T), or we could apply it per sample.
Answer

It is clear that there are significantly different distribution profiles between samples in the same group. Noticeably the distribution profiles for the two PBMMC_1 samples are quite different - it would appear that PBMMC_1b has a lower overal UMI count indicating less sequencing. For this reason it may be prudent to apply the adaptive filtering to each sample independently.

cell_qc_results <- quickPerCellQC(colData(sce),
                                percent_subsets=c("subsets_Mito_percent"),
                                batch=sce$Sample)
as.data.frame(cell_qc_results) %>% summarise(across(everything(), sum))

In total 0 cells will be removed from the dataset.

Let’s replace the columns in the droplet annotation with these new filters.

sce$low_lib_size <- cell_qc_results$low_lib_size
sce$low_n_features <- cell_qc_results$low_n_features
sce$high_Mito_percent <- cell_qc_results$high_subsets_Mito_percent
sce$discard <- cell_qc_results$discard

We can visualise how the new filters look using violin plots.

plotColData(sce, 
            x="Sample", 
            y="sum",
            other_fields="SampleGroup", 
            colour_by = "low_lib_size") + 
    facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") + 
    scale_y_log10() + 
    labs(y = "Total count", title = "Total count") +
    guides(colour=guide_legend(title="Discarded"))

plotColData(sce, 
            x="Sample", 
            y="detected",
            other_fields="SampleGroup", 
            colour_by = "low_n_features") + 
    facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") + 
    scale_y_log10() + 
    labs(y = "Genes detected", title = "Genes detected") +
    guides(colour=guide_legend(title="Discarded"))

plotColData(sce, 
        x="Sample", 
        y="subsets_Mito_percent",
        other_fields="SampleGroup", 
        colour_by = "high_Mito_percent") + 
    facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") + 
    labs(y = "Percentage mitochondrial UMIs",
         title = "Mitochondrial UMIs") +
    guides(colour=guide_legend(title="Discarded"))

10 Filtering out poor quality cells

Filter out the poor quality cells, recalculate the cell QC metrics and save the filtered object.

sce <- sce[, !sce$discard]
colData(sce) <- colData(sce)[,1:3]
sce <- addPerCellQC(sce, BPPARAM = bp.params)
saveRDS(sce, "../results/Caron_filtered_genes.rds")

11 Filtering genes based on sparsity - Bonus exercise

So far we have only eliminated genes that are undetected across all samples. In reality there will be a large number of genes that are only detected at low levels in a small number of cells - much of this will just be due to technical noise. We could further filter the genes using “sparsity”.

Look at section 8 of the documention. Filter the dataset to remove genes that have only been detected in fewer than 20 cells.

How many additional genes are filtered out?

Answer
sce <- addPerFeatureQC(sce, BPPARAM = bp.params)
rowData(sce)$gene_sparsity <- (100 - rowData(sce)$detected) / 100
max.sparsity <- 1 - (20 / ncol(sce))
sparse.genes <- rowData(sce)$gene_sparsity > max.sparsity
sce <- sce[!sparse.genes, ]
sum(sparse.genes)
## [1] 8410

An additional 8410 have been filtered out.

12 Session information

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8       
##  [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rmarkdown_2.10              ggvenn_0.1.9                patchwork_1.1.1            
##  [4] forcats_0.5.1               stringr_1.4.0               dplyr_1.0.7                
##  [7] purrr_0.3.4                 readr_2.0.1                 tidyr_1.1.3                
## [10] tibble_3.1.4                tidyverse_1.3.1             BiocParallel_1.26.2        
## [13] AnnotationHub_3.0.1         BiocFileCache_2.0.0         dbplyr_2.1.1               
## [16] ensembldb_2.16.4            AnnotationFilter_1.16.0     GenomicFeatures_1.44.2     
## [19] AnnotationDbi_1.54.1        scater_1.20.1               ggplot2_3.3.5              
## [22] scuttle_1.2.1               DropletUtils_1.12.2         SingleCellExperiment_1.14.1
## [25] SummarizedExperiment_1.22.0 Biobase_2.52.0              GenomicRanges_1.44.0       
## [28] GenomeInfoDb_1.28.2         IRanges_2.26.0              S4Vectors_0.30.0           
## [31] BiocGenerics_0.38.0         MatrixGenerics_1.4.3        matrixStats_0.60.1         
## [34] nvimcom_0.9-82             
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1                  backports_1.2.1               lazyeval_0.2.2               
##   [4] digest_0.6.27                 htmltools_0.5.2               viridis_0.6.1                
##   [7] fansi_0.5.0                   magrittr_2.0.1                memoise_2.0.0                
##  [10] ScaledMatrix_1.0.0            tzdb_0.1.2                    limma_3.48.3                 
##  [13] Biostrings_2.60.2             modelr_0.1.8                  vroom_1.5.4                  
##  [16] R.utils_2.10.1                prettyunits_1.1.1             colorspace_2.0-2             
##  [19] rvest_1.0.1                   blob_1.2.2                    rappdirs_0.3.3               
##  [22] haven_2.4.3                   xfun_0.25                     crayon_1.4.1                 
##  [25] RCurl_1.98-1.4                jsonlite_1.7.2                glue_1.4.2                   
##  [28] gtable_0.3.0                  zlibbioc_1.38.0               XVector_0.32.0               
##  [31] DelayedArray_0.18.0           BiocSingular_1.8.1            Rhdf5lib_1.14.2              
##  [34] HDF5Array_1.20.0              scales_1.1.1                  DBI_1.1.1                    
##  [37] edgeR_3.34.0                  Rcpp_1.0.7                    viridisLite_0.4.0            
##  [40] xtable_1.8-4                  progress_1.2.2                dqrng_0.3.0                  
##  [43] bit_4.0.4                     rsvd_1.0.5                    httr_1.4.2                   
##  [46] ellipsis_0.3.2                farver_2.1.0                  pkgconfig_2.0.3              
##  [49] XML_3.99-0.7                  R.methodsS3_1.8.1             sass_0.4.0                   
##  [52] locfit_1.5-9.4                utf8_1.2.2                    labeling_0.4.2               
##  [55] tidyselect_1.1.1              rlang_0.4.11                  later_1.3.0                  
##  [58] cellranger_1.1.0              munsell_0.5.0                 BiocVersion_3.13.1           
##  [61] tools_4.1.2                   cachem_1.0.6                  cli_3.0.1                    
##  [64] generics_0.1.0                RSQLite_2.2.8                 broom_0.7.9                  
##  [67] evaluate_0.14                 fastmap_1.1.0                 yaml_2.2.1                   
##  [70] fs_1.5.0                      knitr_1.33                    bit64_4.0.5                  
##  [73] KEGGREST_1.32.0               sparseMatrixStats_1.4.2       mime_0.11                    
##  [76] R.oo_1.24.0                   xml2_1.3.2                    biomaRt_2.48.3               
##  [79] compiler_4.1.2                rstudioapi_0.13               beeswarm_0.4.0               
##  [82] filelock_1.0.2                curl_4.3.2                    png_0.1-7                    
##  [85] interactiveDisplayBase_1.30.0 reprex_2.0.1                  bslib_0.2.5.1                
##  [88] stringi_1.7.4                 highr_0.9                     lattice_0.20-45              
##  [91] ProtGenerics_1.24.0           Matrix_1.3-4                  vctrs_0.3.8                  
##  [94] pillar_1.6.2                  lifecycle_1.0.0               rhdf5filters_1.4.0           
##  [97] BiocManager_1.30.16           jquerylib_0.1.4               BiocNeighbors_1.10.0         
## [100] cowplot_1.1.1                 bitops_1.0-7                  irlba_2.3.3                  
## [103] httpuv_1.6.2                  rtracklayer_1.52.1            R6_2.5.1                     
## [106] BiocIO_1.2.0                  promises_1.2.0.1              gridExtra_2.3                
## [109] vipor_0.4.5                   codetools_0.2-18              assertthat_0.2.1             
## [112] rhdf5_2.36.0                  rjson_0.2.20                  withr_2.4.2                  
## [115] GenomicAlignments_1.28.0      Rsamtools_2.8.0               GenomeInfoDbData_1.2.6       
## [118] hms_1.1.0                     beachmat_2.8.1                DelayedMatrixStats_1.14.3    
## [121] Cairo_1.5-12.2                shiny_1.6.0                   lubridate_1.7.10             
## [124] ggbeeswarm_0.6.0              restfulr_0.0.13