1 Introduction

In the course materials we performed QC and filtering of 1 sample from each of the sample groups. For this challenge we would like you to perform QC and filtering on all of the samples from the Caron data set.

2 Starting the RStudio server

To start the RStudio server use the following command:

rstudio-scRNAseq

This will send a job to the cluster to start the RStudio server with 8 cores and 32 Gigabytes of memory. Once the job starts you will get a message like this:

To launch the RStudio server:

The RStudio server may not start with the working directory set to your Course_Materials directory, so you may need to change this (once you’ve done this once, it should remember for later sessions.)

3 Load packages

library(DropletUtils)
library(scater)
library(ensembldb)
library(AnnotationHub)
library(BiocParallel)
library(tidyverse)
library(patchwork)
library(ggvenn)

4 Sample meta data

We will load both the Caron and HCA data sets. We have already prepared a sample meta data table that relates the sample/run ID to the sample group.

samplesheet <- read_tsv("Data/sample_sheet.tsv")

NOTE: There are two PBMMC_1 samples. These are two libraries from the same sample material. Later on, we will combine these to form a single sample, but for now we will need to rename them so that they are QC’d separately.

5 Read CellRanger outputs into R

5.1 Parallelisation

We first need to first set up some parallel parameters using the package BiocParallel.

bp.params <- MulticoreParam(workers = 7)

5.2 Load the data

In order to load the CellRanger data for all of the Caron samples, you will first need to create a named vector of the paths to the filtered count matrix folders called list_of_files and then use this in the read10xCounts command.

Hint

The paths to filtered_feature_bc_matrix directories for each sample can be constructed using the SampleId as:

CellRanger_Outputs/SampleId/outs/filtered_feature_bc_matrix

You will need to use a function such as str_c or paste

The names of the vector will determine the sample name used in the counts matrix, this should be the ‘SRR’ number as in the SampleId column of the sample sheet.
Answer
samples_list <- samplesheet %>% 
    filter(DatasetName=="Caron") %>%  
    pull(SampleId)
list_of_files <- str_c("CellRanger_Outputs/", 
                       samples_list, 
                       "/outs/filtered_feature_bc_matrix")
names(list_of_files) <- samples_list

    
sce <- read10xCounts(list_of_files, col.names=TRUE, BPPARAM = bp.params)

5.3 Check samples in the data set

A good sanity check is to look at the colData to ensure that we have all the samples we are expecting and check the overall size of the new object.

colData(sce) %>%
    as.data.frame() %>% 
    select(Sample) %>% 
    distinct()
ABCDEFGHIJ0123456789
 
 
Sample
<chr>
1_AAACCTGAGACTTTCG-1SRR9264343
2_AAACCTGAGCAGACTG-1SRR9264344
3_AAACCTGAGGTGATTA-1SRR9264345
4_AAACCTGAGAGTGACC-1SRR9264346
5_AAACCTGAGCCAGAAC-1SRR9264347
6_AAACCTGAGCTAGCCC-1SRR9264348
7_AAACCTGAGAGCTGCA-1SRR9264349
8_AAACCTGAGAACAACT-1SRR9264350
9_AAACCTGCACTTCGAA-1SRR9264351
10_AAACCTGAGATGAGAG-1SRR9264352
sce
## class: SingleCellExperiment 
## dim: 36601 51286 
## metadata(1): Samples
## assays(1): counts
## rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
##   ENSG00000277196
## rowData names(3): ID Symbol Type
## colnames(51286): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
##   12_TTTGTCATCAGTTGAC-1 12_TTTGTCATCTCGTTTA-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

5.4 Modify the droplet annotation

We need to modify the cell barcodes so that they unique for each sample. We should also modify the names of the two PBMMC_1 samples, so that we can distinguish them from one another.

We will also switch the “Sample” column to be the sample name and add information from the sample sheet to the droplet annotation.

In the code below, replace the “XXXXXXXX” to rename the PBMMC_1 samples appropriately.

colData(sce) <- colData(sce) %>% 
    as.data.frame() %>%
    rownames_to_column("RowName") %>% 
    mutate(SampleNum = str_extract(RowName, "^[0-9]+")) %>%
    mutate(Barcode = str_replace(Barcode, "1$", SampleNum)) %>%
    left_join(samplesheet, by=c(Sample="SampleId")) %>%
    rename(SampleId=Sample) %>% 
    rename(Sample=SampleName) %>%    
    mutate(Sample = case_when(
           SampleId == XXXXXXXX ~ str_c(Sample, "a"),
           SampleId == XXXXXXXX ~ str_c(Sample, "b"),
           TRUE ~ Sample)) %>% 
    column_to_rownames("RowName") %>% 
    select(Sample, Barcode, SampleId, SampleGroup, DatasetName) %>%
    DataFrame()
Hint The new SampleId columns contains the “SRR” numbers, as does the SampleId in the sample sheet. Replace the XXXXXXXXX with the appropriate “SRR” number by checking for the PBMMC_1 samples in the samplesheet.
Answer
colData(sce) <- colData(sce) %>% 
    as.data.frame() %>%
    rownames_to_column("RowName") %>% 
    mutate(SampleNum = str_extract(RowName, "^[0-9]+")) %>%
    mutate(Barcode = str_replace(Barcode, "1$", SampleNum)) %>%
    left_join(samplesheet, by=c(Sample="SampleId")) %>%
    rename(SampleId=Sample) %>% 
    rename(Sample=SampleName) %>%    
    mutate(Sample = case_when(
           SampleId == "SRR9264351" ~ str_c(Sample, "a"),
           SampleId == "SRR9264352" ~ str_c(Sample, "b"),
           TRUE ~ Sample)) %>% 
    column_to_rownames("RowName") %>% 
    select(Sample, Barcode, SampleId, SampleGroup, DatasetName) %>%
    DataFrame()

Before moving on, check that the columns are as we expect them to be and that the PBMMC_1 samples have been renamed:

colData(sce)
## DataFrame with 51286 rows and 5 columns
##                             Sample             Barcode    SampleId SampleGroup
##                        <character>         <character> <character> <character>
## 1_AAACCTGAGACTTTCG-1  ETV6-RUNX1_1  AAACCTGAGACTTTCG-1  SRR9264343  ETV6-RUNX1
## 1_AAACCTGGTCTTCAAG-1  ETV6-RUNX1_1  AAACCTGGTCTTCAAG-1  SRR9264343  ETV6-RUNX1
## 1_AAACCTGGTGCAACTT-1  ETV6-RUNX1_1  AAACCTGGTGCAACTT-1  SRR9264343  ETV6-RUNX1
## 1_AAACCTGGTGTTGAGG-1  ETV6-RUNX1_1  AAACCTGGTGTTGAGG-1  SRR9264343  ETV6-RUNX1
## 1_AAACCTGTCCCAAGTA-1  ETV6-RUNX1_1  AAACCTGTCCCAAGTA-1  SRR9264343  ETV6-RUNX1
## ...                            ...                 ...         ...         ...
## 12_TTTGTCAGTGTGAATA-1      PBMMC_3 TTTGTCAGTGTGAATA-12  SRR9264354       PBMMC
## 12_TTTGTCAGTTCCACGG-1      PBMMC_3 TTTGTCAGTTCCACGG-12  SRR9264354       PBMMC
## 12_TTTGTCAGTTCTCATT-1      PBMMC_3 TTTGTCAGTTCTCATT-12  SRR9264354       PBMMC
## 12_TTTGTCATCAGTTGAC-1      PBMMC_3 TTTGTCATCAGTTGAC-12  SRR9264354       PBMMC
## 12_TTTGTCATCTCGTTTA-1      PBMMC_3 TTTGTCATCTCGTTTA-12  SRR9264354       PBMMC
##                       DatasetName
##                       <character>
## 1_AAACCTGAGACTTTCG-1        Caron
## 1_AAACCTGGTCTTCAAG-1        Caron
## 1_AAACCTGGTGCAACTT-1        Caron
## 1_AAACCTGGTGTTGAGG-1        Caron
## 1_AAACCTGTCCCAAGTA-1        Caron
## ...                           ...
## 12_TTTGTCAGTGTGAATA-1       Caron
## 12_TTTGTCAGTTCCACGG-1       Caron
## 12_TTTGTCAGTTCTCATT-1       Caron
## 12_TTTGTCATCAGTTGAC-1       Caron
## 12_TTTGTCATCTCGTTTA-1       Caron
unique(sce$Sample)
##  [1] "ETV6-RUNX1_1" "ETV6-RUNX1_2" "ETV6-RUNX1_3" "ETV6-RUNX1_4" "HHD_1"       
##  [6] "HHD_2"        "PRE-T_1"      "PRE-T_2"      "PBMMC_1a"     "PBMMC_1b"    
## [11] "PBMMC_2"      "PBMMC_3"

6 Undetected genes

Remove undetected genes.

detected_genes <- rowSums(counts(sce)) > 0
sce <- sce[detected_genes,]

What proportion of genes have been detected

Answer
detected <- sum(detected_genes)/length(detected_genes)
detected
## [1] 0.7753067
Approximately 77.5 percent of genes were detected in at least one sample.

7 Annotate genes

In order to assess the percentage of mitochondrial UMIs, we will need to be able to identify mitochondrial genes. The simplest way to do this is to annotate the genes with their chromosome of origin.

ah <- AnnotationHub()
ens.mm.98 <- query(ah, c("Homo sapiens", "EnsDb", 98))[[1]] 

genes <- rowData(sce)$ID
gene_annot <- AnnotationDbi::select(ens.mm.98, 
                                    keys = genes,
                                    keytype = "GENEID",
                                    columns = c("GENEID", "SEQNAME")) %>%
    set_names(c("ID", "Chromosome"))
rowData(sce) <- merge(rowData(sce), gene_annot, by = "ID", sort=FALSE)
rownames(rowData(sce)) <- rowData(sce)$ID

rowData(sce)
## DataFrame with 28377 rows and 4 columns
##                              ID      Symbol            Type  Chromosome
##                     <character> <character>     <character> <character>
## ENSG00000238009 ENSG00000238009  AL627309.1 Gene Expression           1
## ENSG00000239945 ENSG00000239945  AL627309.3 Gene Expression           1
## ENSG00000241860 ENSG00000241860  AL627309.5 Gene Expression           1
## ENSG00000241599 ENSG00000241599  AL627309.4 Gene Expression           1
## ENSG00000286448 ENSG00000286448  AP006222.2 Gene Expression           1
## ...                         ...         ...             ...         ...
## ENSG00000274792 ENSG00000274792  AC171558.1 Gene Expression  KI270727.1
## ENSG00000275869 ENSG00000275869  AC136612.1 Gene Expression  KI270728.1
## ENSG00000277836 ENSG00000277836  AC141272.1 Gene Expression  KI270728.1
## ENSG00000278633 ENSG00000278633  AC023491.2 Gene Expression  KI270731.1
## ENSG00000278817 ENSG00000278817  AC007325.4 Gene Expression  KI270734.1

8 Add per cell QC metrics

Now add the per cell QC metrics to the droplet annotation using the function addPerCellQC.

is.mito <- which(rowData(sce)$Chromosome=="MT")

sce <- addPerCellQC(sce, subsets=list(Mito=is.mito), BPPARAM = bp.params)

9 Explore QC metric distribution

Before moving on to do the actual cell filtering, it is always a good idea to explore the distribution of the metrics across the droplets.

Use the scater function plotColData to generate plots showing the distributions of the total number of UMIs, the number of genes detected and percentage of UMIs aligned to mitochondrial genes across all cells for each sample.

Answer
plotColData(sce, x="Sample", y="sum",other_fields="SampleGroup") + 
    facet_wrap(~SampleGroup, nrow=1, scales = "free_x") + 
    scale_y_log10() + 
    ggtitle("Total count")
plotColData(sce, x="Sample", y="detected", other_fields="SampleGroup") + 
    facet_wrap(~SampleGroup, nrow=1, scales = "free_x") + 
    scale_y_log10() + 
    ggtitle("Detected features")
plotColData(sce, x="Sample", y="subsets_Mito_percent", other_fields="SampleGroup") + 
    facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
    ggtitle("Mito percent")

10 Identification of low-quality cells with adaptive thresholds

Use the scater function quickPerCellQC to assess cell quality based on the three metrics. Name the object generated cell_qc_results.

When running the command, consider the distribution plots above and decide whether to use the batch option and if so, at what level it should be applied.

How many cells will be removed from the data set?

Hint With these samples we have three possible “batch” levels at which we could run adaptive filtering. We could apply the filtering across all samples together (i.e. no batch), we could apply it by sample group (ETV6-RUNX1, HHD, PBMMC, PRE-T) (batch=sce$SampleGroup), or we could apply it per sample (batch=sce$Sample)
Answer
# at sample level
cell_qc_results <- quickPerCellQC(colData(sce),
                                percent_subsets=c("subsets_Mito_percent"),
                                batch=sce$Sample)
as.data.frame(cell_qc_results) %>% summarise(across(everything(), sum))
ABCDEFGHIJ0123456789
low_lib_size
<int>
low_n_features
<int>
high_subsets_Mito_percent
<int>
discard
<int>
425123538884715
# at sample group level
cell_qc_results_gGroup <- quickPerCellQC(colData(sce),
                                percent_subsets=c("subsets_Mito_percent"),
                                batch=sce$SampleGroup)
as.data.frame(cell_qc_results_gGroup) %>% summarise(across(everything(), sum))
ABCDEFGHIJ0123456789
low_lib_size
<int>
low_n_features
<int>
high_subsets_Mito_percent
<int>
discard
<int>
0126941295178
# at all samples
cell_qc_results_all <- quickPerCellQC(colData(sce),
                                percent_subsets=c("subsets_Mito_percent")
                                )
as.data.frame(cell_qc_results_all) %>% summarise(across(everything(), sum))
ABCDEFGHIJ0123456789
low_lib_size
<int>
low_n_features
<int>
high_subsets_Mito_percent
<int>
discard
<int>
057839604483

In total 0 cells will be removed from the dataset.

Let’s replace the columns in the droplet annotation with these new filters.

sce$low_lib_size <- cell_qc_results$low_lib_size
sce$low_n_features <- cell_qc_results$low_n_features
sce$high_Mito_percent <- cell_qc_results$high_subsets_Mito_percent
sce$discard <- cell_qc_results$discard

We can visualise how the new filters look using violin plots.

plotColData(sce, 
            x="Sample", 
            y="sum",
            other_fields="SampleGroup", 
            colour_by = "low_lib_size") + 
    facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") + 
    scale_y_log10() + 
    labs(y = "Total count", title = "Total count") +
    guides(colour=guide_legend(title="Discarded"))

plotColData(sce, 
            x="Sample", 
            y="detected",
            other_fields="SampleGroup", 
            colour_by = "low_n_features") + 
    facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") + 
    scale_y_log10() + 
    labs(y = "Genes detected", title = "Genes detected") +
    guides(colour=guide_legend(title="Discarded"))

plotColData(sce, 
        x="Sample", 
        y="subsets_Mito_percent",
        other_fields="SampleGroup", 
        colour_by = "high_Mito_percent") + 
    facet_wrap(vars(SampleGroup), nrow=1, scales = "free_x") + 
    labs(y = "Percentage mitochondrial UMIs",
         title = "Mitochondrial UMIs") +
    guides(colour=guide_legend(title="Discarded"))

11 Filtering out poor quality cells

Filter out the poor quality cells, recalculate the cell QC metrics and save the filtered object.

sce <- sce[, !sce$discard]
colData(sce) <- colData(sce)[,1:3]
sce <- addPerCellQC(sce, BPPARAM = bp.params)
saveRDS(sce, "../results/Caron_filtered_genes.rds")

12 Filtering genes based on sparsity - Bonus exercise

So far we have only eliminated genes that are undetected across all samples. In reality there will be a large number of genes that are only detected at low levels in a small number of cells - much of this will just be due to technical noise. We could further filter the genes using “sparsity”.

Look at section 8 of the documention. Filter the dataset to remove genes that have only been detected in fewer than 20 cells.

How many additional genes are filtered out?

Answer
sce <- addPerFeatureQC(sce, BPPARAM = bp.params)
rowData(sce)$gene_sparsity <- (100 - rowData(sce)$detected) / 100
max.sparsity <- 1 - (20 / ncol(sce))
sparse.genes <- rowData(sce)$gene_sparsity > max.sparsity
sce <- sce[!sparse.genes, ]
sum(sparse.genes)
## [1] 8410

An additional 8410 have been filtered out.

13 Session information

sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggvenn_0.1.9                patchwork_1.1.1            
##  [3] forcats_0.5.1               stringr_1.4.0              
##  [5] dplyr_1.0.8                 purrr_0.3.4                
##  [7] readr_2.1.2                 tidyr_1.2.0                
##  [9] tibble_3.1.6                tidyverse_1.3.1            
## [11] BiocParallel_1.28.3         AnnotationHub_3.2.1        
## [13] BiocFileCache_2.2.1         dbplyr_2.1.1               
## [15] ensembldb_2.18.3            AnnotationFilter_1.18.0    
## [17] GenomicFeatures_1.46.5      AnnotationDbi_1.56.2       
## [19] scater_1.22.0               ggplot2_3.3.5              
## [21] scuttle_1.4.0               DropletUtils_1.14.2        
## [23] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
## [25] Biobase_2.54.0              GenomicRanges_1.46.1       
## [27] GenomeInfoDb_1.30.1         IRanges_2.28.0             
## [29] S4Vectors_0.32.3            BiocGenerics_0.40.0        
## [31] MatrixGenerics_1.6.0        matrixStats_0.61.0         
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1                  backports_1.4.1              
##   [3] lazyeval_0.2.2                digest_0.6.29                
##   [5] htmltools_0.5.2               viridis_0.6.2                
##   [7] fansi_1.0.2                   magrittr_2.0.2               
##   [9] memoise_2.0.1                 ScaledMatrix_1.2.0           
##  [11] tzdb_0.2.0                    limma_3.50.1                 
##  [13] Biostrings_2.62.0             modelr_0.1.8                 
##  [15] vroom_1.5.7                   R.utils_2.11.0               
##  [17] prettyunits_1.1.1             colorspace_2.0-3             
##  [19] rvest_1.0.2                   blob_1.2.2                   
##  [21] rappdirs_0.3.3                ggrepel_0.9.1                
##  [23] haven_2.4.3                   xfun_0.29                    
##  [25] crayon_1.5.0                  RCurl_1.98-1.6               
##  [27] jsonlite_1.8.0                glue_1.6.2                   
##  [29] gtable_0.3.0                  zlibbioc_1.40.0              
##  [31] XVector_0.34.0                DelayedArray_0.20.0          
##  [33] BiocSingular_1.10.0           Rhdf5lib_1.16.0              
##  [35] HDF5Array_1.22.1              scales_1.1.1                 
##  [37] DBI_1.1.2                     edgeR_3.36.0                 
##  [39] Rcpp_1.0.8                    viridisLite_0.4.0            
##  [41] xtable_1.8-4                  progress_1.2.2               
##  [43] dqrng_0.3.0                   bit_4.0.4                    
##  [45] rsvd_1.0.5                    httr_1.4.2                   
##  [47] ellipsis_0.3.2                farver_2.1.0                 
##  [49] pkgconfig_2.0.3               XML_3.99-0.9                 
##  [51] R.methodsS3_1.8.1             sass_0.4.0                   
##  [53] locfit_1.5-9.4                utf8_1.2.2                   
##  [55] labeling_0.4.2                tidyselect_1.1.2             
##  [57] rlang_1.0.1                   later_1.3.0                  
##  [59] cellranger_1.1.0              munsell_0.5.0                
##  [61] BiocVersion_3.14.0            tools_4.1.3                  
##  [63] cachem_1.0.6                  cli_3.2.0                    
##  [65] generics_0.1.2                RSQLite_2.2.10               
##  [67] broom_0.7.12                  evaluate_0.15                
##  [69] fastmap_1.1.0                 yaml_2.3.5                   
##  [71] fs_1.5.2                      knitr_1.37                   
##  [73] bit64_4.0.5                   KEGGREST_1.34.0              
##  [75] sparseMatrixStats_1.6.0       mime_0.12                    
##  [77] R.oo_1.24.0                   xml2_1.3.3                   
##  [79] biomaRt_2.50.3                compiler_4.1.3               
##  [81] rstudioapi_0.13               beeswarm_0.4.0               
##  [83] filelock_1.0.2                curl_4.3.2                   
##  [85] png_0.1-7                     interactiveDisplayBase_1.32.0
##  [87] reprex_2.0.1                  bslib_0.3.1                  
##  [89] stringi_1.7.6                 highr_0.9                    
##  [91] lattice_0.20-45               ProtGenerics_1.26.0          
##  [93] Matrix_1.4-0                  vctrs_0.3.8                  
##  [95] pillar_1.7.0                  lifecycle_1.0.1              
##  [97] rhdf5filters_1.6.0            BiocManager_1.30.16          
##  [99] jquerylib_0.1.4               BiocNeighbors_1.12.0         
## [101] cowplot_1.1.1                 bitops_1.0-7                 
## [103] irlba_2.3.5                   httpuv_1.6.5                 
## [105] rtracklayer_1.54.0            R6_2.5.1                     
## [107] BiocIO_1.4.0                  promises_1.2.0.1             
## [109] gridExtra_2.3                 vipor_0.4.5                  
## [111] codetools_0.2-18              assertthat_0.2.1             
## [113] rhdf5_2.38.0                  rjson_0.2.21                 
## [115] withr_2.4.3                   GenomicAlignments_1.30.0     
## [117] Rsamtools_2.10.0              GenomeInfoDbData_1.2.7       
## [119] parallel_4.1.3                hms_1.1.1                    
## [121] beachmat_2.10.0               rmarkdown_2.11               
## [123] DelayedMatrixStats_1.16.0     Cairo_1.5-12.2               
## [125] lubridate_1.8.0               shiny_1.7.1                  
## [127] ggbeeswarm_0.6.0              restfulr_0.0.13