Introduction

In the course materials we performed QC and filtering of 2 samples from each of the sample groups. For this challenge we would like you to perform QC and filtering on all of the samples from the Caron data set.

We have prepared a SingleCellExperiment object with the filtered count data from CellRanger already loaded. We have also already added the sample meta data and annotated the genes with their chromosome. Essentially, we have run all the steps up to and including section 6.0.1 in the course materials book.

While running the QC and particularly the filtering, you should consider the experimental set up and will need to decide on whether to apply the filters across all samples at once or whether you should separate the samples into batches for the adapative filters, and at what level the “batch” should be defined.

Load packages

library(DropletUtils)
library(scater)
library(BiocParallel)
library(tidyverse)
library(patchwork)
library(ggvenn)
library(DT)

Load the data

Sample meta data

We will load both the Caron and HCA data sets. We have already prepared a sample meta data table that relates the sample/run ID to the sample group. This is the metadata for all samples from both data sets.

samplesheet <- read_tsv("../Data/sample_sheet.tsv")

samplesheet %>%
    as.data.frame() %>%
    datatable(rownames = FALSE, options = list(dom="tpl", nrows=20))

The scRNAseq count data from CellRanger

Parallelisation

We first need to first set up some parallel parameters using the package BiocParallel.

bp.params <- MulticoreParam(workers = 7)

Load the data

sce <- readRDS("../Robjects/Caron_data.sce.annot.rds")

QC

Check samples in the data set

Use the colData function to check that the samples you have in the sce object are the ones you are expecting (all of the Caron data set).

colData(sce) %>%
    as.data.frame() %>% 
    select(Sample, SampleId, SampleGroup, DatasetName) %>% 
    distinct()

##                             Sample   SampleId SampleGroup DatasetName
## 1_AAACCTGAGACTTTCG-1  ETV6-RUNX1_1 SRR9264343  ETV6-RUNX1       Caron
## 2_AAACCTGAGCAGACTG-1  ETV6-RUNX1_2 SRR9264344  ETV6-RUNX1       Caron
## 3_AAACCTGAGGTGATTA-1  ETV6-RUNX1_3 SRR9264345  ETV6-RUNX1       Caron
## 4_AAACCTGAGAGTGACC-1  ETV6-RUNX1_4 SRR9264346  ETV6-RUNX1       Caron
## 5_AAACCTGAGCCAGAAC-1         HHD_1 SRR9264347         HHD       Caron
## 6_AAACCTGAGCTAGCCC-1         HHD_2 SRR9264348         HHD       Caron
## 7_AAACCTGAGAGCTGCA-1       PRE-T_1 SRR9264349       PRE-T       Caron
## 8_AAACCTGAGAACAACT-1       PRE-T_2 SRR9264350       PRE-T       Caron
## 9_AAACCTGCACTTCGAA-1      PBMMC_1a SRR9264351       PBMMC       Caron
## 10_AAACCTGAGATGAGAG-1     PBMMC_1b SRR9264352       PBMMC       Caron
## 11_AAACCTGAGAAACGAG-1      PBMMC_2 SRR9264353       PBMMC       Caron
## 12_AAACCTGAGGACAGCT-1      PBMMC_3 SRR9264354       PBMMC       Caron

Filter undetected genes

Although the count matrix has genes, many of these will not have been detected in any droplet.

What fraction of the genes are not detected?

detected_genes <- rowSums(counts(sce)) > 0
table(detected_genes)

## detected_genes
## FALSE  TRUE 
##  8224 28377

Remove these before proceeding in order to reduce the size of the single cell experiment object.

sce <- sce[detected_genes,]

Quality control

We will look at three QC metrics:

library size
number of expressed genes in each cell
proportion of UMIs mapped to genes in the mitochondrial genome

Add per cell QC metrics

Add the per cell QC metrics to the droplet annotation using the function addPerCellQC. In order to get the percentage of mitochondrial UMIs in each cell, your will need to pass the function a vector indicating which genes are mitochondrial.

is.mito <- which(rowData(sce)$Chromosome=="MT")

sce <- addPerCellQC(sce, subsets=list(Mito=is.mito), BPPARAM = bp.params)

Check that the function has added six columns to the droplet annotation:

sum: total UMI count
detected: number of features (genes) detected
subsets_Mito_sum: number of UMIs mapped to mitochondrial transcripts
subsets_Mito_detected: number of mitochondrial genes detected
subsets_Mito_percent: percentage of reads mapped to mitochondrial transcripts
total: also the total UMI count

head(colData(sce)) %>%
    as.data.frame() %>%
    datatable(rownames = FALSE)

QC metric distribution

Plot the distributions of the three metrics of interest for each sample. We have given you the code for the library size, you will need to modify this for the other two metrics. Check the course materials if you need to figure out which column to use.

Library Size

plotColData(sce, x="Sample", y="sum",other_fields="SampleGroup") + 
        facet_wrap(~SampleGroup, nrow=1, scales = "free_x") + 
        scale_y_log10() + 
        ggtitle("Total count")

Number of Genes detected

plotColData(sce, x="Sample", y="detected", other_fields="SampleGroup") + 
        facet_wrap(~SampleGroup, nrow=1, scales = "free_x") + 
        scale_y_log10() + 
        ggtitle("Detected features")

Percentage of UMIs mapped to mitochondrial transcripts

plotColData(sce, x="Sample", y="subsets_Mito_percent", other_fields="SampleGroup") + 
        facet_wrap(~SampleGroup, nrow=1, scales = "free_x") +
        ggtitle("Mito percent")

Identification of low-quality cells with adaptive thresholds

First we will apply the adaptive threshold filtering across all samples together. We have given you the code for filtering the by library size, you will need to add the commands for filtering on the number of genes detected and the mitochondrial content (make sure you filter the correct tail of the distribution).

Add the filters

low_lib_size <- isOutlier(sce$sum, log=TRUE, type="lower")
sce$low_lib_size <- low_lib_size

Number of Genes detected

low_n_features <- isOutlier(sce$detected, log=TRUE, type="lower")
sce$low_n_features <- low_n_features

Percentage of UMIs mapped to mitochondrial transcripts

high_Mito_percent <- isOutlier(sce$subsets_Mito_percent, type="higher")
sce$high_Mito_percent <- high_Mito_percent

Visualise results

Let’s have a look at how these filters would affect the data set.

Library Size

plotColData(sce, 
            x="Sample", 
            y="sum",
            other_fields="SampleGroup", 
            colour_by = "low_lib_size") + 
        facet_wrap(~SampleGroup, nrow=1, scales = "free_x") + 
        scale_y_log10() + 
        labs(y = "Total count", title = "Total count") +
        guides(colour=guide_legend(title="Discarded"))

Number of Genes detected

plotColData(sce, 
            x="Sample", 
            y="detected",
            other_fields="SampleGroup", 
            colour_by = "low_n_features") + 
        facet_wrap(~SampleGroup, nrow=1, scales = "free_x") + 
        scale_y_log10() + 
        labs(y = "Genes detected", title = "Genes detected") +
        guides(colour=guide_legend(title="Discarded"))

Percentage of UMIs mapped to mitochondrial transcripts

plotColData(sce, 
            x="Sample", 
            y="subsets_Mito_percent",
            other_fields="SampleGroup", 
            colour_by = "high_Mito_percent") + 
        facet_wrap(~SampleGroup, nrow=1, scales = "free_x") + 
        labs(y = "Percentage mitochondrial UMIs",
             title = "Mitochondrial UMIs") +
        guides(colour=guide_legend(title="Discarded"))

Considering experimental factors when filtering

Based on the experimental design and the plots above, d you think applying the adaptive filtering across all the samples is appropriate?

Perhaps a batched approach is more suitable. This time use the quickPerCellQC function to perform all three filters and add a parameter for the batch (this could be any column in the colData or you could add a new one).

batch.reasons <- quickPerCellQC(colData(sce),
                                percent_subsets=c("subsets_Mito_percent"),
                                batch=sce$Sample)

The table below shows how the thresholds for each metric differ between the batch-wise analysis and the analysis using all samples.

all.thresholds <- tibble(`Batch`="All",
       `Library Size`=attr(low_lib_size, "thresholds")[1],
       `Genes detected`=attr(low_n_features, "thresholds")[1],
       `Mitochondrial UMIs`=attr(high_Mito_percent, "thresholds")[2])


tibble(`Batch`=names(attr(batch.reasons$low_lib_size, "thresholds")[1,]),
       `Library Size`=attr(batch.reasons$low_lib_size, "thresholds")[1,],
       `Genes detected`=attr(batch.reasons$low_n_features, "thresholds")[1,],
       `Mitochondrial UMIs`=attr(batch.reasons$high_subsets_Mito_percent, "thresholds")[2,]) %>% 
    bind_rows(all.thresholds) %>% 
    mutate(across(where(is.numeric), round, digits=2)) %>% 
    datatable(rownames = FALSE)

Filter the data

Once you have decided how best to filter the data you should remove the droplets that do not pass your criteria from the data set.

Exclude poor-quality cells from the SingleCellExperiment object and save the final object to the results directory for later use.

sce.Filtered <- sce[,!batch.reasons$discard]
saveRDS(sce.Filtered, "results/Caron_filtered.rds")

Optional: Filter genes

You may additionally wish to filter the genes further to remove very sparse genes (see section 12). Remember that you will need to recalculate the colData metrics, as some of them were calculated across all genes. Save this final object as results/Caron_filtered_genes.rds.

sce <- sce.Filtered
sce <- addPerFeatureQC(sce, BPPARAM = bp.params)
rowData(sce)$gene_sparsity <- (100 - rowData(sce)$detected) / 100
min.cells <- 1 - (20 / ncol(sce))
sparse.genes <- rowData(sce)$gene_sparsity > min.cells

Generate a table showing the number of genes removed

table(sparse.genes)

## sparse.genes
## FALSE  TRUE 
## 19967  8410

Filter the genes.

sce <- sce[!sparse.genes, ]

Replace the Cell metrics

colData(sce) <- colData(sce)[,1:3]
sce <- addPerCellQC(sce, BPPARAM = bp.params)

Modify the sample names for PBMMC_1

Remove the suffix letters so that the droplets are assigned to the same sample - note that we can still distinguish the libraries based on the “SampleId” column.

colData(sce) <- colData(sce) %>%
    as.data.frame() %>% 
    mutate(across(Sample, str_remove, "[ab]$")) %>%
    DataFrame()

Export the final obect

Write the object out to a file in the results directory for later use.

saveRDS(sce, "results/Caron_filtered_genes.rds")

Session information

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.18                     ggvenn_0.1.9               
##  [3] patchwork_1.1.1             forcats_0.5.1              
##  [5] stringr_1.4.0               dplyr_1.0.6                
##  [7] purrr_0.3.4                 readr_1.4.0                
##  [9] tidyr_1.1.3                 tibble_3.1.2               
## [11] tidyverse_1.3.1             BiocParallel_1.26.0        
## [13] scater_1.20.0               ggplot2_3.3.3              
## [15] scuttle_1.2.0               DropletUtils_1.12.1        
## [17] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
## [19] Biobase_2.52.0              GenomicRanges_1.44.0       
## [21] GenomeInfoDb_1.28.0         IRanges_2.26.0             
## [23] S4Vectors_0.30.0            BiocGenerics_0.38.0        
## [25] MatrixGenerics_1.4.0        matrixStats_0.59.0         
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.0                  bitops_1.0-7             
##  [3] lubridate_1.7.10          httr_1.4.2               
##  [5] tools_4.1.0               backports_1.2.1          
##  [7] bslib_0.2.5.1             utf8_1.2.1               
##  [9] R6_2.5.0                  irlba_2.3.3              
## [11] HDF5Array_1.20.0          vipor_0.4.5              
## [13] DBI_1.1.1                 colorspace_2.0-1         
## [15] rhdf5filters_1.4.0        withr_2.4.2              
## [17] tidyselect_1.1.1          gridExtra_2.3            
## [19] compiler_4.1.0            cli_2.5.0                
## [21] rvest_1.0.0               BiocNeighbors_1.10.0     
## [23] xml2_1.3.2                DelayedArray_0.18.0      
## [25] sass_0.4.0                scales_1.1.1             
## [27] digest_0.6.27             rmarkdown_2.8            
## [29] R.utils_2.10.1            XVector_0.32.0           
## [31] pkgconfig_2.0.3           htmltools_0.5.1.1        
## [33] sparseMatrixStats_1.4.0   dbplyr_2.1.1             
## [35] limma_3.48.0              htmlwidgets_1.5.3        
## [37] readxl_1.3.1              rlang_0.4.11             
## [39] rstudioapi_0.13           DelayedMatrixStats_1.14.0
## [41] jquerylib_0.1.4           generics_0.1.0           
## [43] jsonlite_1.7.2            R.oo_1.24.0              
## [45] RCurl_1.98-1.3            magrittr_2.0.1           
## [47] BiocSingular_1.8.1        GenomeInfoDbData_1.2.6   
## [49] Matrix_1.3-4              Rcpp_1.0.6               
## [51] ggbeeswarm_0.6.0          munsell_0.5.0            
## [53] Rhdf5lib_1.14.1           fansi_0.5.0              
## [55] viridis_0.6.1             lifecycle_1.0.0          
## [57] R.methodsS3_1.8.1         stringi_1.6.2            
## [59] yaml_2.2.1                edgeR_3.34.0             
## [61] zlibbioc_1.38.0           rhdf5_2.36.0             
## [63] dqrng_0.3.0               crayon_1.4.1             
## [65] lattice_0.20-44           haven_2.4.1              
## [67] beachmat_2.8.0            hms_1.1.0                
## [69] locfit_1.5-9.4            knitr_1.33               
## [71] pillar_1.6.1              codetools_0.2-18         
## [73] ScaledMatrix_1.0.0        reprex_2.0.0             
## [75] glue_1.4.2                evaluate_0.14            
## [77] modelr_0.1.8              vctrs_0.3.8              
## [79] cellranger_1.1.0          gtable_0.3.0             
## [81] assertthat_0.2.1          xfun_0.23                
## [83] rsvd_1.0.5                broom_0.7.7              
## [85] viridisLite_0.4.0         beeswarm_0.4.0           
## [87] ellipsis_0.3.2

CRUK CI Summer School 2021 - introduction to single-cell RNA-seq analysis

Quality Control - Exercise

Stephane Ballereau, Ashley Sawle

Introduction

Load packages

Load the data

Sample meta data

The scRNAseq count data from CellRanger

Parallelisation

Load the data

QC

Check samples in the data set

Filter undetected genes

Quality control

Add per cell QC metrics

QC metric distribution

Library Size

Number of Genes detected

Percentage of UMIs mapped to mitochondrial transcripts

Identification of low-quality cells with adaptive thresholds

Add the filters

Number of Genes detected

Percentage of UMIs mapped to mitochondrial transcripts

Visualise results

Library Size

Number of Genes detected

Percentage of UMIs mapped to mitochondrial transcripts

Considering experimental factors when filtering

Filter the data

Optional: Filter genes

Modify the sample names for PBMMC_1

Export the final obect

Session information