Preamble

This tutorial is primarily concerned with understanding how to quality control and normalise single cell RNA sequencing data. We will be using data from a droplet-based platform (10X Genomics). Many of the principals and steps, such as quality control and normalization factor estimation, will also hold true for data derived from plate-based or microfluidics technologies, such as SMART-seq2.

We will use data derived from the paper: Single-cell profiling of breast cancer T cells reveals a tissue-resident memory subset associated with improved prognosis. These are specifically CD3+CD45+ T cells derived from two patients with triple negative breast cancer. There are more specific details in the paper, but they have sequenced ~6000 cells. For the sake of computational cost and time we will only use a subset of ~1500 cells from these data.

The raw ingredients

We have already undertaken the first steps in processing the data for the sake of time. Namely, getting from the raw BCL files, converting to FASTQ files and then to a matrix of gene expression. Normally we would work initially from FASTQ files, but in this case only the BCL files were made publicly available. We have used the cellranger pipeline, which is executed from the command line. Specifically we used the following sets of commands to convert the BCL files to FASTQ, then to get the gene expression matrix from the FASTQ files:

cellranger mkfastq \
    --run=/ \
    --id=  \
    --sample-sheet= \
    --localcores=24 \
    --localmem=48 \
    --maxjobs=12 \

The first part cellranger mkfastq calls the cellranger binary tells it that we will be using the mkfastq tool. The full documentation for the cellranger pipeline is available here. Most of the time this first step isn’t required as you are likely to receive your data in FASTQ format. We then define the name of the sample (--id) and the location of the BCL files as an absolute path (--run). It is necessary to provide a mapping between the different BCL file IDs and the various sample indexes used in the experiment (--samplesheet). The extra flags are to control the amount of cluster resources used to convert to FASTQ files.

cellranger count --sample="" \
                 --id="" \
                 --transcriptome="" \
                 --fastqs="" \
                 --lanes="" \
                 --localcores=24 \
                 --localmem=48 \
                 --maxjobs=12

Once we have FASTQ files (3 per sample, i.e. Read 1, Read 2 and Indices), we can begin the task of aligning the reads, deduplication and quantifying gene expression using the UMIs. The first part cellranger count calls the cellranger binary and tells it that we will be using the count tool, we can also pass an ID to cellranger that makes sense to us as the user (--id). We have to provide cellranger with the location of both our genome and the genome annotation used so we can assign reads to genes (--transcriptome), and the location of the FASTQ files (--fastqs).

The final output of the cellranger pipeline, amongst other things, is a folder which contains the raw and filtered data. The raw data contains all cell barcodes that were included for that sample on the 10X chip, whilst the filtered data contains only data for cells which have been called valid by the cellranger pipeline.
There is some argument to using alternative methods to reclaim more cells as the the cellranger algorithm is generally conservative. However, more cells may come at the expense of lower quality. For now we’ll trust the cellranger algorithm to call the cells properly, and use the filtered data.

The directory structure of the pipeline output is as follows:

<SAMPLE>
   | outs/
      -- filtered_gene_bc_matrices_h5.h5
      -- metrics_summary.csv
      -- molecule_info.h5
      -- possorted_genome_bam.bam
      -- possorted_genome_bam.bam.bai
      -- raw_gene_bc_matrices_h5.h5
      -- web_summary.html
      | analysis/
      | filtered_gene_bc_matrices/
          | <genome>/
              -- barcodes.tsv
              -- genes.tsv
              -- matrix.mtx
      | raw_gene_bc_matrices/
          | <genome>/
              -- barcodes.tsv
              -- genes.tsv
              -- matrix.mtx

The parts we are most interested in are contained in the directory filtered_gene_bc_matrices. There are also a couple of other files that makes use of a format called HDF5. This is an efficient file format for storing large amounts of data on disk in a directory-like structure (NB: the structure is internal to the file so you’ll never see it explicitly in the standard filesystem). The standard cellrangerRkit library can access these data from the HDF5 format, or we can use an alternative library that uses the files in the filtered_gene_bc_matrices subfolder directory. We’ll opt for the latter as it is a more flexible option.

Setting up our R session

Before we begin on our single-cell journey, we need to load in several packages. These come from either the Bioconductor community, or directly from CRAN.

library(scran)
library(scater)
library(DropletUtils)
library(ggplot2)
library(SingleCellExperiment)
library(igraph)
library(Matrix)
library(biomaRt)
library(irlba)

sample1.path <- "~/CI_filesystem/mnt/nas2-data/jmlab/shared_folders/bioinformatics/20180516_FernandesM_ME_crukBioinfSumSch2018/HumanBreastTCells/GRCh38/Tils20_1/outs/filtered_gene_bc_matrices/GRCh38/"
sample2.path <- "~/CI_filesystem/mnt/nas2-data/jmlab/shared_folders/bioinformatics/20180516_FernandesM_ME_crukBioinfSumSch2018/HumanBreastTCells/GRCh38/Tils20_2/outs/filtered_gene_bc_matrices/GRCh38/"
sample3.path <- "~/CI_filesystem/mnt/nas2-data/jmlab/shared_folders/bioinformatics/20180516_FernandesM_ME_crukBioinfSumSch2018/HumanBreastTCells/GRCh38/Tils32/outs/filtered_gene_bc_matrices/GRCh38/"

Loading the data

To load the data into our R session, we will make use of the DropletUtils package developed by Aaron Lun, Jonny Griffiths and Davis McCarthy. This has several useful functions, the one which we are interested in is read10XCounts(). This will create an object called a SingleCellExperiment, that is based on the ExpressionSet objects you may have used for other bulk RNAseq analyses. It is a convenient way for us to store information on a single experiment, including meta-data on the samples and cells, as well as QC information, and feature annotations.

We’ll read in each sample separately, as we will show the effects of normalizing the samples together, and separately. Samples Tils20_1 and Tils20_2 are separate runs from the same donor, so it makes sense to combine them for normalization, but is this also true for our other sample, Tils32?

sample1.sce <- read10xCounts(sample1.path)
sample2.sce <- read10xCounts(sample2.path)
sample3.sce <- read10xCounts(sample3.path)

# we can inspect the SingleCellExperiment object
sample1.sce
## class: SingleCellExperiment 
## dim: 33694 2452 
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ...
##   ENSG00000277475 ENSG00000268674
## rowData names(2): ID Symbol
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):

Internally there is a sparse matrix representation of the data, where the columns are the cells (cell barcodes), and the rows are the genes. In this case we make use of the Ensembl gene IDs as they are unique to each gene annotation. Specifically the sparse matrix is a dgCMatrix object from the Matrix package. The basis for using this object is that 0’s are not stored explicitly, therefore reducing the overall memory burden for large data.

NB: As soon as we start to normalize and transform the matrix, it becomes a dense matrix because we will add a small pseudocount value to each cell (the logarithm of 0 is undefined).

For the purposes of QC, we almost certainly want to combine our samples together, so that they all have the same genes, and that we apply the sample thresholds across samples. The easiest way to do this is to create a single large matrix for all of the cells, and create the relevant meta-data for these samples as well.
This meta-data can then be saved and used for later data-exploration if needed.

samp1.counts <- counts(sample1.sce)
samp2.counts <- counts(sample2.sce)
samp3.counts <- counts(sample3.sce)

# first check that all of these samples have the genes in the same order
all(c(rownames(samp1.counts) == rownames(samp2.counts), rownames(samp2.counts) == rownames(samp3.counts)))
## [1] TRUE

As the rownames and of sample 1 and sample 2 are identical (and in the same order), and the rownames of smaple 2 and sample 3 are identical (and in the same order), it logically follows that the rownames of samples 1 and 3 are also identical, and in the same order. This massively simplifies merging the three matrices together. We could specifically merge the matrices together based on their rownames, if they were not in the same order, but this is computationally more expensive than just gluing the columns together. NB: Always check that the rownames match up properly before merging, this has the potential to introduce some truly horrific problems if they are not.

# we'll iteratively combine the matrix columns together, and create a new SingleCellExperiment object
all.genes <- rownames(samp1.counts)
all.counts <- do.call(cbind,
                      list("Tils20_1"=samp1.counts,
                           "Tils20_2"=samp2.counts,
                           "Tils32"=samp3.counts))
# set the rownames as the gene IDs
rownames(all.counts) <- all.genes

# create a bit of meta-data for our cells
sample.id <- c(rep("Tils20_1", ncol(samp1.counts)), rep("Tils20_2", ncol(samp2.counts)), rep("Tils32", ncol(samp3.counts)))
sample.cells <- c(colData(sample1.sce)$Barcode, colData(sample2.sce)$Barcode, colData(sample3.sce)$Barcode)

tcell.sce <- SingleCellExperiment(list(counts=all.counts), 
                                  colData=data.frame("Sample"=sample.id, "Barcode"=as.character(sample.cells)))
rowData(tcell.sce) <- data.frame("Gene"=all.genes)
rowData(tcell.sce)$Gene <- as.character(rowData(tcell.sce)$Gene)
tcell.sce
## class: SingleCellExperiment 
## dim: 33694 6322 
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ...
##   ENSG00000277475 ENSG00000268674
## rowData names(1): Gene
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):

We now have a single object for all ~8000 cells and 33,000 genes. First we want to remove superflous genes, i.e. those with no expression at all.

not.expressed <- rowSums(counts(tcell.sce)) == 0
table(not.expressed)
## not.expressed
## FALSE  TRUE 
## 19148 14546

We can straight away remove ~18,000 genes which are completely useless to us.

# store the cell-wise information
cols.meta <- colData(tcell.sce)
rows.meta <- rowData(tcell.sce)

nz.counts <- counts(tcell.sce)[!not.expressed, ]
tcell.sce <- SingleCellExperiment(list(counts=nz.counts))

# reset the column data on the new object
colData(tcell.sce) <- cols.meta
rowData(tcell.sce) <- rows.meta[!not.expressed, ]

We want to provide some information on each gene, for instance where its genome position is and which chromosome it is encoded on. We can also add a mapping between the Ensembl gene IDs and the more familiar HGNC gene symbols. We’ll use biomaRt to retrieve the relevant information from the Ensembl database.

# retrieve the feature information
gene.info <- rowData(tcell.sce)

# setup the biomaRt connection to Ensembl using the correct species genome (hsapiens_gene_ensembl)
ensembl <- useEnsembl(biomart='ensembl', dataset='hsapiens_gene_ensembl')

# retrieve the attributes of interest from biomaRt using the Ensembl gene ID as the key
# beware that this will only retrieve information for matching IDs
gene_symbol <- getBM(attributes=c('ensembl_gene_id', 'external_gene_name', 'chromosome_name', 'start_position', 'end_position', 'strand'),
                     filters='ensembl_gene_id', mart=ensembl,
                     values=gene.info[, 1])

# create a new data frame of the feature information
gene.merge <- merge(gene_symbol, gene.info, by.x=c('ensembl_gene_id'), by.y=c('value'), all.y=TRUE)
rownames(gene.merge) <- gene.merge$ensembl_gene_id

# set the order for the same as the original gene information
gene.merge <- gene.merge[gene.info[, 1], ]

# reset the rowdata on the SCE object to contain all of this information
rowData(tcell.sce) <- gene.merge

We can quickly inspect the distribution of gene expression across the different chromosomes:

table(rowData(tcell.sce)$chromosome_name)
## 
##          1         10         11         12         13         14 
##       1928        684       1022       1033        328        687 
##         15         16         17         18         19          2 
##        635        942       1197        300       1373       1256 
##         20         21         22          3          4          5 
##        492        219        490       1043        655        842 
##          6          7          8          9 GL000009.2 GL000194.1 
##        990        893        674        711          1          2 
## GL000195.1 GL000218.1 GL000219.1 KI270711.1 KI270721.1 KI270734.1 
##          1          1          1          1          1          1 
##         MT          X          Y 
##         13        617          2

Most of the chromosome names will be familiar, i.e. 1-22, X, Y, MT. However, there are also a couple of other chromosomes with strange names. These are alternative or unfinished contigs. Quite often these represent complex regions of the genome, such as the great diversity of different haplotypes across the MHC region. Generally speaking we do not use these genes in our analyses. For the moment we will retain them, but they will most likely be removed during the following quality control steps.

Quality Control

There are 2 principal steps to quality control on single cell data:

Single-cell RNA sequencing data are inherently sparse, that means that for any given gene there will be many cells where there is no observed expression. This can arise for 2 reasons: transcriptional bursting, or stochastic dropout. The first of these is an interesting biological phenomenon, the other is a consequence of the low input material (each single cell contains ~10-30pg of RNA). In order to properly estimate the normalization factors for each cell we need to reduce the number of 0’s contained in each cell, without discarding too many genes as this is precious information. The easiest way is to just remove genes with all 0-values, i.e. no evidence of expression in any cell. We can also set a more conservative threshold, where a gene must be expressed in at least N cells.

We can judge the quality of a cell by several metrics:

To aid us in our quality control, we will use some functionality from scater, namely the calculateQCMetrics() function.

tcell.sce <- calculateQCMetrics(tcell.sce, feature_controls=list(Mito=which(rowData(tcell.sce)$chromosome_name == "MT")))
## Note that the names of some metrics have changed, see 'Renamed metrics' in ?calculateQCMetrics.
## Old names are currently maintained for back-compatibility, but may be removed in future releases.

The easiest way to assess the quality of our data is to graphicaly visualise them. We can plot the distribution of library sizes (i.e. total number of reads per cell), and the mitochondrial content of cells.

par(mfrow=c(1, 2))
hist(tcell.sce$log10_total_counts, breaks=20, col="grey80", xlab="Log-total UMI count", main="")

hist(tcell.sce$pct_counts_Mito, breaks=20, col="grey80", xlab="Proportion of reads in mitochondrial genes", main="")
abline(v=20, lty=2, col='purple')

We can see that the distribution of UMI counts (left plot) is unimodal and fairly well spread between 3.0 and ~4.0, which means that most cells contain > 1000 and < 10,000 UMIs. I don’t seen much call for removing cells on the basis that they are not sequenced deeply enough here. For other data sets there is a strong argument for removing cells based on sequencing depth; you don’t wan’t to retain cells with very little RNA or information as these may bias down-stream analyses. The right-hand plot shows the distribution of UMIs assigned to mitochondrial genes. Remember that high mitochondrial content is indicative of poor-quality cells. The vast majority of cells have quite a low content, so we’ll use a cut-off at 20% to eliminate poor quality cells (shown by the vertical purple line).

cell_sparsity <- apply(counts(tcell.sce) == 0, 2, sum)/nrow(counts(tcell.sce))
gene_sparsity <- apply(counts(tcell.sce) == 0, 1, sum)/ncol(counts(tcell.sce))

par(mfrow=c(1, 2))
hist(cell_sparsity, breaks=20, col="grey80", xlab="Cell sparsity", main="")

hist(gene_sparsity, breaks=20, col="grey80", xlab="Gene sparsity", main="")
abline(v=40, lty=2, col='purple')

The plot on the left shows the number of 0’s per cell. We can see that broadly cells have between 75% and 99% 0’s, this is a typical distribution, we only need to remove a few cells with > 98% 0’s. The gene-wise picture is a little different, namely there are many, many genes with almost no observations. We can instantly discard any gene with all 0’s, but what about some other genes? For the sake of reducing the computational burden, and increasing the stability of the normalization factor estimation, we’ll require a gene to be expressed in at least 10 cells.

So now that we have our QC criteria and thresholds we can remove problematic cells and genes.

# the order of cells and genes is the same as it is in the SCE object
sparse.cells <- cell_sparsity > 0.98
mito.cells <- tcell.sce$pct_counts_Mito > 20

min.cells <- 1 - 10/length(cell_sparsity)
sparse.genes <- gene_sparsity < min.cells

# add an annotation to the SCE object to remove bad cells
tcell.sce <- tcell.sce[, !sparse.cells | !mito.cells]

# remove the sparse genes, then re-set the counts and row data accordingly
nz.counts <- counts(tcell.sce)[sparse.genes, ]
nz.sce <- SingleCellExperiment(assays=list(counts=nz.counts))
colData(nz.sce) <- colData(tcell.sce)
rowData(nz.sce) <- rowData(tcell.sce)[sparse.genes, ]
nz.sce
## class: SingleCellExperiment 
## dim: 13010 6312 
## metadata(0):
## assays(1): counts
## rownames(13010): ENSG00000279457 ENSG00000228463 ...
##   ENSG00000278384 ENSG00000276345
## rowData names(16): ensembl_gene_id external_gene_name ...
##   n_cells_counts pct_dropout_counts
## colnames: NULL
## colData names(46): Sample Barcode ... total_features_Mito
##   log10_total_features_Mito
## reducedDimNames(0):
## spikeNames(0):

We now have an SCE object that contains the cells and genes that pass QC, in this case 13010 genes and 6312 cells.

Challenge: What are the consequences of altering the different QC thresholds? For instance, if we keep more cells with sparse data, how does this affect the size factor estimation and normalization?

Challenge: The calculateQCMetrics() function creates a data frame in the colData() slot of the SCE object. Explore some of the other QC metrices that it calculates. Can you find other metrics that you think are important for removing poor quality cells?

Normalization

We are going to apply the deconvolution method for estimating the size factors for normalization, found in Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. First the cells will be roughly clustered together, then we will repeateadly calculate within-pool size factors. This creates a system of linear equations that can solved to deconvolute the contribution of each cell to the pooled size factors. In this way we can get an estimate for the size factor for each cell, whilst remaining robust to the presence of many 0’s. If we tried to directly calculate size factors, for instance as is used in DESeq, we would have factors distorted by the large numbers of 0’s present in each cell.

The first step is to do the quick clustering. Because we have many cells, classical hierarchical clustering algorithms can be a little slow at this stage.
Consequently, we’ll use a graph-based algorithm that is much faster (this will be described in part 2). These functions are all found in the scran package.

We’re going to do this in two parts, first normalizing all of the cells together, then normalizing the samples separately. What do you think will be the consequence of these two different approaches?

clusters <- quickCluster(nz.sce, method="igraph")
table(clusters)
## clusters
##    1    2    3 
## 1596 2976 1740
nz.sce <- computeSumFactors(nz.sce, min.mean=0.1, cluster=clusters)
summary(sizeFactors(nz.sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1819  0.5136  0.7296  1.0000  1.1909 10.2789

We can check that the size factors are well correlated with the library sizes. We can also look at the effect of trying to estimate the size factors using the approaches designed for bulk RNA sequencing.

par(mfrow=c(1, 2))
plot(nz.sce$total_counts, sizeFactors(nz.sce), log="xy",
     xlab="Total library size", ylab="Deconvolution size factor")

hist(sizeFactors(nz.sce), breaks=20, col="grey80", xlab="Deconvolution size factors")

There should be a very strong correlation between the size factors and the total library size, and very few size factors should be close to 0. Now compare this to the bulk RNA-sequencing approach for calculating size factors.

library(DESeq2)
# set up a DESeq object
dds <- DESeqDataSetFromMatrix(countData=as(counts(nz.sce), "matrix"), colData=colData(nz.sce), 
                              design=~total_counts_endogenous)
dds <- estimateSizeFactors(dds)

par(mfrow=c(1, 2))
plot(nz.sce$total_counts, sizeFactors(dds),
     xlab="Total library size", ylab="Bulk size factor")

hist(sizeFactors(dds), breaks=20, col="grey80", xlab="Bulk size factors")