The first step in the analysis of single cell RNAseq data is to align the sequenced reads against a genomic reference and then use a transcriptome annotation to generate read counts for each feature of interest. Typically for scRNAseq the features of interest are genes.
There are a variety of tools for doing this and your choice will depend in part on the method by which the library was generated. For data generated using the 10x-Chromium method data the most common approach is to use the Cell Ranger tool provided by 10x. This not only carries out the alignment and feature counting, but will also:
Cell Ranger is computationally very intensive, you will not be able to run it on a laptop or standard desktop computer. You will need access to, for example, a high performance computing (HPC) cluster, a server or other cloud-based computational resource with sufficient power - talk to your local IT support.
Alternative methods include:
For the purposes of this course, seeing as we are working with 10x-Chromium derived data, we will use Cell Ranger. As detailed instructions are available on the Cell Ranger pages of the 10x website, this chapter will not be comprehensive in terms of all options, but should provide a brief overview.
Cell Ranger incorporates a number of tools for handling different components of the single cell RNAseq analysis. In this chapter we will be looking at the count
tool, which is used to align reads, quantify gene expression and call cells. Later in the course you will encounter the aggr
(aggregate) tool, which can be used to merge multiple samples into a single experiment-level data set.
In addition to the analysis tools, Cell Ranger also includes the mkref
tool to generate a custom Cell Ranger reference from genomic and transcriptomic references.
Cell Ranger runs on Linux, and full installation instructions can be found on the 10x website.
Cell Ranger, like other aligners, requires the information about the genome and transcriptome of interest to be provided in a specific format. If you are working with the standard genome/transcriptome of human or mouse then you can download prebuilt references from the 10x website (there is even a combined human-mouse genome reference, which is useful for patient-derived xenograft experiments).
If you require a reference for the genome of another species, or wish to add custom transcripts to the gene annotation (e.g. fusion genes), then you will need to build your own index using the Cell Ranger mkref
tool. In order to prepare our reference for Cell Ranger, we need two files:
These files can be downloaded from public repositories such as Ensembl (see EnsemblGenomes for non-vertebrate genomes). As an example for our course, we have downloaded one of the chromosomes from the human genome page. In a real-life scenario you would download the full genome, conventionally named species.version.dna.toplevel.fa.gz
(for example, Homo_sapiens.GRCh38.dna.toplevel.fa.gz
). We have also downloaded the GTF file with transcript annotation for our genome from the same site.
To index our genome, we can use the cellranger mkref
tool as (replacing {...}
with the relevant piece of information):
cellranger mkref \
--nthreads={CPUS} \
--genome={OUTPUT FOLDER FOR INDEX} \
--fasta={GENOME FASTA} \
--genes={ANNOTATION GTF}
One thing of note is that mkref
does not work with files compressed with gzip
(the format that Ensembl provides them as). Therefore, make sure to decompress the files first (you can do this using the gunzip file.gz
command).
Open VS Code and create a shell script named 01-prepare_reference.sh
to index the reference genome for Cell Ranger. The output from mkref
should be named cellranger_index
. The start of the script is given here, so you can copy/paste this to a new file and work from there.
#!/bin/bash
# change to directory with the FASTA and GTF files
cd Data/reference/
# run mkref
cellranger mkref FIXME
Run your script from the terminal using bash
(make sure you’re running it from the ~/Course_Materials
directory.
The complete script would be:
#!/bin/bash
# change to directory with the FASTA and GTF files
cd Data/reference/
# run mkref
cellranger mkref \
--nthreads=8 \
--genome=cellranger_index \
--fasta=Homo_sapiens.GRCh38.dna.chromosome.21.fa \
--genes=Homo_sapiens.GRCh38.104.chr21.gtf
If we save this script in our scripts
folder as 01-prepare_reference.sh
, we could then run it from the terminal with:
$ bash scripts/01-prepare_reference.sh
count
Full details for running the cellranger count
tool can be found on the 10x website. Here, we go through a typical analysis workflow with this tool.
To run Cell Ranger count
the fastq files for samples to be processed should be placed in a single directory. Cell Ranger will be run separately on each sample. You will need to provide the software with the sample name (e.g. SRR9264343) of the sample to be processed. Cell Ranger expects the file names of the fastq files to follow a specific convention so that it can correctly identify the files for the specific sample. If the files names of your fastqs do not match this convention you will need to rename them.
The convention is the default convention used by bcl2fastq
(the tool that converts the raw sequencing data in bcl
format into fastq
files):
<SampleName>_S<SampleNumber>_L00<Lane>_<Read>_001.fastq.gz
bcl2fastq
. This is not important for Cell Ranger, other than it indicates the end of the Sample Name, you can set all your samples to S1.e.g. for a single sample in the Caron data set we have:
SRR9264343_S0_L001_I1_001.fastq.gz
SRR9264343_S0_L001_R1_001.fastq.gz
SRR9264343_S0_L001_R2_001.fastq.gz
cellranger count
The minimum information require to run cellranger count
is:
This will process all fastq files in the --fastqs
directory into a single sample. If you have multiple samples in a single directory then you need to add:
In addition, Cell Ranger is very computationally intensive, you will usually be wanting to run it on a high performance cluster or server, and it is advisable to set limits to the resources it will use:
A complete command for processing on of the Caron samples might be:
cellranger count --id={OUTPUT_SAMPLE_NAME} \
--transcriptome={DIRECTORY_WITH_REFERENCE} \
--fastqs={DIRECTORY_WITH_FASTQ_FILES} \
--sample={NAME_OF_SAMPLE_IN_FASTQ_FILES} \
--localcores={NUMBER_OF_CPUS} \
--localmem={RAM_MEMORY}
Open VS Code and create a shell script named 02-cellranger.sh
to obtain gene counts per cell for one of our samples. Our FASTQ files are named with the “SRR9264343” prefix, which corresponds to one of the replicates from a “ETV6/RUNX1” cancer subtype. Run cellranger count
with the output named “ETV6_RUNX1_rep1”.
The start of the script is given here, so you can copy/paste this to a new file and work from there.
#!/bin/bash
# make output directory
mkdir -p results/cellranger
# change into that directory
cd results/cellranger
# run cellranger count (maximum CPUs 8; maximum RAM 24GB)
cellranger count FIXME
Here is a bit more detail about writing the cellranger count
command:
cellranger count \
--id=FIXME \
--transcriptome=../../Data/reference/cellranger_index \
--fastqs=FIXME \
--sample=FIXME \
--localcores=8 \
--localmem=24
Note that we have to write the paths relative to the results folder we create within the script. See the example below for the --transcriptome
option and see if you can adapt it also for specifying the path to the FASTQ folder.
Run your script from the terminal using bash
.
The complete script would be:
#!/bin/bash
# make output directory
mkdir -p results/cellranger
# change into that directory
cd results/cellranger
# run cellranger count (maximum CPUs 8; maximum RAM 24GB)
cellranger count \
--id=ETV6_RUNX1_rep1 \
--transcriptome=../../Data/reference/cellranger_index \
--fastqs=../../Data/reads/ \
--sample=SRR9264343 \
--localcores=8 \
--localmem=24
If we save this script in our scripts
folder as 02-cellranger.sh
, we could then run it from the terminal with:
$ bash scripts/02-cellranger.sh
Cell Ranger will create a single results folder for each sample names. Each folder will be named according to the --id
option in the command. In the top level of the directory is a sub-directory called outs
, this contains the results of the analysis pipeline. There are also many intermediate and log files/directories that will not be of immediate interest.
The contents of the sample directory will look like this:
The contents of the outs
directory are:
The two count matrix directories each contain 3 files:
The count matrix directories and their corresponding HDF5 files contain the same information, they are just alternative formats for use depending on the tools that are used for analysis. In this course we will be loading the contents of the count matrix directories into R.
The filtered count matrix only contains droplets that have been called as cells by Cell Ranger.
The Cell Ranger summary report - web_summary.html
- is a very useful first port of call for assessing the quality of the data. An example is linked here (See also here)).
The first tab, Summary, contains various QC metrics about sequencing quality, mapping quality and cell calling. The second tab, Analysis, provides some basic analysis of droplets that Cell Ranger has identified as potentially containing cells, including some clustering and gene expression analysis. This report is interactive, allowing you to some very basic data exploration.
The report itself contains brief explanations of the contents of each section. These can be accessed by clicking the question mark icon in the header of the section. A more comprehensive explanation of the various components of the report can be found on the 10x website.
The cloupe.cloupe
file that Cell Ranger generates can be loaded using the Loupe Browser application provided by 10x. This provides an interactive visualization and analysis platform that allows you to extensively explore your data with a point and click interface.
The browser will allow you, for example, to find cells of interest through filtering and clustering tools, identify genes whose expression characterizes particular clusters of cells, or identify the type of cells in different clusters based on gene expression. While the browser is an excellent tool for data exploration and identification of potential avenues of analysis, it should not be used as a substitute for proper bioinformatic analysis. In particular, it should be remembered that Cell Ranger has not performed and QC on the contents of the droplets it has called as cells - in the next section of the course we will perform additional QC to eliminate poor quality cells.
Remember that for each read we have a sample barcode, a cell barcode, and a UMI. In addition we now have location on the genome and a corresponding gene annotation. The count for any gene for any cell is the number of unique reads detected for the combination:
sample barcode + cell barcode + UMI + gene ID
Thus we avoid counting PCR duplicate reads as these will have the same UMI. For this reason we more commonly talk about the “UMI count” rather than the “Read count”; the “Read count” could be higher than the “UMI count”
Cell Ranger performs some analysis on the data to distinguish between droplets that contain cells and empty droplets. The filtered counts matrices only contain data for droplets that Cell Ranger has determined to contain cells. It performs this filtering in two steps.
UMI threshold based filtering:
By default, Cell Ranger expects 3000 cells (you can change this in the arguments of the Cell Ranger command). Cell Ranger takes the top 3000 droplets by UMI counts. It then takes the UMI count at the 99th percentile of this set of cells. It then divides this number of UMIs by 10 to determine a hard UMI threshold. All droplets with a UMI count above this threshold are called as cells.
Comparison of droplet RNA profile with background:
This algorithm is based on one initially developed in the dropletUtils
R package. Cell Ranger selects all droplets with a UMI count of less than about 100 (this will usually be 100’s of thousands). This set of droplets is assumed to definitely not contain cells and as such any RNA detected is merely background within the media (due to damaged cells leaking RNA). An RNA profile is derived for this background. The RNA profile of all droplets with more than 100 UMIs but less than the hard UMI threshold is then compared to this background profile. Any droplet with a profile that differs significantly from the background profile is called as a cell.
A detailed description of the Cell Calling algorithm, along with explanations of the other algorithms used in the alignment and countings steps, can be found on the 10x Cell Ranger pages
The plot below shows a “barcode rank plot”. This is taken from the Cell Ranger web_summary.html
report.
This plot shows the UMI counts for all droplets, the droplets have been ranked by UMI count and coloured according to whether they have been called as cells or not. The light blue region, indicates a region where some droplets have been called as cells, but other have not - depending on the outcome of the RNA profiling step of the algorithm.