1 Introduction

The first step in the analysis of single cell RNAseq data is to align the sequenced reads against a genomic reference and then use a transcriptome annotation to generate read counts for each feature of interest. Typically for scRNAseq the features of interest are genes.

There are a variety of tools for doing this and your choice will depend in part on the method by which the library was generated. For data generated using the 10x-Chromium method data the most common approach is to use the Cell Ranger tool provided by 10x. This not only carries out the alignment and feature counting, but will also:

Cell Ranger is computationally very intensive, you will not be able to run it on a laptop or standard desktop computer. You will need access to, for example, a high performance computing (HPC) cluster, a server or other cloud-based computational resource with sufficient power - talk to your local IT support.

Alternative methods include:

For the purposes of this course, seeing as we are working with 10x-Chromium derived data, we will use Cell Ranger. As detailed instructions are available on the Cell Ranger pages of the 10x website, this chapter will not be comprehensive in terms of all options, but should provide a brief overview.

2 10x Cell Ranger pipeline in brief

Cell Ranger incorporates a number of tools for handling different components of the single cell RNAseq analysis. In this chapter we will be looking at the count tool, which is used to align reads, quantify gene expression and call cells. Later in the course you will encounter the aggr (aggregate) tool, which can be used to merge multiple samples into a single experiment-level data set.

In addition to the analysis tools, Cell Ranger also includes the mkref tool to generate a custom Cell Ranger reference from genomic and transcriptomic references.

2.1 Installing Cell Ranger

Cell Ranger runs on Linux, and full installation instructions can be found on the 10x website.

2.2 Cell Ranger references

Cell Ranger, like other aligners, requires the information about the genome and transcriptome of interest to be provided in a specific format. If you are working with the standard genome/transcriptome of human or mouse then you can download prebuilt references from the 10x website (there is even a combined human-mouse genome reference, which is useful for patient-derived xenograft experiments).

If you require a reference for the genome of another species, or wish to add custom transcripts to the gene annotation (e.g. fusion genes), then you will need to build your own index using the Cell Ranger mkref tool. In order to prepare our reference for Cell Ranger, we need two files:

  • A reference genome in FASTA format
  • A transcript annotation file in GTF format

These files can be downloaded from public repositories such as Ensembl (see EnsemblGenomes for non-vertebrate genomes).

As an example for our course, we have downloaded one of the chromosomes from the human genome page. In a real-life scenario you would download the full genome, conventionally named species.version.dna.toplevel.fa.gz (for example, Homo_sapiens.GRCh38.dna.toplevel.fa.gz). We have also downloaded the GTF file with transcript annotation for our genome from the same site.

You can find these reference files in the directory Data/references.

To index our genome, we can use the cellranger mkref tool with the following arguments (replacing {...} with the relevant piece of information):

cellranger mkref \
  --fasta={GENOME FASTA} \
  --genes={ANNOTATION GTF} \
  --genome={OUTPUT FOLDER FOR INDEX} \
  --nthreads={CPUS}

where:

  • GENOME FASTA is a file containing the reference genome in FASTA format
  • ANNOTATION GTF is a file containing the transcript annotation file in GTF format
  • OUTPUT FOLDER FOR INDEX is a name for the output folder containing the new reference package (you do not need to create this folder)
  • CPUS - Is the number of CPUs we would like CellRanger to use. The more CPUs CellRanger can use, the faster the job (up to a point).

One thing of note is that mkref does not work with files compressed with gzip (the format that Ensembl provides them as). Therefore, make sure to decompress the files first (you can do this using the gunzip file.gz command).

Open Visual Studio Code and create a shell script named 01_prepare_reference.sh to index the reference genome for Cell Ranger.

You’ll need to give CellRanger the names of the two reference files in Data/references.

The output from mkref should be named cellranger_index.

You should tell CellRanger to use 7 CPUs.

The start of the script is given here, so you can copy this to a new file and work from there. You’ll need to replace the <<_YOUR_CODE_HERE_>> with the necessary arguments for the cellranger mkref command.

#!/bin/bash

# change to directory with the FASTA and GTF files
cd Data/references/

# run mkref
cellranger mkref\
  --fasta=<<_YOUR_CODE_HERE_>> \
  --genes=<<_YOUR_CODE_HERE_>> \
  --genome=<<_YOUR_CODE_HERE_>> \
  --nthreads=<<_YOUR_CODE_HERE_>>

Run your script from the terminal using bash (make sure you’re running it from the ~/Course_Materials directory).

Answer

The complete script would be:

#!/bin/bash

# change to directory with the FASTA and GTF files
cd Data/references/

# run mkref
cellranger mkref \
  --fasta=Homo_sapiens.GRCh38.dna.chromosome.21.fa \
  --genes=gencode.v41.primary_assembly.annotation.chr21.gtf \
  --genome=cellranger_index \
  --nthreads=7

If we save this script in our scripts folder as 01-prepare_reference.sh, we could then run it from the terminal with:

$ bash 01_prepare_reference.sh

2.3 Cell Ranger count

Full details for running the cellranger count tool can be found on the 10x website. Here, we go through a typical analysis workflow with this tool.

2.3.1 Preparing the raw fastq files

To run Cell Ranger count the fastq files for samples to be processed should be placed in a single directory. Cell Ranger will be run separately on each sample. You will need to provide the software with the sample name (e.g. SRR9264343) of the sample to be processed. Cell Ranger expects the file names of the fastq files to follow a specific convention so that it can correctly identify the files for the specific sample. If the files names of your fastqs do not match this convention you will need to rename them.

The convention is the default convention used by bcl2fastq (the tool that converts the raw sequencing data in bcl format into fastq files):

<SampleName>_S<SampleNumber>_L00<Lane>_<Read>_001.fastq.gz
  • <SampleName> - An identifier for the sample, this is what Cell Ranger uses to determine which fastq files to combine into a single sample.
  • <SampleNumber> - This is the sample number based on the order that samples were listed in the sample sheet used when running bcl2fastq. This is not important for Cell Ranger, other than it indicates the end of the Sample Name, you can set all your samples to S1.
  • <Lane> - The lane number. If your sequencing was run across multiple lanes, then you may have multiple sets of fastqs for a single sample with different lane numbers.
  • <Read> - The read type: R1 for Read 1, R2 for Read 2, and index reads are I1 and I2.
  • 001 - The last segment is always 001.

e.g. for a single sample in the Caron data set we have:

    SRR9264343_S0_L001_I1_001.fastq.gz
    SRR9264343_S0_L001_R1_001.fastq.gz
    SRR9264343_S0_L001_R2_001.fastq.gz

2.3.2 Running cellranger count

The minimum information require to run cellranger count is:

  • --id - A sample ID. This is used for naming the outputs
  • --transcriptome - the directory containing the Cell Ranger reference
  • --fastqs - the directory containing the fastq files

This will process all fastq files in the --fastqs directory into a single sample. If you have multiple samples in a single directory then you need to add:

  • --sample - the SampleName from the fastq files as above.

In addition, Cell Ranger is very computationally intensive, you will usually be wanting to run it on a high performance cluster or server. It will greedily attempt to use all resources it can find available, and so it is advisable to set limits to the resources it will use:

  • --localcores - the number of processors Cell Ranger should use
  • --localmem - the amount of memory, in Gigabytes, Cell Ranger should use.

A complete command for processing on of the Caron samples might be:

cellranger count --id={OUTPUT_SAMPLE_NAME} \
                 --transcriptome={DIRECTORY_WITH_REFERENCE} \
                 --fastqs={DIRECTORY_WITH_FASTQ_FILES} \
                 --sample={NAME_OF_SAMPLE_IN_FASTQ_FILES} \
                 --localcores={NUMBER_OF_CPUS} \
                 --localmem={RAM_MEMORY}

Open Visual Studio Code and create a shell script named 02_cellranger_count.sh to obtain gene counts per cell for one of our samples. Our FASTQ files are named with the “SRR9264343” prefix, which corresponds to one of the replicates from a “ETV6_RUNX1” cancer subtype, and are located in Data/fastq.

Run cellranger count using the fastq files for this sample and have Cell Ranger name the output directory “ETV6_RUNX1_rep1”.

The start of the script is given here, so you can copy/paste this to a new file and work from there.

You will also need to give Cell Ranger the path to the cell_ranger index that we just generated.

Finally, set the number of cores for Cell Ranger to use to 7 and the local memory limit to 24 (Gb).

#!/bin/bash

# run cellranger count (maximum CPUs 8; maximum RAM 24GB)
cellranger count <<_YOUR_CODE_HERE_>>
Hint

Here is a bit more detail about writing the cellranger count command:

cellranger count \
  --id=<<_OUTPUT_SAMPLE_NAME_>> \
  --transcriptome=Data/references/cellranger_index \
  --fastqs=<<_DIRECTORY_WITH_FASTQ_FILES_>> \
  --sample=<<_NAME_OF_SAMPLE_IN_FASTQ_FILES_>> \
  --localcores=8 \
  --localmem=24

Note that we have to write the paths relative to the results folder we create within the script. See the example below for the --transcriptome option and see if you can adapt it also for specifying the path to the FASTQ folder.

Run your script from the terminal using bash.

Answer

The complete script would be:

#!/bin/bash

# run cellranger count (maximum CPUs 8; maximum RAM 24GB)
cellranger count \
  --id=ETV6_RUNX1_rep1 \
  --transcriptome=Data/references/cellranger_index \
  --fastqs=Data/fastq/ \
  --sample=SRR9264343 \
  --localcores=8 \
  --localmem=24

If we save this script in our scripts folder as 02-cellranger.sh, we could then run it from the terminal with:

$ bash 02_cellranger_count.sh

3 Cell Ranger outputs

Cell Ranger will create a single results folder for each sample names. Each folder will be named according to the --id option in the command. In the top level of the directory is a sub-directory called outs, this contains the results of the analysis pipeline. There are also many intermediate and log files/directories that will not be of immediate interest.

The contents of the sample directory will look like this:

The contents of the outs directory are:

The two count matrix directories each contain 3 files:

The count matrix directories and their corresponding HDF5 files contain the same information, they are just alternative formats for use depending on the tools that are used for analysis. In this course we will be loading the contents of the count matrix directories into R.

The filtered count matrix only contains droplets that have been called as cells by Cell Ranger.

3.1 The summary report

The Cell Ranger summary report - web_summary.html - is a very useful first port of call for assessing the quality of the data. An example is linked here (See also here).

The first tab, Summary, contains various QC metrics about sequencing quality, mapping quality and cell calling. The second tab, Analysis, provides some basic analysis of droplets that Cell Ranger has identified as potentially containing cells, including some clustering and gene expression analysis. This report is interactive, allowing you to some very basic data exploration.

The report itself contains brief explanations of the contents of each section. These can be accessed by clicking the question mark icon in the header of the section. A more comprehensive explanation of the various components of the report can be found on the 10x website.

3.2 The cloupe file and the Loupe browser

The cloupe.cloupe file that Cell Ranger generates can be loaded using the Loupe Browser application provided by 10x. This provides an interactive visualization and analysis platform that allows you to extensively explore your data with a point and click interface.

The browser will allow you, for example, to find cells of interest through filtering and clustering tools, identify genes whose expression characterizes particular clusters of cells, or identify the type of cells in different clusters based on gene expression. While the browser is an excellent tool for data exploration and identification of potential avenues of analysis, it should not be used as a substitute for proper bioinformatic analysis. In particular, it should be remembered that Cell Ranger has not performed and QC on the contents of the droplets it has called as cells - in the next section of the course we will perform additional QC to eliminate poor quality cells.

3.3 The count matrices

3.3.1 What are the counts?

Remember that for each read we have a sample barcode, a cell barcode, and a UMI. In addition we now have location on the genome and a corresponding gene annotation. The count for any gene for any cell is the number of unique reads detected for the combination:

sample barcode + cell barcode + UMI + gene ID

Thus we avoid counting PCR duplicate reads as these will have the same UMI. For this reason we more commonly talk about the “UMI count” rather than the “Read count”; the “Read count” could be higher than the “UMI count”

3.3.2 Filtered vs. Raw matrices

Cell Ranger performs some analysis on the data to distinguish between droplets that contain cells and empty droplets. The filtered counts matrices only contain data for droplets that Cell Ranger has determined to contain cells. It performs this filtering in two steps.

  1. UMI threshold based filtering:
    By default, Cell Ranger expects 3000 cells (you can change this in the arguments of the Cell Ranger command). Cell Ranger takes the top 3000 droplets by UMI counts. It then takes the UMI count at the 99th percentile of this set of cells. It then divides this number of UMIs by 10 to determine a hard UMI threshold. All droplets with a UMI count above this threshold are called as cells.

  2. Comparison of droplet RNA profile with background:
    This algorithm is based on one initially developed in the dropletUtils R package. Cell Ranger selects all droplets with a UMI count of less than 500 (this will usually be 100’s of thousands). This set of droplets is assumed to definitely not contain cells and as such any RNA detected is merely background within the media (due to damaged cells leaking RNA). An RNA profile is derived for this background. The RNA profile of all droplets with more than 500 UMIs but less than the hard UMI threshold is then compared to this background profile. Any droplet with a profile that differs significantly from the background profile is called as a cell.

A detailed description of the Cell Calling algorithm, along with explanations of the other algorithms used in the alignment and countings steps, can be found on the 10x Cell Ranger pages

The plot below shows a “barcode rank plot”. This is taken from the Cell Ranger web_summary.html report.

This plot shows the UMI counts for all droplets, the droplets have been ranked by UMI count and coloured according to whether they have been called as cells or not. The light blue region, indicates a region where some droplets have been called as cells, but other have not - depending on the outcome of the RNA profiling step of the algorithm.