1 Introduction

The first step in the analysis of single cell RNAseq data is to align the sequenced reads against a genomic reference and then use a transcriptome annotation to generate read counts for each feature of interest. Typically for scRNAseq the features of interest are genes.

There are a variety of tools for doing this and your choice will depend in part on the method by which the library was generated. For data generated using the 10x-Chromium method data the most common approach is to use the Cell Ranger tool provided by 10x. This not only carries out the alignment and feature counting, but will also:

Cell Ranger is computationally very intensive, you will not be able to run it on a laptop or standard desktop computer. You will need access to, for example, a high performance computing (HPC) cluster, a server or other cloud-based computational resource with sufficient power - talk to your local IT support.

Alternative methods include:

For the purposes of this course, seeing as we are working with 10x-Chromium derived data, we will use Cell Ranger. As detailed instructions are available on the Cell Ranger pages of the 10x website, this chapter will not be comprehensive in terms of all options, but should provide a brief overview.

2 10x Cell Ranger pipeline in brief

Cell Ranger incorporates a number of tools for handling different components of the single cell RNAseq analysis. In this chapter we will be looking at the count tool, which is used to align reads, quantify gene expression and call cells. Later in the course you will encounter the aggr (aggregate) tool, which can be used to merge multiple samples into a single experiment-level data set.

In addition to the analysis tools, Cell Ranger also includes the mkref tool to generate a custom Cell Ranger reference from genomic and transcriptomic references.

2.1 Installing Cell Ranger

Cell Ranger runs on Linux, and full installation instructions can be found on the 10x website.

2.2 Cell Ranger references

Cell Ranger, like other aligners, requires the information about the genome and transcriptome of interest to be provided in a specific format. If you are working with the standard genome/transcriptome of human or mouse then you can download prebuilt references from the 10x website (there is even a combined human-mouse genome reference, which is useful for patient-derived xenograft experiments).

If you require a reference for the genome of another species, or wish to add custom transcripts to the gene annotation (e.g. fusion genes), then you will need to build your own index using the Cell Ranger mkref tool. In order to prepare our reference for Cell Ranger, we need two files:

  • A reference genome in FASTA format
  • A transcript annotation file in GTF format

These files can be downloaded from public repositories such as Ensembl (see EnsemblGenomes for non-vertebrate genomes). As an example for our course, we have downloaded one of the chromosomes from the human genome page. In a real-life scenario you would download the full genome, conventionally named species.version.dna.toplevel.fa.gz (for example, Homo_sapiens.GRCh38.dna.toplevel.fa.gz). We have also downloaded the GTF file with transcript annotation for our genome from the same site.

To index our genome, we can use the cellranger mkref tool as (replacing {...} with the relevant piece of information):

cellranger mkref \
  --nthreads={CPUS} \
  --genome={OUTPUT FOLDER FOR INDEX} \
  --fasta={GENOME FASTA} \
  --genes={ANNOTATION GTF}

One thing of note is that mkref does not work with files compressed with gzip (the format that Ensembl provides them as). Therefore, make sure to decompress the files first (you can do this using the gunzip file.gz command).

Open VS Code and create a shell script named 01-prepare_reference.sh to index the reference genome for Cell Ranger. The output from mkref should be named cellranger_index. The start of the script is given here, so you can copy/paste this to a new file and work from there.

#!/bin/bash

# change to directory with the FASTA and GTF files
cd Data/reference/

# run mkref
cellranger mkref FIXME

Run your script from the terminal using bash (make sure you’re running it from the ~/Course_Materials directory.

Answer

The complete script would be:

#!/bin/bash

# change to directory with the FASTA and GTF files
cd Data/reference/

# run mkref
cellranger mkref \
  --nthreads=8 \
  --genome=cellranger_index \
  --fasta=Homo_sapiens.GRCh38.dna.chromosome.21.fa \
  --genes=Homo_sapiens.GRCh38.104.chr21.gtf

If we save this script in our scripts folder as 01-prepare_reference.sh, we could then run it from the terminal with:

$ bash scripts/01-prepare_reference.sh

2.3 Cell Ranger count

Full details for running the cellranger count tool can be found on the 10x website. Here, we go through a typical analysis workflow with this tool.

2.3.1 Preparing the raw fastq files

To run Cell Ranger count the fastq files for samples to be processed should be placed in a single directory. Cell Ranger will be run separately on each sample. You will need to provide the software with the sample name (e.g. SRR9264343) of the sample to be processed. Cell Ranger expects the file names of the fastq files to follow a specific convention so that it can correctly identify the files for the specific sample. If the files names of your fastqs do not match this convention you will need to rename them.

The convention is the default convention used by bcl2fastq (the tool that converts the raw sequencing data in bcl format into fastq files):

<SampleName>_S<SampleNumber>_L00<Lane>_<Read>_001.fastq.gz
  • <SampleName> - An identifier for the sample, this is what Cell Ranger uses to determine which fastq files to combine into a single sample.
  • <SampleNumber> - This is the sample number based on the order that samples were listed in the sample sheet used when running bcl2fastq. This is not important for Cell Ranger, other than it indicates the end of the Sample Name, you can set all your samples to S1.
  • <Lane> - The lane number. If your sequencing was run across multiple lanes, then you may have multiple sets of fastqs for a single sample with different lane numbers.
  • <Read> - The read type: R1 for Read 1, R2 for Read 2, and index reads are I1 and I2.
  • 001 - The last segment is always 001.

e.g. for a single sample in the Caron data set we have:

    SRR9264343_S0_L001_I1_001.fastq.gz
    SRR9264343_S0_L001_R1_001.fastq.gz
    SRR9264343_S0_L001_R2_001.fastq.gz

2.3.2 Running cellranger count

The minimum information require to run cellranger count is:

  • --id - A sample ID. This is used for naming the outputs
  • --transcriptome - the directory containing the Cell Ranger reference
  • --fastqs - the directory containing the fastq files

This will process all fastq files in the --fastqs directory into a single sample. If you have multiple samples in a single directory then you need to add:

  • --sample - the SampleName from the fastq files as above.

In addition, Cell Ranger is very computationally intensive, you will usually be wanting to run it on a high performance cluster or server, and it is advisable to set limits to the resources it will use:

  • --localcores - the number of processors Cell Ranger should use
  • --localmem - the amount of memory, in Gigabytes, Cell Ranger should use.

A complete command for processing on of the Caron samples might be:

cellranger count --id={OUTPUT_SAMPLE_NAME} \
                 --transcriptome={DIRECTORY_WITH_REFERENCE} \
                 --fastqs={DIRECTORY_WITH_FASTQ_FILES} \
                 --sample={NAME_OF_SAMPLE_IN_FASTQ_FILES} \
                 --localcores={NUMBER_OF_CPUS} \
                 --localmem={RAM_MEMORY}

Open VS Code and create a shell script named 02-cellranger.sh to obtain gene counts per cell for one of our samples. Our FASTQ files are named with the “SRR9264343” prefix, which corresponds to one of the replicates from a “ETV6/RUNX1” cancer subtype. Run cellranger count with the output named “ETV6_RUNX1_rep1”.

The start of the script is given here, so you can copy/paste this to a new file and work from there.

#!/bin/bash

# make output directory
mkdir -p results/cellranger

# change into that directory
cd results/cellranger

# run cellranger count (maximum CPUs 8; maximum RAM 24GB)
cellranger count FIXME
Hint

Here is a bit more detail about writing the cellranger count command:

cellranger count \
  --id=FIXME \
  --transcriptome=../../Data/reference/cellranger_index \
  --fastqs=FIXME \
  --sample=FIXME \
  --localcores=8 \
  --localmem=24

Note that we have to write the paths relative to the results folder we create within the script. See the example below for the --transcriptome option and see if you can adapt it also for specifying the path to the FASTQ folder.

Run your script from the terminal using bash.

Answer

The complete script would be:

#!/bin/bash

# make output directory
mkdir -p results/cellranger

# change into that directory
cd results/cellranger

# run cellranger count (maximum CPUs 8; maximum RAM 24GB)
cellranger count \
  --id=ETV6_RUNX1_rep1 \
  --transcriptome=../../Data/reference/cellranger_index \
  --fastqs=../../Data/reads/ \
  --sample=SRR9264343 \
  --localcores=8 \
  --localmem=24

If we save this script in our scripts folder as 02-cellranger.sh, we could then run it from the terminal with:

$ bash scripts/02-cellranger.sh

3 Cell Ranger outputs

Cell Ranger will create a single results folder for each sample names. Each folder will be named according to the --id option in the command. In the top level of the directory is a sub-directory called outs, this contains the results of the analysis pipeline. There are also many intermediate and log files/directories that will not be of immediate interest.

The contents of the sample directory will look like this: