The first step in the analysis of single cell RNAseq data is to align the sequenced reads against a genomic reference and then use a transcriptome annotation to generate read counts for each feature of interest. Typically for scRNAseq the features of interest are genes.
There are a variety of tools for doing this and your choice will depend in part on the method by which the library was generated. For data generated using the 10x-Chromium method data the most common approach is to use the Cell Ranger tool provided by 10x. This not only carries out the alignment and feature counting, but will also:
Cell Ranger is computationally very intensive, you will not be able to run it on a laptop or standard desktop computer. You will need access to, for example, a high performance computing (HPC) cluster, a server or other cloud-based computational resource with sufficient power - talk to your local IT support.
Alternative methods include:
For the purposes of this course, seeing as we are working with 10x-Chromium derived data, we will use Cell Ranger. As detailed instructions are available on the Cell Ranger pages of the 10x website, this chapter will not be comprehensive in terms of all options, but should provide a brief overview.
Cell Ranger incorporates a number of tools for handling different components of the single cell RNAseq analysis. In this chapter we will be looking at the count
tool, which is used to align reads, quantify gene expression and call cells. Later in the course you will encounter the aggr
(aggregate) tool, which can be used to merge multiple samples into a single experiment-level data set.
In addition to the analysis tools, Cell Ranger also includes the mkref
tool to generate a custom Cell Ranger reference from genomic and transcriptomic references.
Cell Ranger runs on Linux, and full installation instructions can be found on the 10x website.
Cell Ranger, like other aligners, requires the information about the genome and transcriptome of interest to be provided in a specific format. If you are working with the standard genome/transcriptome of human or mouse then you can download prebuilt references from the 10x website (there is even a combined human-mouse genome reference, which is useful for patient-derived xenograft experiments).
If you require a reference for the genome of another species, or wish to add custom transcripts to the gene annotation (e.g. fusion genes), then you will need to build your own index using the Cell Ranger mkref
tool. In order to prepare our reference for Cell Ranger, we need two files:
These files can be downloaded from public repositories such as Ensembl (see EnsemblGenomes for non-vertebrate genomes). As an example for our course, we have downloaded one of the chromosomes from the human genome page. In a real-life scenario you would download the full genome, conventionally named species.version.dna.toplevel.fa.gz
(for example, Homo_sapiens.GRCh38.dna.toplevel.fa.gz
). We have also downloaded the GTF file with transcript annotation for our genome from the same site.
To index our genome, we can use the cellranger mkref
tool as (replacing {...}
with the relevant piece of information):
cellranger mkref \
--nthreads={CPUS} \
--genome={OUTPUT FOLDER FOR INDEX} \
--fasta={GENOME FASTA} \
--genes={ANNOTATION GTF}
One thing of note is that mkref
does not work with files compressed with gzip
(the format that Ensembl provides them as). Therefore, make sure to decompress the files first (you can do this using the gunzip file.gz
command).
Open VS Code and create a shell script named 01-prepare_reference.sh
to index the reference genome for Cell Ranger. The output from mkref
should be named cellranger_index
. The start of the script is given here, so you can copy/paste this to a new file and work from there.
#!/bin/bash
# change to directory with the FASTA and GTF files
cd Data/reference/
# run mkref
cellranger mkref FIXME
Run your script from the terminal using bash
(make sure you’re running it from the ~/Course_Materials
directory.
The complete script would be:
#!/bin/bash
# change to directory with the FASTA and GTF files
cd Data/reference/
# run mkref
cellranger mkref \
--nthreads=8 \
--genome=cellranger_index \
--fasta=Homo_sapiens.GRCh38.dna.chromosome.21.fa \
--genes=Homo_sapiens.GRCh38.104.chr21.gtf
If we save this script in our scripts
folder as 01-prepare_reference.sh
, we could then run it from the terminal with:
$ bash scripts/01-prepare_reference.sh
count
Full details for running the cellranger count
tool can be found on the 10x website. Here, we go through a typical analysis workflow with this tool.
To run Cell Ranger count
the fastq files for samples to be processed should be placed in a single directory. Cell Ranger will be run separately on each sample. You will need to provide the software with the sample name (e.g. SRR9264343) of the sample to be processed. Cell Ranger expects the file names of the fastq files to follow a specific convention so that it can correctly identify the files for the specific sample. If the files names of your fastqs do not match this convention you will need to rename them.
The convention is the default convention used by bcl2fastq
(the tool that converts the raw sequencing data in bcl
format into fastq
files):
<SampleName>_S<SampleNumber>_L00<Lane>_<Read>_001.fastq.gz
bcl2fastq
. This is not important for Cell Ranger, other than it indicates the end of the Sample Name, you can set all your samples to S1.e.g. for a single sample in the Caron data set we have:
SRR9264343_S0_L001_I1_001.fastq.gz
SRR9264343_S0_L001_R1_001.fastq.gz
SRR9264343_S0_L001_R2_001.fastq.gz
cellranger count
The minimum information require to run cellranger count
is:
This will process all fastq files in the --fastqs
directory into a single sample. If you have multiple samples in a single directory then you need to add:
In addition, Cell Ranger is very computationally intensive, you will usually be wanting to run it on a high performance cluster or server, and it is advisable to set limits to the resources it will use:
A complete command for processing on of the Caron samples might be:
cellranger count --id={OUTPUT_SAMPLE_NAME} \
--transcriptome={DIRECTORY_WITH_REFERENCE} \
--fastqs={DIRECTORY_WITH_FASTQ_FILES} \
--sample={NAME_OF_SAMPLE_IN_FASTQ_FILES} \
--localcores={NUMBER_OF_CPUS} \
--localmem={RAM_MEMORY}
Open VS Code and create a shell script named 02-cellranger.sh
to obtain gene counts per cell for one of our samples. Our FASTQ files are named with the “SRR9264343” prefix, which corresponds to one of the replicates from a “ETV6/RUNX1” cancer subtype. Run cellranger count
with the output named “ETV6_RUNX1_rep1”.
The start of the script is given here, so you can copy/paste this to a new file and work from there.
#!/bin/bash
# make output directory
mkdir -p results/cellranger
# change into that directory
cd results/cellranger
# run cellranger count (maximum CPUs 8; maximum RAM 24GB)
cellranger count FIXME
Here is a bit more detail about writing the cellranger count
command:
cellranger count \
--id=FIXME \
--transcriptome=../../Data/reference/cellranger_index \
--fastqs=FIXME \
--sample=FIXME \
--localcores=8 \
--localmem=24
Note that we have to write the paths relative to the results folder we create within the script. See the example below for the --transcriptome
option and see if you can adapt it also for specifying the path to the FASTQ folder.
Run your script from the terminal using bash
.
The complete script would be:
#!/bin/bash
# make output directory
mkdir -p results/cellranger
# change into that directory
cd results/cellranger
# run cellranger count (maximum CPUs 8; maximum RAM 24GB)
cellranger count \
--id=ETV6_RUNX1_rep1 \
--transcriptome=../../Data/reference/cellranger_index \
--fastqs=../../Data/reads/ \
--sample=SRR9264343 \
--localcores=8 \
--localmem=24
If we save this script in our scripts
folder as 02-cellranger.sh
, we could then run it from the terminal with:
$ bash scripts/02-cellranger.sh
Cell Ranger will create a single results folder for each sample names. Each folder will be named according to the --id
option in the command. In the top level of the directory is a sub-directory called outs
, this contains the results of the analysis pipeline. There are also many intermediate and log files/directories that will not be of immediate interest.
The contents of the sample directory will look like this: