The first step in the analysis of single cell RNAseq data is to align the sequenced reads against a genomic reference and then use a transcriptome annotation to generate read counts for each feature of interest. Typically for scRNAseq the features of interest are genes.
There are a variety of tools for doing this and your choice will depend in part on the method by which the library was generated. For data generated using the 10x-Chromium method data the most common approach is to use the Cell Ranger tool provided by 10x. This not only carries out the alignment and feature counting, but will also:
Cell Ranger is computationally very intensive, you will not be able to run it on a laptop or standard desktop computer. You will need access to, for example, a high performance computing (HPC) cluster, a server or other cloud-based computational resource with sufficient power - talk to your local IT support.
Alternative methods include:
For the purposes of this course, seeing as we are working with 10x-Chromium derived data, we will use Cell Ranger. As detailed instructions are available on the Cell Ranger pages of the 10x website, this chapter will not be comprehensive in terms of all options, but should provide a brief overview.
Cell Ranger incorporates a number of tools for handling different
components of the single cell RNAseq analysis. In this chapter we will
be looking at the count
tool, which is used to align reads,
quantify gene expression and call cells. Later in the course you will
encounter the aggr
(aggregate) tool, which can be used to
merge multiple samples into a single experiment-level data set.
In addition to the analysis tools, Cell Ranger also includes the
mkref
tool to generate a custom Cell Ranger reference from
genomic and transcriptomic references.
Cell Ranger runs on Linux, and full installation instructions can be found on the 10x website.
Cell Ranger, like other aligners, requires the information about the genome and transcriptome of interest to be provided in a specific format. If you are working with the standard genome/transcriptome of human or mouse then you can download prebuilt references from the 10x website (there is even a combined human-mouse genome reference, which is useful for patient-derived xenograft experiments).
If you require a reference for the genome of another species, or wish
to add custom transcripts to the gene annotation (e.g. fusion genes),
then you will need to build your own index using the Cell Ranger
mkref
tool. In order to prepare our reference for Cell
Ranger, we need two files:
These files can be downloaded from public repositories such as Ensembl (see EnsemblGenomes for non-vertebrate genomes).
As an example for our course, we have downloaded one of the
chromosomes from the human genome
page. In a real-life scenario you would download the full genome,
conventionally named species.version.dna.toplevel.fa.gz
(for example, Homo_sapiens.GRCh38.dna.toplevel.fa.gz
). We
have also downloaded the GTF file with transcript annotation for our
genome from the same site.
You can find these reference files in the directory
Data/references
.
To index our genome, we can use the cellranger mkref
tool with the following arguments (replacing {...}
with the
relevant piece of information):
cellranger mkref \
--fasta={GENOME FASTA} \
--genes={ANNOTATION GTF} \
--genome={OUTPUT FOLDER FOR INDEX} \
--nthreads={CPUS}
where:
GENOME FASTA
is a file containing the reference genome
in FASTA formatANNOTATION GTF
is a file containing the transcript
annotation file in GTF formatOUTPUT FOLDER FOR INDEX
is a name for the output folder
containing the new reference package (you do not need to create this
folder)CPUS
- Is the number of CPUs we would like CellRanger
to use. The more CPUs CellRanger can use, the faster the job (up to a
point).One thing of note is that mkref
does not work with files
compressed with gzip
(the format that Ensembl provides them
as). Therefore, make sure to decompress the files first (you can do this
using the gunzip file.gz
command).
Open Visual Studio Code and create a shell script named
01_prepare_reference.sh
to index the reference genome for
Cell Ranger.
You’ll need to give CellRanger the names of the two reference files
in Data/references
.
The output from mkref
should be named
cellranger_index
.
You should tell CellRanger to use 7 CPUs.
The start of the script is given here, so you can copy this to a new
file and work from there. You’ll need to replace the
<<_YOUR_CODE_HERE_>> with the necessary
arguments for the cellranger mkref
command.
#!/bin/bash
# change to directory with the FASTA and GTF files
cd Data/references/
# run mkref
cellranger mkref\
--fasta=<<_YOUR_CODE_HERE_>> \
--genes=<<_YOUR_CODE_HERE_>> \
--genome=<<_YOUR_CODE_HERE_>> \
--nthreads=<<_YOUR_CODE_HERE_>>
Run your script from the terminal using bash
(make sure
you’re running it from the ~/Course_Materials
directory).
The complete script would be:
#!/bin/bash
# change to directory with the FASTA and GTF files
cd Data/references/
# run mkref
cellranger mkref \
--fasta=Homo_sapiens.GRCh38.dna.chromosome.21.fa \
--genes=gencode.v41.primary_assembly.annotation.chr21.gtf \
--genome=cellranger_index \
--nthreads=7
If we save this script in our scripts
folder as
01-prepare_reference.sh
, we could then run it from the
terminal with:
$ bash 01_prepare_reference.sh
count
Full details for running the cellranger count
tool can
be found on the 10x
website. Here, we go through a typical analysis workflow with this
tool.
To run Cell Ranger count
the fastq files for samples to
be processed should be placed in a single directory. Cell Ranger will be
run separately on each sample. You will need to provide the software
with the sample name (e.g. SRR9264343) of the sample to be processed.
Cell Ranger expects the file names of the fastq files to follow a
specific convention so that it can correctly identify the files for the
specific sample. If the files names of your fastqs do not match this
convention you will need to rename them.
The convention is the default convention used by
bcl2fastq
(the tool that converts the raw sequencing data
in bcl
format into fastq
files):
<SampleName>_S<SampleNumber>_L00<Lane>_<Read>_001.fastq.gz
bcl2fastq
. This is not important for Cell
Ranger, other than it indicates the end of the Sample Name, you can set
all your samples to S1.e.g. for a single sample in the Caron data set we have:
SRR9264343_S0_L001_I1_001.fastq.gz
SRR9264343_S0_L001_R1_001.fastq.gz
SRR9264343_S0_L001_R2_001.fastq.gz
cellranger count
The minimum information require to run cellranger count
is:
This will process all fastq files in the --fastqs
directory into a single sample. If you have multiple samples in a single
directory then you need to add:
In addition, Cell Ranger is very computationally intensive, you will usually be wanting to run it on a high performance cluster or server. It will greedily attempt to use all resources it can find available, and so it is advisable to set limits to the resources it will use:
A complete command for processing on of the Caron samples might be:
cellranger count --id={OUTPUT_SAMPLE_NAME} \
--transcriptome={DIRECTORY_WITH_REFERENCE} \
--fastqs={DIRECTORY_WITH_FASTQ_FILES} \
--sample={NAME_OF_SAMPLE_IN_FASTQ_FILES} \
--localcores={NUMBER_OF_CPUS} \
--localmem={RAM_MEMORY}
Open Visual Studio Code and create a shell script named
02_cellranger_count.sh
to obtain gene counts per cell for
one of our samples. Our FASTQ files are named with the “SRR9264343”
prefix, which corresponds to one of the replicates from a “ETV6_RUNX1”
cancer subtype, and are located in Data/fastq
.
Run cellranger count
using the fastq files for this
sample and have Cell Ranger name the output directory
“ETV6_RUNX1_rep1”.
The start of the script is given here, so you can copy/paste this to a new file and work from there.
You will also need to give Cell Ranger the path to the cell_ranger index that we just generated.
Finally, set the number of cores for Cell Ranger to use to
7
and the local memory limit to 24
(Gb).
#!/bin/bash
# run cellranger count (maximum CPUs 8; maximum RAM 24GB)
cellranger count <<_YOUR_CODE_HERE_>>
Here is a bit more detail about writing the
cellranger count
command:
cellranger count \
--id=<<_OUTPUT_SAMPLE_NAME_>> \
--transcriptome=Data/references/cellranger_index \
--fastqs=<<_DIRECTORY_WITH_FASTQ_FILES_>> \
--sample=<<_NAME_OF_SAMPLE_IN_FASTQ_FILES_>> \
--localcores=8 \
--localmem=24
Note that we have to write the paths relative to the results folder
we create within the script. See the example below for the
--transcriptome
option and see if you can adapt it also for
specifying the path to the FASTQ folder.
Run your script from the terminal using bash
.
The complete script would be:
#!/bin/bash
# run cellranger count (maximum CPUs 8; maximum RAM 24GB)
cellranger count \
--id=ETV6_RUNX1_rep1 \
--transcriptome=Data/references/cellranger_index \
--fastqs=Data/fastq/ \
--sample=SRR9264343 \
--localcores=8 \
--localmem=24
If we save this script in our scripts
folder as
02-cellranger.sh
, we could then run it from the terminal
with:
$ bash 02_cellranger_count.sh
Cell Ranger will create a single results folder for each sample
names. Each folder will be named according to the --id
option in the command. In the top level of the directory is a
sub-directory called outs
, this contains the results of the
analysis pipeline. There are also many intermediate and log
files/directories that will not be of immediate interest.
The contents of the sample directory will look like this:
The contents of the outs
directory are:
The two count matrix directories each contain 3 files:
The count matrix directories and their corresponding HDF5 files contain the same information, they are just alternative formats for use depending on the tools that are used for analysis. In this course we will be loading the contents of the count matrix directories into R.
The filtered count matrix only contains droplets that have been called as cells by Cell Ranger.
The Cell Ranger summary report - web_summary.html
- is a
very useful first port of call for assessing the quality of the data. An
example is linked here (See also here).
The first tab, Summary, contains various QC metrics about sequencing quality, mapping quality and cell calling. The second tab, Analysis, provides some basic analysis of droplets that Cell Ranger has identified as potentially containing cells, including some clustering and gene expression analysis. This report is interactive, allowing you to some very basic data exploration.
The report itself contains brief explanations of the contents of each section. These can be accessed by clicking the question mark icon in the header of the section. A more comprehensive explanation of the various components of the report can be found on the 10x website.
The cloupe.cloupe
file that Cell Ranger generates can be
loaded using the Loupe
Browser application provided by 10x. This provides an interactive
visualization and analysis platform that allows you to extensively
explore your data with a point and click interface.
The browser will allow you, for example, to find cells of interest through filtering and clustering tools, identify genes whose expression characterizes particular clusters of cells, or identify the type of cells in different clusters based on gene expression. While the browser is an excellent tool for data exploration and identification of potential avenues of analysis, it should not be used as a substitute for proper bioinformatic analysis. In particular, it should be remembered that Cell Ranger has not performed and QC on the contents of the droplets it has called as cells - in the next section of the course we will perform additional QC to eliminate poor quality cells.
Remember that for each read we have a sample barcode, a cell barcode, and a UMI. In addition we now have location on the genome and a corresponding gene annotation. The count for any gene for any cell is the number of unique reads detected for the combination:
sample barcode + cell barcode + UMI + gene ID
Thus we avoid counting PCR duplicate reads as these will have the same UMI. For this reason we more commonly talk about the “UMI count” rather than the “Read count”; the “Read count” could be higher than the “UMI count”
Cell Ranger performs some analysis on the data to distinguish between droplets that contain cells and empty droplets. The filtered counts matrices only contain data for droplets that Cell Ranger has determined to contain cells. It performs this filtering in two steps.
UMI threshold based filtering:
By default, Cell Ranger expects 3000 cells (you can change this in the
arguments of the Cell Ranger command). Cell Ranger takes the top 3000
droplets by UMI counts. It then takes the UMI count at the 99th
percentile of this set of cells. It then divides this number of UMIs by
10 to determine a hard UMI threshold. All droplets with a UMI count
above this threshold are called as cells.
Comparison of droplet RNA profile with background:
This algorithm is based on one initially developed in the
dropletUtils
R package. Cell Ranger selects all droplets
with a UMI count of less than 500 (this will usually be 100’s of
thousands). This set of droplets is assumed to definitely not contain
cells and as such any RNA detected is merely background within the media
(due to damaged cells leaking RNA). An RNA profile is derived for this
background. The RNA profile of all droplets with more than 500 UMIs but
less than the hard UMI threshold is then compared to this background
profile. Any droplet with a profile that differs significantly from the
background profile is called as a cell.
A detailed description of the Cell Calling algorithm, along with explanations of the other algorithms used in the alignment and countings steps, can be found on the 10x Cell Ranger pages
The plot below shows a “barcode rank plot”. This is taken from the
Cell Ranger web_summary.html
report.
This plot shows the UMI counts for all droplets, the droplets have been ranked by UMI count and coloured according to whether they have been called as cells or not. The light blue region, indicates a region where some droplets have been called as cells, but other have not - depending on the outcome of the RNA profiling step of the algorithm.