Once we are happy with the data quality, we can start to analyse the data. Usually, the first step into the analysis requires mapping the RNA-seq reads to the genome. There are numerous tools we could use to perform short read alignment and the choice should be made carefully according to the analysis goals and requirements. For RNAseq gene expression analysis HISAT2 is a very fast tool that has been shown to have a good performance on published benchmarks.
We are going to be aligning the reads from the fastq against the GRCm38 reference genome from Ensembl. This can be obtained here: ftp://ftp.ensembl.org/pub/release-97/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
Typically our genome reference will be in FASTA format. Before we can start mapping RNA-seq reads to the genome, we need to create an index to the genome. This index allows HISAT2 to quickly search the genome for possible mapping positions for each read. It is analogous to an index in a book: if you what to find out where in the book a particular word occurs it is quicker to look at the index than to search through the book page by page, then you can jump straight to the correct page to find the exact line and position of the word.
The command to index the genome for HISAT2 is hisat2-build
.
Access the help page to find the basic usage an other options:
hisat2-build --help
The Usage provides an outline of the way to use the command:
hisat2-build [options]* <reference_in> <ht2_index_base>
reference_in comma-separated list of files with ref sequences
hisat2_index_base write ht2 data to files with this dir/basename
This means that we have to use the command hisat2-build
to run the programme.
Then we need to replace <reference_in>
with the name of our reference genome files and replace <hisat2_index_base>
with a ‘basename’ for the index files that the command will generate.
HISAT2 is going to generate a number of files for the index. The ‘basename’ is what all the name of these files will start with. e.g is the ‘basename’ we give is references/my_index
then we will get a number of files called myindex.1.ht2
,
myindex.2.ht2
, myindex.3.ht2
…etc. inside the directory references
(this directory must already exist).
- Check your current working directory and if necessary navigate to the
Course_Materials/
directory using the commandcd
(change directory).- Use
ls
to list the contents of the directory.
There should be areferences
directory. This will contain various reference materials that we will need throughout analysis, such as the mouse genome and gene annotations.- Use
ls references
to list the contents of thereferences
directory.
There should be file calledMus_musculus.GRCm38.chr1.fa
. This is the reference genome sequence for chromosome 1 in FASTA format. We are just going to work with chromosome 1 for this exercise as indexing the entire genome would take too long. You’ll also notice a directory calledhisat2_index
, this is the index for the entire genome, which has already been generated. We’ll be using this later.- Make a directory (
mkdir
) inside thereferences
directory calledhisat2_index_chr1
:
mkdir references/hisat2_index_chr1
This is where we will create our chr1 index. 5. To create the hisat2 index run the following command:
hisat2-build -p 7 references/Mus_musculus.GRCm38.chr1.fa references/hisat2_index_chr1/grcm38
With this command we are:
- providing the fasta file references/Mus_musculus.GRCm38.chr1.fa for the
<reference_in>
- setting the
<hisat2_index_base>
to references/hisat2_index_chr1/mmu.GRCm38, so all the files will be created in the directory references/hisat2_index_chr1 and their names will start withmmu.GRCm38
.
- Why do we use
-p 7
? Take a look athisat2-build
help.- How many files are created?
To map the reads to the genome we need to run hisat2
.
We will need to provide the command with three pieces of information:
We should also instruct HISAT2 how many threads (processors) it should use (these machines have 8 processors, so we should let HISAT2 use 7 of them, keeping 1 free).
Take a quick look to HISAT2’s description
hisat2 --help
The usage is:
Usage:
hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <sam>]
<ht2-idx> Index filename prefix (minus trailing .X.ht2).
<m1> Files with #1 mates, paired with files in <m2>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<m2> Files with #2 mates, paired with files in <m1>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<r> Files with unpaired reads.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<SRA accession number> Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654.
<sam> File for SAM output (default: stdout)
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.
[options]*
refers the many optional parameters listed in the following Options
section of the help. We do not need to worry about these for now.
After the options there are three arguments.
The first required argument is the name of the hisat file index, which should be preceded by the -x
flag:
`-x <ht2-idx>`
The usage tells us that the filename prefix should be provided, that is the filename without the .X.ht2
suffix:
`<ht2-idx> Index filename prefix (minus trailing .X.ht2).`
The chromosome 1 index that we created in Exercise 1 had 8 files:
references/hisat2_index_chr1/grcm38.1.ht2
references/hisat2_index_chr1/grcm38.2.ht2
...
references/hisat2_index_chr1/grcm38.8.ht2
So the index filename prefix would be references/hisat2_index_chr1/grcm38
.
The next argumnet is the raw read data to be aligned. The curly brackets ({...|...}
) indicate that we must provide this is one of three ways:
-1
and -2
to give two fastq files when we have paired end reads-U
flag to just give one fastq if our reads are single end (as is the case today)-r
flag to provide an SRA accession number to download and process data directly from the NCBI’s Sequence Read Archive (SRA).The third argument is in square brackets ([...]
), which indicate that argument -S
is entirely optional. This argument is used to provide a file to write the results to. If we do not provide this the default is to write to stdout
, which refers to the “standard output” and means the results would just be printed on the screen.
So we need to provide:
-x
flag-U
flag-S
flag. The output file format is SAM, so the output filename should end .sam
.There many additional options that allow us to tweak the parameters used in the alignment. As a general rule, unless you really know what you are doing you should stick to the defaults.
- Make a directory called
bam
(BAM will be the final format of our aligned data, but we have one more step after alignment to get there).- Use HISAT2 to align the fastq file. Use the following parameters
- Index (the full genome this time) -
references/hisat2_index/mmu.GRCm38
- Fastq file with unpaired reads -
fastq/MCL1.DL.fastq.gz
- Output file -
bam/MCL1.DL.sam
- Set the number of threads (number of processors to use) to 7 - check the help page to find the appropriate flag
- Add the
-t
flag so that HISAT2 will print a time log to the consoleNote: The alignment may take 5-10 minutes - go get a nice cup of tea.
The output of HISAT2 is a SAM file. This is the standardized format for presenting aligned sequence data. You can read details of the specifications here:
https://samtools.github.io/hts-specs/SAMv1.pdf
Unfortunately, this is a plain text file and they tend to be very large. BAM files are the binary (compressed) version of SAM files. They are much smaller and can be indexed making them quicker to access and process.
We can transform from SAM to BAM using samtools
. samtools
is a toolkit that provides a number of useful tools for working with SAM/BAM files. In this case we will use the view
command to transform the SAM file into a BAM file.
The general command is:
samtools view -b my_sample.sam > my_sample.bam
view
is the tool for viewing a sam or bam file.
Normally it outputs in SAM format so that the results can be read by a human, in this case the -b
flag tells it to output in BAM format.
By default samtools view
outputs it’s results directly to the console so that we can view them. The >
redirects the output to the file my_sample.bam
.
After we have transformed the SAM file to a BAM file we will want to index it, but first we should sort the file so that the reads are in order with regard to chromosome number and position.
To sort the file the general command is:
samtools sort my_sample.bam > my_sample.sorted.bam
Finally, we should index the bam with the index
command. This makes accessing accessing the data quicker for downstream tools.
To index the file the general command is:
samtools index my_sample.sorted.bam
This would create a new file called my_sample.sorted.bam.bai
(bam index).
- Transform your aligned SAM file in to a BAM file called
MCL1.DL.bam
. Use the option-@ 7
to use 7 cores. This vastly speeds up the compression.- Sort the BAM file to a create a bam files called
MCL1.DL.sorted.bam
. Again use the-@ 7
options to use 7 cores.- Index the sorted BAM file
- Now that you have a bam file have a look at the header of the file with the command:
samtools view -H <my_sample.sorted.bam>
Replacing “<my_sample.sorted.bam>
” with your bam file.- View the first few aligned sequences with the command:
samtools view my_sample.sorted.bam | head
The|
symbol is known as the “pipe”, it “pipes” the output of the first command into the next command. Take a look at the SAM format specifications here. Jump to section 1.4 " The alignment section: mandatory fields" and see if you can begin to interpret the the first 11 fields of the alignment information in your bam file. Don’t forget the website Explain SAM flags which helps with decoding the flag field of the alignment.