Mapping reads to a reference genome

Once we are happy with the data quality, we can start to analyse the data. Usually, the first step into the analysis requires mapping the RNA-seq reads to the genome. There are numerous tools we could use to perform short read alignment and the choice should be made carefully according to the analysis goals and requirements. For RNAseq gene expression analysis HISAT2 is a very fast tool that has been shown to have a good performance on published benchmarks.

We are going to be aligning the reads from the fastq against the GRCm38 reference genome from Ensembl. This can be obtained here: ftp://ftp.ensembl.org/pub/release-97/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz

1 Indexing the genome for Hisat2

Typically our genome reference will be in FASTA format. Before we can start mapping RNA-seq reads to the genome, we need to create an index to the genome. This index allows HISAT2 to quickly search the genome for possible mapping positions for each read. It is analogous to an index in a book: if you what to find out where in the book a particular word occurs it is quicker to look at the index than to search through the book page by page, then you can jump straight to the correct page to find the exact line and position of the word.

The command to index the genome for HISAT2 is hisat2-build.

Access the help page to find the basic usage an other options:

hisat2-build --help

The Usage provides an outline of the way to use the command:

hisat2-build [options]* <reference_in> <ht2_index_base>
    reference_in            comma-separated list of files with ref sequences
    hisat2_index_base       write ht2 data to files with this dir/basename

This means that we have to use the command hisat2-build to run the programme.

Then we need to replace <reference_in> with the name of our reference genome files and replace <hisat2_index_base> with a ‘basename’ for the index files that the command will generate.

HISAT2 is going to generate a number of files for the index. The ‘basename’ is what all the name of these files will start with. e.g is the ‘basename’ we give is references/my_index then we will get a number of files called myindex.1.ht2,
myindex.2.ht2, myindex.3.ht2 …etc. inside the directory references (this directory must already exist).

Exercise 1

  1. Check your current working directory and if necessary navigate to the Course_Materials/ directory using the command cd (change directory).
  2. Use ls to list the contents of the directory.
    There should be a references directory. This will contain various reference materials that we will need throughout analysis, such as the mouse genome and gene annotations.
  3. Use ls references to list the contents of the references directory.
    There should be file called Mus_musculus.GRCm38.chr1.fa. This is the reference genome sequence for chromosome 1 in FASTA format. We are just going to work with chromosome 1 for this exercise as indexing the entire genome would take too long. You’ll also notice a directory called hisat2_index, this is the index for the entire genome, which has already been generated. We’ll be using this later.
  4. Make a directory (mkdir) inside the references directory called hisat2_index_chr1:

mkdir references/hisat2_index_chr1

This is where we will create our chr1 index. 5. To create the hisat2 index run the following command:

hisat2-build -p 7 references/Mus_musculus.GRCm38.chr1.fa references/hisat2_index_chr1/grcm38

With this command we are:

  • providing the fasta file references/Mus_musculus.GRCm38.chr1.fa for the <reference_in>
  • setting the <hisat2_index_base> to references/hisat2_index_chr1/mmu.GRCm38, so all the files will be created in the directory references/hisat2_index_chr1 and their names will start with mmu.GRCm38.
  1. Why do we use -p 7? Take a look at hisat2-build help.
  2. How many files are created?

2 Align with HISAT2

To map the reads to the genome we need to run hisat2.

We will need to provide the command with three pieces of information:

  • The path to the index files - we do this by just supplying the basename for the index as in the previous command
  • A fastq file containing our unaligned reads
  • A name for the output file

We should also instruct HISAT2 how many threads (processors) it should use (these machines have 8 processors, so we should let HISAT2 use 7 of them, keeping 1 free).

Take a quick look to HISAT2’s description

hisat2 --help

The usage is:

Usage:
  hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <sam>]

  <ht2-idx>  Index filename prefix (minus trailing .X.ht2).
  <m1>       Files with #1 mates, paired with files in <m2>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <m2>       Files with #2 mates, paired with files in <m1>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <r>        Files with unpaired reads.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <SRA accession number>        Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654.
  <sam>      File for SAM output (default: stdout)

  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.

[options]* refers the many optional parameters listed in the following Options section of the help. We do not need to worry about these for now.

After the options there are three arguments.

The first required argument is the name of the hisat file index, which should be preceded by the -x flag:

`-x <ht2-idx>`

The usage tells us that the filename prefix should be provided, that is the filename without the .X.ht2 suffix:

`<ht2-idx>  Index filename prefix (minus trailing .X.ht2).`

The chromosome 1 index that we created in Exercise 1 had 8 files:

references/hisat2_index_chr1/grcm38.1.ht2
references/hisat2_index_chr1/grcm38.2.ht2
...
references/hisat2_index_chr1/grcm38.8.ht2

So the index filename prefix would be references/hisat2_index_chr1/grcm38 .

The next argumnet is the raw read data to be aligned. The curly brackets ({...|...}) indicate that we must provide this is one of three ways:

  • with the flags -1 and -2 to give two fastq files when we have paired end reads
  • with the -U flag to just give one fastq if our reads are single end (as is the case today)
  • with the -r flag to provide an SRA accession number to download and process data directly from the NCBI’s Sequence Read Archive (SRA).

The third argument is in square brackets ([...]), which indicate that argument -S is entirely optional. This argument is used to provide a file to write the results to. If we do not provide this the default is to write to stdout, which refers to the “standard output” and means the results would just be printed on the screen.

So we need to provide:

  • the prefix of the index file after the -x flag
  • the path to out fastq file after the -U flag
  • the name for the output SAM file after the -S flag. The output file format is SAM, so the output filename should end .sam.

There many additional options that allow us to tweak the parameters used in the alignment. As a general rule, unless you really know what you are doing you should stick to the defaults.

Exercise 2

  1. Make a directory called bam (BAM will be the final format of our aligned data, but we have one more step after alignment to get there).
  2. Use HISAT2 to align the fastq file. Use the following parameters
    • Index (the full genome this time) - references/hisat2_index/mmu.GRCm38
    • Fastq file with unpaired reads - fastq/MCL1.DL.fastq.gz
    • Output file - bam/MCL1.DL.sam
    • Set the number of threads (number of processors to use) to 7 - check the help page to find the appropriate flag
    • Add the -t flag so that HISAT2 will print a time log to the console

Note: The alignment may take 5-10 minutes - go get a nice cup of tea.

3 Convert the SAM output to BAM

The output of HISAT2 is a SAM file. This is the standardized format for presenting aligned sequence data. You can read details of the specifications here:

https://samtools.github.io/hts-specs/SAMv1.pdf

Unfortunately, this is a plain text file and they tend to be very large. BAM files are the binary (compressed) version of SAM files. They are much smaller and can be indexed making them quicker to access and process.

3.1 SAM to BAM with samtools

We can transform from SAM to BAM using samtools. samtools is a toolkit that provides a number of useful tools for working with SAM/BAM files. In this case we will use the view command to transform the SAM file into a BAM file.

The general command is:

samtools view -b my_sample.sam > my_sample.bam

view is the tool for viewing a sam or bam file.

Normally it outputs in SAM format so that the results can be read by a human, in this case the -b flag tells it to output in BAM format.

By default samtools view outputs it’s results directly to the console so that we can view them. The > redirects the output to the file my_sample.bam.

3.2 Sorting a bam file

After we have transformed the SAM file to a BAM file we will want to index it, but first we should sort the file so that the reads are in order with regard to chromosome number and position.

To sort the file the general command is:

samtools sort my_sample.bam > my_sample.sorted.bam

3.3 Indexing a bam file

Finally, we should index the bam with the index command. This makes accessing accessing the data quicker for downstream tools.

To index the file the general command is:

samtools index my_sample.sorted.bam

This would create a new file called my_sample.sorted.bam.bai (bam index).

Exercise 3

  1. Transform your aligned SAM file in to a BAM file called MCL1.DL.bam. Use the option -@ 7 to use 7 cores. This vastly speeds up the compression.
  2. Sort the BAM file to a create a bam files called MCL1.DL.sorted.bam. Again use the -@ 7 options to use 7 cores.
  3. Index the sorted BAM file
  4. Now that you have a bam file have a look at the header of the file with the command:
    samtools view -H <my_sample.sorted.bam>
    Replacing “<my_sample.sorted.bam>” with your bam file.
  5. View the first few aligned sequences with the command: samtools view my_sample.sorted.bam | head
    The | symbol is known as the “pipe”, it “pipes” the output of the first command into the next command. Take a look at the SAM format specifications here. Jump to section 1.4 " The alignment section: mandatory fields" and see if you can begin to interpret the the first 11 fields of the alignment information in your bam file. Don’t forget the website Explain SAM flags which helps with decoding the flag field of the alignment.