Short Read Alignment

September 2019

Differential Gene Expression Analysis Workflow

Alignment

AIM: Given a reference sequence and a set of short reads, align each read to the reference sequence finding the most likely origin of the read sequence.

Alignment - Gap-aware alignment

Aligners: STAR, HISAT2

SAM format

Sequence Alignment/Map (SAM) format is the standard format for files containing aligned reads.

The definition of the format is available at https://samtools.github.io/hts-specs/SAMv1.pdf.

Two main parts:

Header
- Contains metadata (source of the reads, reference genome, aligner, etc.)
- Header lines start with “@”.
- Header fields have standardized two-letter codes
Alignment section
- 1 line for each alignment
- Contains details of alignment position, mapping, base quality etc.
- 11 required fields, but other content may vary depending on the aligner and other tools used to create the file

SAM format - header
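
As a rough illustration of the header layout described above, the sketch below groups header lines by their two-letter record type and splits each field into its TAG:VALUE pair. The function name `parse_sam_header` and the example lines are made up for this illustration; they are not taken from a real file or from any library.

```python
def parse_sam_header(lines):
    """Group SAM header lines by their two-letter record type (HD, SQ, PG, ...)."""
    header = {}
    for line in lines:
        if not line.startswith("@"):
            break  # the header ends at the first alignment line
        record_type, _, rest = line.rstrip("\n").partition("\t")
        record_type = record_type[1:]  # drop the leading "@"
        if record_type == "CO":
            entry = rest  # @CO lines are free-text comments
        else:
            # All other header fields are TAG:VALUE; split on the first colon only
            entry = dict(field.split(":", 1) for field in rest.split("\t"))
        header.setdefault(record_type, []).append(entry)
    return header


# Hypothetical example lines (three header lines followed by one alignment line)
example_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@PG\tID:hisat2\tPN:hisat2",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
header = parse_sam_header(example_lines)
print(header["SQ"][0]["SN"], header["SQ"][0]["LN"])
```

In practice the same information can be viewed with `samtools view -H`; the sketch is only meant to show how the lines are structured.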

SAM format - alignment
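
To make the 11 required fields concrete, here is a minimal sketch that splits one tab-separated alignment line into named fields. The field names follow the SAM specification; the `SamRecord` class, the `parse_alignment` function and the example read are hypothetical and only serve as an illustration.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read (query) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string describing the alignment
    rnext: str   # reference name of the mate / next read
    pnext: int   # position of the mate / next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII, Phred+33)
    optional: tuple  # any aligner-specific TAG:TYPE:VALUE fields


def parse_alignment(line):
    """Split one SAM alignment line into its 11 required fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(cols)}")
    return SamRecord(
        cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]), cols[5],
        cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10], tuple(cols[11:]),
    )


# Hypothetical alignment line with one optional tag at the end
example_line = "read1\t99\tchr1\t1000\t60\t4M\t=\t1200\t204\tACGT\tIIII\tNH:i:1"
record = parse_alignment(example_line)
print(record.rname, record.pos, record.cigar, record.optional)
```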

Explain SAM flags
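
The FLAG field is a sum of bit values, each of which records one yes/no property of the alignment. The sketch below decodes a flag into readable descriptions using the bit values from the SAM specification; the short descriptions and the `decode_flag` helper are our own wording for illustration.

```python
# Bit values as defined in the SAM specification; descriptions are paraphrased
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the description of every bit that is set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
print(decode_flag(99))
# ['read paired', 'read mapped in proper pair', 'mate on reverse strand', 'first in pair']
```

For example, a flag of 4 means only "read unmapped", which is why `samtools view -f 4` selects unmapped reads and `-F 4` excludes them.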

HISAT2

Fast, with good performance in published benchmark tests.

First, we need to generate an index for the reference genome with the hisat2-build command

Indexing is where most of the work takes place, so it is computationally intensive

Then we can align reads to the genome with hisat2
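
The two HISAT2 steps above could be sketched as follows. The helper functions only assemble the command lines, so nothing is executed until a command is passed to `subprocess.run`. All file names, the index prefix and the thread count are placeholders, and the options shown (`-p`, `-x`, `-U`, `-1`/`-2`, `-S`) are a minimal selection rather than a recommended configuration.

```python
import shlex


def hisat2_build_cmd(reference_fasta, index_prefix, threads=1):
    """Command to build a HISAT2 index from a reference FASTA file."""
    return ["hisat2-build", "-p", str(threads), reference_fasta, index_prefix]


def hisat2_align_cmd(index_prefix, out_sam, reads_1, reads_2=None, threads=1):
    """Command to align single-end (-U) or paired-end (-1/-2) reads, writing SAM."""
    cmd = ["hisat2", "-p", str(threads), "-x", index_prefix]
    if reads_2 is None:
        cmd += ["-U", reads_1]
    else:
        cmd += ["-1", reads_1, "-2", reads_2]
    return cmd + ["-S", out_sam]


# Placeholder file names, for illustration only
build = hisat2_build_cmd("genome.fa", "genome_index", threads=4)
align = hisat2_align_cmd("genome_index", "sample.sam",
                         "sample_1.fastq.gz", "sample_2.fastq.gz", threads=4)
print(shlex.join(build))
print(shlex.join(align))
# To actually run a step: subprocess.run(build, check=True)
```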

Practical

Create an index to the genome with HISAT2
Align reads to the genome with HISAT2 -> SAM file
Convert the SAM file to BAM with samtools
Index the BAM file with samtools
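
The samtools steps of the practical might look like the sketch below. One assumption is added here that is not in the list above: `samtools index` expects a coordinate-sorted BAM file, so a `samtools sort` step is placed between conversion and indexing. File names are placeholders and the helper `samtools_commands` is hypothetical.

```python
import shlex


def samtools_commands(sam_path):
    """Commands to convert a SAM file to BAM, sort it by coordinate and index it."""
    stem = sam_path[:-4] if sam_path.endswith(".sam") else sam_path
    bam = stem + ".bam"
    sorted_bam = stem + ".sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam_path],    # SAM -> BAM
        ["samtools", "sort", "-o", sorted_bam, bam],        # sort by coordinate
        ["samtools", "index", sorted_bam],                  # writes the .bai index
    ]


# Placeholder file name, for illustration only
commands = samtools_commands("sample.sam")
for cmd in commands:
    print(shlex.join(cmd))
    # To actually run a step: subprocess.run(cmd, check=True)
```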