April 2021

Differential Gene Expression Analysis Workflow


Alignment

AIM: Given a reference sequence and a set of short reads, align each read to the reference to find the most likely origin of the read.
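To make the aim concrete, here is a toy sketch that slides a read along a reference and reports the placement with the fewest mismatches. It is only an illustration: the sequences and function names are made up, and real aligners such as STAR and HISAT2 use indexed, gap-aware methods rather than this brute-force, ungapped scan.

```python
from typing import Tuple


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_ungapped_hit(reference: str, read: str) -> Tuple[int, int]:
    """Return (1-based start position, mismatches) of the best ungapped placement.

    Ties are resolved in favour of the leftmost position.
    """
    best_pos, best_mm = 0, len(read) + 1
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mm = hamming(window, read)
        if mm < best_mm:
            best_pos, best_mm = start + 1, mm
    return best_pos, best_mm


# Hypothetical toy data: the read matches the reference with one substitution.
reference = "ACGTTGCATGCCGATAGGCTA"
read = "TGCCGTTAG"
position, mismatches = best_ungapped_hit(reference, read)
print(f"best start: {position}, mismatches: {mismatches}")
```

The scan costs time proportional to reference length times read length for every read, which is why practical aligners build an index of the reference first.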

Alignment - Gap-aware alignment

Aligners: STAR, HISAT2

SAM format

Sequence Alignment/Map (SAM) format is the standard format for files containing aligned reads.

Definition of the format is available at https://samtools.github.io/hts-specs/SAMv1.pdf.

Two main parts:

  • Header
    • contains metadata (source of the reads, reference genome, aligner, etc.)
    • header lines start with “@”
    • header fields have standardized two-letter codes
  • Alignment section
    • 1 line for each alignment
    • contains details of alignment position, mapping quality, base qualities, etc.
    • 11 required fields, but other content may vary depending on aligner and other tools used to create the file

SAM format - header
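As a rough illustration of the header layout, the sketch below splits header lines into a record type and its TAG:VALUE fields. The example lines and the function name are hypothetical; the record types (@HD, @SQ, @PG, @CO) and tags (VN, SO, SN, LN, ID, PN) come from the SAM specification.

```python
def parse_header_line(line: str):
    """Split one SAM header line into (record type, dict of TAG:VALUE fields)."""
    parts = line.rstrip("\n").split("\t")
    if not parts[0].startswith("@"):
        raise ValueError("not a SAM header line")
    record_type = parts[0][1:]  # drop the leading "@"
    if record_type == "CO":
        # @CO lines are free-text comments, not TAG:VALUE pairs
        return record_type, {"text": "\t".join(parts[1:])}
    fields = dict(field.split(":", 1) for field in parts[1:])
    return record_type, fields


# Hypothetical header lines, tab-separated as in a real SAM file.
example_header = [
    "@HD\tVN:1.6\tSO:coordinate",  # format version and sort order
    "@SQ\tSN:chr1\tLN:1000",       # reference sequence name and length
    "@PG\tID:hisat2\tPN:hisat2",   # program that produced the alignments
]

for header_line in example_header:
    print(parse_header_line(header_line))
```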



SAM format - alignment
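A minimal sketch of reading one alignment line is shown below. The 11 mandatory field names and the FLAG bits used (0x4 for unmapped, 0x10 for reverse strand) follow the SAM specification; the example record and the helper names are made up for illustration.

```python
MANDATORY_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_alignment_line(line: str) -> dict:
    """Parse one tab-separated SAM alignment line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(MANDATORY_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    record = {}
    for name, value in zip(MANDATORY_FIELDS, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    # Anything after column 11 is an optional TAG:TYPE:VALUE field
    record["optional"] = columns[len(MANDATORY_FIELDS):]
    return record


def is_unmapped(record: dict) -> bool:
    """FLAG bit 0x4 is set when the read did not align."""
    return bool(record["FLAG"] & 0x4)


def is_reverse_strand(record: dict) -> bool:
    """FLAG bit 0x10 is set when the read aligned to the reverse strand."""
    return bool(record["FLAG"] & 0x10)


# Hypothetical record: an 8 bp read aligned to the reverse strand of chr1.
example_line = "read1\t16\tchr1\t101\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_alignment_line(example_line)
print(record["RNAME"], record["POS"], record["CIGAR"], is_reverse_strand(record))
```

For real analyses a dedicated library or samtools is a better choice than hand-rolled parsing; this only shows how the columns are laid out.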


HISAT2

Fast, and performs well in published benchmark tests

First, generate an index for the reference genome with the hisat2-build command

Indexing does much of the heavy computation up front, so it is computationally intensive, but it only needs to be done once per reference genome

Then we can align reads to the genome with hisat2

Practical

  1. Create an index of the genome with HISAT2

  2. Align reads to the genome with HISAT2 and store the output in a SAM file

  3. Convert the SAM file (human-readable text) to BAM (binary) with samtools

  4. Index the BAM file with samtools
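
The four steps above could be scripted roughly as follows. This is a sketch under stated assumptions: all file names are placeholders, the reads are assumed to be single-end (hence `-U`), and `samtools sort` is used for step 3 because it converts SAM to BAM and also coordinate-sorts the output, which `samtools index` requires. The script only prints the commands unless `dry_run=False` is passed.

```python
import shlex
import subprocess
from typing import List


def build_commands(genome_fasta: str, reads_fastq: str,
                   index_base: str, out_prefix: str,
                   threads: int = 1) -> List[List[str]]:
    """Assemble the command lines for the four practical steps."""
    sam = f"{out_prefix}.sam"
    bam = f"{out_prefix}.sorted.bam"
    return [
        # 1. Build the HISAT2 index from the reference FASTA
        ["hisat2-build", genome_fasta, index_base],
        # 2. Align single-end reads and write the result as SAM
        ["hisat2", "-p", str(threads), "-x", index_base,
         "-U", reads_fastq, "-S", sam],
        # 3. Convert SAM to coordinate-sorted BAM
        ["samtools", "sort", "-o", bam, sam],
        # 4. Index the sorted BAM file
        ["samtools", "index", bam],
    ]


def run_commands(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Placeholder file names, not taken from the course material.
commands = build_commands("genome.fa", "reads.fastq", "genome_index", "sample")
run_commands(commands)  # dry run: prints the commands without executing them
```

For paired-end data, the `-U` option would be replaced by `-1` and `-2` with the two read files.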