October 2024
AIM: Given a reference sequence and a set of short reads, align each read to the reference sequence, finding the most likely origin of each read.
Aligners: STAR, HISAT2
Sequence Alignment/Map (SAM) format is the standard format for files containing aligned reads.
Definition of the format is available at https://samtools.github.io/hts-specs/SAMv1.pdf.
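SAM is tab-delimited text, so its layout can be explored with a few lines of Python. The sketch below is only an illustration: the two-line excerpt, the sample name and the helper `parse_sam` are hypothetical, and real files are better read with a dedicated library.

```python
# Hypothetical SAM excerpt: one header line (@RG) and one alignment record.
sam_text = (
    "@RG\tID:sample1\tSM:sample1\n"
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\n"
)

def parse_sam(text):
    """Split SAM text into header lines and alignment records."""
    headers, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("@"):
            # Header line: a record type such as @RG, then TAG:VALUE fields.
            # (Assumption: no @CO comment lines, which are free text.)
            record_type, *fields = line.split("\t")
            tags = dict(field.split(":", 1) for field in fields)
            headers.append((record_type, tags))
        else:
            # Alignment line: the mandatory fields come first, in fixed order.
            cols = line.split("\t")
            alignments.append({
                "qname": cols[0],      # read name
                "flag": int(cols[1]),  # bitwise flag
                "rname": cols[2],      # reference sequence name
                "pos": int(cols[3]),   # 1-based leftmost mapping position
                "mapq": int(cols[4]),  # mapping quality
                "cigar": cols[5],      # alignment description
                "seq": cols[9],        # read sequence
            })
    return headers, alignments

headers, alignments = parse_sam(sam_text)
print(headers)
print(alignments[0]["rname"], alignments[0]["pos"])
```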
Two main parts:
@RG for read group, used for merging BAMs together
Switch to quasi-mapping (Salmon) or pseudo-alignment (Kallisto)
If we had mapped our reads to the genome (rather than the transcript sequences), our mapping would look like this:
We also know the locations of the exons of each gene on the genome, from an annotation file (e.g. GFF or GTF).
So the simplest approach is to count how many reads overlap each gene.
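The overlap-counting idea can be sketched in a few lines. Everything here is hypothetical toy data: a single chromosome, two made-up genes with 1-based inclusive exon coordinates, and a helper `count_reads_per_gene` that skips reads hitting no gene or more than one gene. Dedicated counting tools handle strand, multiple chromosomes and ambiguity rules in far more detail.

```python
# Hypothetical annotation: gene -> exon intervals (start, end), 1-based inclusive.
genes = {
    "geneA": [(100, 200), (300, 400)],
    "geneB": [(1000, 1200)],
}

# Hypothetical aligned reads as (start, end) on the same chromosome.
reads = [(150, 199), (190, 310), (390, 450), (500, 549), (1100, 1149)]

def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end

def count_reads_per_gene(genes, reads):
    """Count reads overlapping any exon of exactly one gene."""
    counts = {gene: 0 for gene in genes}
    for r_start, r_end in reads:
        hits = [
            gene
            for gene, exons in genes.items()
            if any(overlaps(r_start, r_end, s, e) for s, e in exons)
        ]
        # Reads overlapping no gene, or several genes, are left uncounted.
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts

print(count_reads_per_gene(genes, reads))
```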
However, Salmon does not work this way. We have mapped to the transcript sequences, not the genome. Quantification is performed as part of the quasi-mapping process.
Salmon also takes account of biases:
Multimapping: Reads which map equally well to multiple locations
GC bias: Sequences with high GC content are less likely to be observed, because PCR amplifies them less efficiently.
Positional bias: for most sequencing methods, the 3′ ends of transcripts are more likely to be observed.
Complexity bias: some sequences are bound and amplified more easily than others.
Sequence-based bias: Bias in read start positions arising from the differential binding efficiency of random hexamer primers
Fragment length bias: Induced by size selection
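To give a feel for how multimapping reads can be shared between transcripts, here is a minimal expectation-maximisation sketch. It is not Salmon's actual implementation: the function `em_abundances` and the toy read data are hypothetical, and the model ignores transcript length and all of the other biases listed above.

```python
def em_abundances(read_compat, n_transcripts, n_iter=100):
    """Estimate relative transcript abundances from read compatibility.

    read_compat: one list per read, holding the indices of the transcripts
    that the read maps to equally well.
    """
    # Start from equal abundances.
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        for compat in read_compat:
            # E-step: split each read across its compatible transcripts
            # in proportion to the current abundance estimates.
            total = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: new abundances are the normalised expected read counts.
        n_reads = sum(expected)
        theta = [count / n_reads for count in expected]
    return theta

# Toy data: 6 reads unique to transcript 0, 2 unique to transcript 1,
# and 4 reads that map equally well to both.
read_compat = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
theta = em_abundances(read_compat, n_transcripts=2)
print([round(x, 3) for x in theta])  # → [0.75, 0.25]
```

In this toy case the multimapping reads end up assigned mostly to transcript 0, because the uniquely mapping reads already suggest it is the more abundant of the two.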
Because Salmon maps reads to the transcriptome, not the genome, it is not the right tool for finding new genes or isoforms.
Two essential steps