November 2024

RNAseq Workflow

Library Preparation

Sequencing

Bioinformatics Analysis

Image adapted from: Wang, Z., et al. (2009), Nature Reviews Genetics, 10, 57–63.

Library preparation

Total RNA extraction

- Ribosomal RNA

- Poly-A transcripts

- Other RNAs, e.g. tRNA, miRNA

Library preparation

Poly-A Selection

Poly-A transcripts e.g.:

  • mRNAs
  • immature miRNAs
  • snoRNA

Ribominus selection

Poly-A transcripts + other RNAs e.g.:

  • tRNAs
  • mature miRNAs
  • piRNAs

Library preparation

  1. RNA → Reverse Transcription → cDNA …
  2. Fragmentation - short fragments ~200-300 nt …
  3. Adapter and Index binding …
  4. PCR Amplification.

Differential Gene Expression Analysis Workflow


Fastq file format
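
As a rough illustration of the format, the sketch below reads four-line FASTQ records and averages base quality at each read position, which is the quantity shown in a per base sequence quality plot. It assumes Phred+33 quality encoding and unwrapped records; the function names and the two example reads are invented for illustration.

```python
import io
from typing import Iterator, List, Tuple


def read_fastq(handle) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"Malformed FASTQ header: {header!r}")
        # Assumption: Phred+33 encoding (quality = ASCII code - 33).
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


def mean_quality_per_position(records) -> List[float]:
    """Average the Phred score at each read position across all records."""
    totals: List[int] = []
    counts: List[int] = []
    for _, _, scores in records:
        for i, score in enumerate(scores):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += score
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# Two made-up reads of different length.
EXAMPLE_FASTQ = "@read1\nACGT\n+\nIIII\n@read2\nACG\n+\n5I#\n"
records = list(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
per_base = mean_quality_per_position(records)
print(per_base)
```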

QC is important

Per base sequence quality

[Figure: example plots for good data and bad data]

Per base sequence content

[Figure: example plots for good data and bad data]

Per sequence GC content

[Figure: example plots for good data and bad data]
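
The per sequence GC content check can be sketched as below: compute the GC percentage of each read and bin the values into a histogram. This is a minimal illustration with invented helper names and toy sequences, not the procedure of any particular QC tool; ignoring ambiguous bases (N) and using 10% bins are assumptions.

```python
from collections import Counter


def gc_percent(sequence: str) -> float:
    """Percentage of called bases (A, C, G, T) that are G or C."""
    sequence = sequence.upper()
    called = sum(sequence.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # assumption: reads with no called bases report 0%
    return 100.0 * (sequence.count("G") + sequence.count("C")) / called


def gc_histogram(sequences, bin_width: int = 10) -> dict:
    """Count reads per GC% bin; each key is the lower edge of its bin."""
    histogram = Counter()
    for sequence in sequences:
        lower_edge = int(gc_percent(sequence) // bin_width) * bin_width
        # Put exactly 100% GC into the top bin rather than a bin of its own.
        lower_edge = min(lower_edge, 100 - bin_width)
        histogram[lower_edge] += 1
    return dict(sorted(histogram.items()))


toy_reads = ["ATAT", "ATGC", "GGCC", "GCGC"]
print(gc_histogram(toy_reads))
```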

Alignment

AIM: Given a reference sequence and a set of short reads, align each read to the reference sequence, finding the most likely origin of each read.

Alignment - Splice-aware alignment
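
The aim can be illustrated with a brute-force toy: slide the read along the reference and keep the position with the fewest mismatches. This is only a sketch of the idea. Real aligners such as STAR and HISAT2 use indexed data structures and handle indels and splicing; the function names and sequences below are invented.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_alignment(read: str, reference: str):
    """Return (position, mismatches) of the best ungapped placement of the read.

    Ties are resolved in favour of the leftmost position.
    Returns (-1, len(read) + 1) if the read is longer than the reference.
    """
    best_position, best_mismatches = -1, len(read) + 1
    for position in range(len(reference) - len(read) + 1):
        window = reference[position:position + len(read)]
        mismatches = hamming(read, window)
        if mismatches < best_mismatches:
            best_position, best_mismatches = position, mismatches
    return best_position, best_mismatches


reference = "TTGACCATGGCATTAGC"
print(best_alignment("CATGGC", reference))  # exact match
print(best_alignment("CATGAC", reference))  # one sequencing error
```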

Aligners: STAR, HISAT2

Quantification

  • Broadly classified into two types:
    • Alignment based:
      • Reads must be mapped to the genome prior to quantification
      • Quantifies using a simple counting procedure
      • Pros: Intuitive
      • Cons: Slow and cannot correct biases in RNAseq data
      • Tools: HTSeq, Subread etc.
    • Quasi-mapping or pseudoalignment based …

Alignment based quantification

  • Traditional alignment is (relatively) slow and computationally intensive

Alignment based quantification

Counting: How many reads have come from a genomic feature?
* A genomic feature can be a gene, transcript or exon, but is usually a gene

Once the reads are mapped, we know where on the genome the RNA fragment originated.

We also know the locations of exons of genes on the genome.

So the simplest approach is to count how many reads overlap each gene.
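
The simplest counting approach described above can be sketched as follows. The coordinates, gene names and the rule of discarding reads that overlap more than one gene are assumptions made for illustration; tools such as HTSeq and Subread offer several ways of handling ambiguous reads.

```python
def count_reads_per_gene(reads, genes):
    """Count reads overlapping the exons of exactly one gene.

    reads: list of (chrom, start, end) intervals, half-open [start, end)
    genes: dict mapping gene name -> list of (chrom, start, end) exons
    """
    counts = {gene: 0 for gene in genes}
    for chrom, read_start, read_end in reads:
        hits = {
            gene
            for gene, exons in genes.items()
            for exon_chrom, exon_start, exon_end in exons
            if exon_chrom == chrom and read_start < exon_end and exon_start < read_end
        }
        # Assumption: reads hitting no gene or several genes are not counted.
        if len(hits) == 1:
            counts[hits.pop()] += 1
    return counts


genes = {
    "geneA": [("chr1", 100, 200), ("chr1", 300, 400)],
    "geneB": [("chr1", 380, 500)],
}
reads = [
    ("chr1", 150, 200),  # geneA
    ("chr1", 190, 310),  # geneA
    ("chr1", 390, 440),  # overlaps geneA and geneB: ambiguous
    ("chr1", 450, 500),  # geneB
    ("chr2", 150, 200),  # no gene on this chromosome
    ("chr1", 200, 250),  # intronic: no exon overlap
]
print(count_reads_per_gene(reads, genes))
```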

Quantification

  • Broadly classified into two types …
    • Alignment based:
      • Reads must be mapped to the genome prior to quantification
      • Quantifies using a simple counting procedure
      • Pros: Intuitive
      • Cons: Slow and cannot correct biases in RNAseq data
      • Tools: HTSeq, Subread etc.
    • Quasi-mapping (or pseudoalignment) based:
      • Starts from raw reads and avoids base-to-base alignment of the reads
      • Reads are mapped to the transcriptome
      • Pros: Very fast and can correct for biases
      • Cons: Not intuitive
      • Tools: kallisto, Sailfish, Salmon etc.

Quantification with Quasi-mapping (Salmon)

  • Pseudoalignment/quasi-alignment methods are much faster than traditional mapping
  • Unlike alignment based methods, pseudoalignment methods focus on the transcriptome (~2% of the genome in human)
  • They use exact k-mer matching rather than aligning whole reads with mismatches and indels
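
The exact k-mer matching idea can be sketched with a toy index: record which transcripts contain each k-mer, then intersect those sets over the k-mers of a read. This is not Salmon's actual algorithm or data structure. The names, sequences and the rule that any unmatched k-mer makes the read unassignable are simplifying assumptions.

```python
from collections import defaultdict


def build_kmer_index(transcripts: dict, k: int) -> dict:
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, sequence in transcripts.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index


def compatible_transcripts(read: str, index: dict, k: int) -> set:
    """Transcripts that contain every k-mer of the read (exact matches only)."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            return set()  # toy rule: an unseen k-mer means no assignment
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


transcripts = {"tx1": "ACGTACGGAT", "tx2": "TTACGGATCC"}
index = build_kmer_index(transcripts, k=4)
print(compatible_transcripts("TACGGAT", index, k=4))  # shared region
print(compatible_transcripts("CGTACG", index, k=4))   # only in tx1
```

Reads compatible with more than one transcript, like the first example, are the multimapping case that the quantification model then has to resolve.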

Quantification with Quasi-mapping (Salmon)

Salmon does not simply count reads, but uses a dual-phase parallel modelling and inference algorithm along with bias models to estimate expression at the transcript level.

Salmon also takes account of biases:

  • Multimapping: reads which map equally well to multiple locations.

  • GC bias: sequences with higher GC content are less likely to be observed, as PCR is less efficient on high-GC sequences.

  • Positional bias: for most sequencing methods, the 3′ end of a transcript is more likely to be observed.

  • Complexity bias: some sequences are bound and amplified more readily than others.

  • Sequence-based bias: bias in read start positions arising from the differential binding efficiency of random hexamer primers.

  • Fragment length bias: induced by size selection.

Methods like Salmon attempt to mitigate the effect of technical biases by estimating sample-specific bias parameters.

Quantification with Quasi-mapping (Salmon)

  • Salmon outputs one folder for each sample.
  • The main quantification output is called “quant.sf”

Quantification with Quasi-mapping (Salmon)

  • The output “quant.sf” contains:

    • Name — The name of the target transcript.
    • Length — Length of the target transcript in nucleotides.
    • EffectiveLength — Effective length of the target transcript.
    • TPM — Estimate of the relative abundance of this transcript in Transcripts Per Million (TPM).
    • NumReads — Estimate of the number of reads mapping to each transcript.

                  https://salmon.readthedocs.io/en/latest/file_formats.html
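
A minimal sketch of reading these columns, assuming the file is tab-separated with the header names listed above. The three transcripts and their values are invented. The second function applies the standard TPM definition (reads per effective base, scaled to sum to one million) to show how TPM, NumReads and EffectiveLength relate; Salmon's reported values come from its own inference, so this is illustrative only.

```python
import csv
import io


def read_quant_sf(handle):
    """Parse a tab-separated quant.sf stream into a list of typed dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "Name": row["Name"],
            "Length": int(row["Length"]),
            "EffectiveLength": float(row["EffectiveLength"]),
            "TPM": float(row["TPM"]),
            "NumReads": float(row["NumReads"]),
        })
    return rows


def tpm_from_numreads(rows):
    """Recompute TPM as (NumReads / EffectiveLength), scaled to sum to 1e6."""
    rates = [row["NumReads"] / row["EffectiveLength"] for row in rows]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    return [1e6 * rate / total for rate in rates]


# Invented example content, not real Salmon output.
EXAMPLE_QUANT = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "txA\t1000\t800.0\t500000.0\t400.0\n"
    "txB\t2000\t1800.0\t500000.0\t900.0\n"
    "txC\t500\t300.0\t0.0\t0.0\n"
)
quant_rows = read_quant_sf(io.StringIO(EXAMPLE_QUANT))
print(tpm_from_numreads(quant_rows))
```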

QC of aligned reads

QC of aligned reads - Alignment Rate

  • Depends on:
    • Quality of reference genome
    • Quality of library prep and sequencing
  • For human and mouse, expect > 95%

QC of aligned reads - Duplication Rate

  • Duplicate reads (identical sequences or alignments) could be:
    • “Real” - coming from different original RNAs
    • PCR duplicates
    • Optical duplicates (rarely an issue with the latest technologies)
  • The human exome is ~30 Mb, therefore there are ~60 million possible reads
  • Duplication rates in RNAseq can be > 60% depending on depth of sequencing
  • Using paired-end reads (typical these days) greatly reduces duplication rates
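
As a rough sketch, a duplication rate can be computed as the fraction of alignments that repeat an already-seen alignment. Defining a duplicate by identical (chromosome, position, strand) is an assumption for illustration, and the alignments below are invented. The calculation cannot tell "real" duplicates from PCR duplicates.

```python
def duplication_rate(alignments) -> float:
    """Fraction of alignments that duplicate an earlier identical alignment.

    alignments: list of hashable keys, e.g. (chrom, position, strand).
    """
    total = len(alignments)
    if total == 0:
        return 0.0
    unique = len(set(alignments))
    return (total - unique) / total


single_end = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # duplicate
    ("chr1", 100, "-"),  # same position, other strand: not a duplicate here
    ("chr1", 250, "+"),
    ("chr1", 250, "+"),  # duplicate
]
print(duplication_rate(single_end))

# With paired-end reads the key can include the mate position, so fragments
# that share one end but differ at the other are no longer called duplicates.
paired_end = [
    ("chr1", 100, "+", 320),
    ("chr1", 100, "+", 355),
    ("chr1", 250, "+", 480),
    ("chr1", 250, "+", 480),  # duplicate
]
print(duplication_rate(paired_end))
```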

QC of aligned reads - Insert Size

  • Insert size is the length of the fragment of mRNA from which the reads are derived

QC of aligned reads - Transcript coverage


Case Study