Introduction to Differential Gene Expression Analysis with Bulk RNAseq

November 2024

RNAseq Workflow

Library Preparation

Sequencing

Bioinformatics Analysis

Image adapted from: Wang, Z., et al. (2009), Nature Reviews Genetics, 10, 57–63.

Library preparation

Total RNA extraction

- Ribosomal RNA
- Poly-A transcripts
- Other RNAs, e.g. tRNA, miRNA

Library preparation

Poly-A Selection

Poly-A transcripts e.g.:

mRNAs
immature miRNAs
snoRNA

Ribominus selection

Poly-A transcripts + Other RNAs e.g.:

tRNAs
mature miRNAs
piRNAs

Library preparation

RNA → Reverse Transcription → cDNA …
Fragmentation - short fragments ~200-300 nt …
Adapter and Index binding …
PCR Amplification.

Differential Gene Expression Analysis Workflow

Fastq file format
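
To make the format concrete, here is a minimal sketch of reading FASTQ records. It assumes the common four-line layout (a header starting with "@", the sequence, a "+" separator, a quality string) and Phred+33 quality encoding. The read shown is made up, and the names `FastqRecord` and `parse_fastq` are illustrative, not part of any library.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: list[int]  # Phred scores, one per base


def parse_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33, i.e. score = ASCII code of the character - 33
        scores = [ord(ch) - 33 for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Hypothetical single-read example
example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
records = list(parse_fastq(example))
print(records[0].name, records[0].sequence, records[0].qualities)
```

Under this encoding "I" decodes to a score of 40 and "#" to a score of 2, so the last base of the example read is of low quality.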

QC is important

Check for any problems before we put time and effort into analysing potentially bad data

Start with FastQC
- Quick
- Outputs an easy-to-read HTML report

https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Per base sequence quality

[Figure: example FastQC plots, Good Data vs. Bad Data]

Per base sequence content

[Figure: example FastQC plots, Good Data vs. Bad Data]

Per sequence GC content

[Figure: example FastQC plots, Good Data vs. Bad Data]
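
As an illustration of what a per sequence GC content plot summarises, the sketch below computes the GC fraction of each read and bins the reads by GC percentage. This is not FastQC's implementation; the function names, the bin width and the example reads are assumptions made for the example.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of G and C among the called (A/C/G/T) bases of a read."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences, bin_width_percent=10):
    """Count reads per GC% bin (bin width is an arbitrary choice here)."""
    n_bins = 100 // bin_width_percent
    counts = [0] * n_bins
    for seq in sequences:
        percent = 100 * gc_fraction(seq)
        # Reads with 100% GC go into the last bin
        index = min(int(percent // bin_width_percent), n_bins - 1)
        counts[index] += 1
    return counts


# Made-up reads with 100%, 0%, 50% and 75% GC
reads = ["GGCC", "ATAT", "ACGT", "GCGA"]
hist = gc_histogram(reads)
print(hist)
```

In good data the histogram is expected to follow a single smooth peak; extra peaks or a shifted distribution can point to contamination or bias.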

Alignment

AIM: Given a reference sequence and a set of short reads, align each read to the reference sequence, finding the most likely origin of the read sequence.
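
The aim above can be sketched with a brute-force toy: slide the read along the reference and keep the position with the fewest mismatches. This is only an illustration under strong assumptions (no gaps, no splicing, no index); real aligners such as STAR and HISAT2 work very differently. The function name and the sequences are made up.

```python
def best_ungapped_alignment(reference: str, read: str):
    """Return (position, mismatches) for the best ungapped placement of read."""
    if len(read) > len(reference):
        raise ValueError("read is longer than the reference")
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        # Keep the first position with the lowest mismatch count
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


reference = "ACGTTAGCCATGGA"
read = "AGCGAT"  # differs by one base from reference[5:11] == "AGCCAT"
position, mismatches = best_ungapped_alignment(reference, read)
print(position, mismatches)
```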

Alignment - Splicing aware alignment

Aligners: STAR, HISAT2

Quantification

Broadly classified into two types:
- Alignment based
- Quasi-mapping or pseudoalignment based

Alignment based quantification

Traditional alignment is (relatively) slow and computationally intensive

Alignment based quantification

Counting: How many reads have come from a genomic feature?
- A genomic feature can be a gene, transcript or exon, but is usually a gene

Once the reads are mapped we know where on the genome the RNA fragment originated.

We also know the locations of exons of genes on the genome.

So the simplest approach is to count how many reads overlap each gene.
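
The counting idea can be sketched as follows. This is a simplified illustration, not how HTSeq or Subread are implemented: intervals are assumed to be half-open, strand is ignored, and a read overlapping exons of more than one gene is put in a separate bucket. The gene models, reads and bucket names are made up for the example.

```python
from collections import Counter


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def count_reads_per_gene(reads, genes):
    """reads: (chrom, start, end) tuples.
    genes: name -> (chrom, list of exon (start, end) tuples)."""
    counts = Counter({gene: 0 for gene in genes})
    for chrom, start, end in reads:
        hits = [
            gene
            for gene, (gene_chrom, exons) in genes.items()
            if gene_chrom == chrom
            and any(overlaps(start, end, ex_start, ex_end) for ex_start, ex_end in exons)
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif len(hits) > 1:
            counts["ambiguous"] += 1   # illustrative bucket name
        else:
            counts["no_feature"] += 1  # illustrative bucket name
    return counts


# Hypothetical annotation: geneA has two exons, geneB overlaps the second one
genes = {
    "geneA": ("chr1", [(100, 200), (300, 400)]),
    "geneB": ("chr1", [(380, 500)]),
}
reads = [
    ("chr1", 150, 200),  # inside geneA exon 1
    ("chr1", 190, 240),  # overlaps the end of geneA exon 1
    ("chr1", 250, 290),  # falls in the intron of geneA
    ("chr1", 390, 440),  # overlaps both geneA and geneB
    ("chr2", 150, 200),  # different chromosome
]
counts = count_reads_per_gene(reads, genes)
print(dict(counts))
```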

Quantification

Broadly classified into two types:
- Alignment based:
  - Reads must be mapped to the genome prior to quantification
  - Quantifies using a simple counting procedure
  - Pros: Intuitive
  - Cons: Slow and cannot correct biases in RNAseq data
  - Tools: HTSeq, Subread etc.
- Quasi-mapping (or pseudoalignment) based:
  - Starts from raw reads and avoids base-to-base alignment of the reads
  - Reads are mapped to the transcriptome
  - Pros: Very fast and can correct for biases
  - Cons: Not intuitive
  - Tools: kallisto, Sailfish, Salmon etc.

Quantification with Quasi-mapping (Salmon)

Pseudoalignment/quasi-mapping methods are much faster than traditional alignment
Unlike alignment-based methods, pseudoalignment methods map reads to the transcriptome (~2% of the genome in human)
They use exact k-mer matching rather than aligning whole reads with mismatches and indels
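
The exact k-mer matching idea can be sketched like this: index every k-mer of every transcript, then find the transcripts compatible with a read by intersecting the transcript sets of the read's k-mers. This is a toy version of the concept, not the data structure used by Salmon or kallisto; the function names, sequences and k = 4 are assumptions chosen for the example.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Transcripts containing every k-mer of the read (exact matches only)."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()  # some k-mer is absent or the sets do not intersect
    return compatible or set()


# Made-up transcripts sharing the subsequence "TACGGTC"
transcripts = {"tx1": "ACGTACGGTC", "tx2": "TTACGGTCAA"}
index = build_kmer_index(transcripts, k=4)

shared = compatible_transcripts("ACGGTC", index, k=4)   # found in both
unique = compatible_transcripts("GTACGG", index, k=4)   # only in tx1
absent = compatible_transcripts("GGGGGG", index, k=4)   # in neither
print(shared, unique, absent)
```

No base-by-base alignment is computed: a read is assigned a set of compatible transcripts by dictionary lookups alone, which is where the speed comes from.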

Quantification with Quasi-mapping (Salmon)

Salmon does not simply count reads, but uses a dual-phase parallel modelling and inference algorithm along with bias models to estimate expression at the transcript level.

Salmon also takes account of biases:

- Multimapping: reads which map equally well to multiple locations.
- GC bias: sequences with higher GC content are less likely to be observed, as PCR is less efficient on high-GC sequences.
- Positional bias: for most sequencing methods, the 3' end of transcripts is more likely to be observed.
- Complexity bias: some sequences are bound and amplified more readily than others.
- Sequence-based bias: bias in read start positions arising from the differential binding efficiency of random hexamer primers.
- Fragment length bias: induced by size selection.

Methods like Salmon attempt to mitigate the effect of technical biases by estimating sample-specific bias parameters.
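
To show how multimapping reads can be shared out between transcripts, here is a simplified expectation-maximisation (EM) sketch. It is not Salmon's algorithm and includes none of the bias models listed above; it only illustrates the general idea of assigning ambiguous reads in proportion to current abundance estimates. The function name, read sets and effective lengths are hypothetical.

```python
def em_abundances(read_compat, lengths, n_iter=100):
    """Estimate reads per transcript with a basic EM.

    read_compat: list of sets, the transcripts each read is compatible with.
    lengths: transcript name -> effective length.
    """
    transcripts = list(lengths)
    # Start by sharing all reads equally between transcripts
    counts = {t: len(read_compat) / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        # Abundance per unit of transcript length
        rates = {t: counts[t] / lengths[t] for t in transcripts}
        new_counts = {t: 0.0 for t in transcripts}
        for compat in read_compat:
            total = sum(rates[t] for t in compat)
            if total == 0:
                continue  # no compatible transcript has support
            for t in compat:
                # Each read is split in proportion to the current rates
                new_counts[t] += rates[t] / total
        counts = new_counts
    return counts


# Hypothetical data: 6 reads unique to tx1, 2 unique to tx2, 4 multimapping
read_compat = [{"tx1"}] * 6 + [{"tx2"}] * 2 + [{"tx1", "tx2"}] * 4
lengths = {"tx1": 1000.0, "tx2": 1000.0}
estimates = em_abundances(read_compat, lengths)
print(estimates)
```

With equal lengths, the four ambiguous reads end up split 3:1 in favour of tx1, mirroring the 6:2 ratio of uniquely mapping reads.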

Quantification with Quasi-mapping (Salmon)

Salmon outputs one folder for each sample.
The main quantification output is called “quant.sf”

Quantification with Quasi-mapping (Salmon)

The output “quant.sf” contains:
- Name — The name of the target transcript.
- Length — Length of the target transcript in nucleotides.
- EffectiveLength — Effective length of the target transcript.
- TPM — Estimate of the relative abundance of this transcript in Transcripts Per Million (TPM).
- NumReads — Estimate of the number of reads mapping to each transcript.

                  https://salmon.readthedocs.io/en/latest/file_formats.html
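
A minimal sketch of reading a quant.sf file, assuming it is tab-separated with the five columns listed above as its header row. The rows below are made up. The second function recomputes TPM from NumReads and EffectiveLength using the usual TPM definition, to show how the columns relate; the function names are illustrative.

```python
import csv
import io


def read_quant_sf(handle):
    """Parse a quant.sf-style table into a list of typed dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "Name": row["Name"],
            "Length": int(row["Length"]),
            "EffectiveLength": float(row["EffectiveLength"]),
            "TPM": float(row["TPM"]),
            "NumReads": float(row["NumReads"]),
        })
    return rows


def tpm_from_counts(rows):
    """TPM_i = 1e6 * (reads_i / eff_len_i) / sum_j (reads_j / eff_len_j)."""
    rates = [r["NumReads"] / r["EffectiveLength"] for r in rows]
    total = sum(rates)
    return [1e6 * rate / total for rate in rates]


# Hypothetical two-transcript file
example = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "txA\t1200\t1000.0\t250000.0\t100.0\n"
    "txB\t700\t500.0\t750000.0\t150.0\n"
)
rows = read_quant_sf(example)
recomputed = tpm_from_counts(rows)
print(recomputed)
```

In practice each sample has its own quant.sf, so the same function would be applied once per sample folder.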

QC of aligned reads

Alignment Rate
Duplication Rate
Insert Size
Transcript coverage

Picard Tools:

https://broadinstitute.github.io/picard/

QC of aligned reads - Alignment Rate

Depends on:
- Quality of the reference genome
- Quality of library prep and sequencing

For human and mouse, expect > 95%

QC of aligned reads - Duplication Rate

Duplicate reads (identical sequences or alignments) could be:
- “Real” - coming from different original RNAs
- PCR duplicates
- Optical duplicates (not really an issue with the latest technologies)
The human exome is ~30 Mb, therefore there are ~60 million possible reads
Duplication rates in RNAseq can be > 60% depending on depth of sequencing
Using paired-end reads (typical these days) greatly reduces duplication rates
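
A duplication rate can be sketched as the fraction of reads whose alignment has already been seen. The definition below is an assumption made for illustration (reads with the same chromosome, start position and strand count as duplicates); tools such as Picard apply more detailed criteria. The alignments are made up.

```python
from collections import Counter


def duplication_rate(alignments):
    """Fraction of reads that repeat an already seen (chrom, position, strand).

    For paired-end data the key could also include the mate position,
    which makes chance collisions between distinct fragments less likely.
    """
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)  # every copy beyond the first
    return duplicates / total


# Hypothetical alignments: three reads share one position and strand
alignments = [("chr1", 100, "+")] * 3 + [
    ("chr1", 100, "-"),
    ("chr1", 250, "+"),
    ("chr2", 100, "+"),
]
rate = duplication_rate(alignments)
print(round(rate, 3))
```

Note that this cannot distinguish “real” duplicates from PCR duplicates, which is why high rates in RNAseq are not necessarily a problem.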

QC of aligned reads - Insert Size

Insert size is the length of the fragment of mRNA from which the reads are derived

QC of aligned reads - Transcript coverage

Case Study