Introduction to RNAseq Methods

July 2020

HTS Applications - Overview

DNA Sequencing

Genome Assembly
SNPs/SVs/CNVs
DNA methylation
DNA-protein interactions (ChIPseq)
Chromatin Modification (ATAC-seq/ChIPseq)

RNA Sequencing

Transcriptome Assembly
Differential Gene Expression
Fusion Genes
Splice variants

Single-Cell

RNA/DNA
Low-level RNA/DNA detection
Cell-type classification
Dissection of heterogeneous cell populations

RNAseq Workflow

Library Preparation

Sequencing

Bioinformatics Analysis

Image adapted from: Wang, Z., et al. (2009), Nature Reviews Genetics, 10, 57–63.

Coverage: How many reads do we need?

The coverage is defined as:

\(\frac{Read\,Length\;\times\;Number\,of\,Reads}{Length\,of\,Target\,Sequence}\)
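As a worked example of the formula above, here is a minimal Python sketch. The function names and the 100 Mb target length are illustrative assumptions, not values taken from these slides.

```python
import math


def coverage(read_length: int, n_reads: int, target_length: int) -> float:
    """Average coverage = read length x number of reads / target length."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return read_length * n_reads / target_length


def reads_needed(target_coverage: float, read_length: int, target_length: int) -> int:
    """Rearranged formula: how many reads give the requested coverage."""
    return math.ceil(target_coverage * target_length / read_length)


# Hypothetical target of 100 Mb sequenced with 20 million 50 bp reads.
print(coverage(50, 20_000_000, 100_000_000))   # 10.0
print(reads_needed(10, 50, 100_000_000))       # 20000000
```

For RNAseq the "target" is the expressed transcriptome, whose effective size varies by sample, so figures like these are only a rough guide.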

The amount of sequencing needed for a given sample is determined by the goals of the experiment and the nature of the RNA sample.

For a general view of differential expression: 5–25 million reads per sample.
For alternative splicing and lowly expressed genes: 30–60 million reads per sample.
For an in-depth view of the transcriptome or assembly of new transcripts: 100–200 million reads.
Targeted RNA expression requires fewer reads.
miRNA-Seq or small RNA analysis requires even fewer reads.

Read length

Long or short reads? Paired or Single end?

The answer depends on the experiment:

Gene expression – a short single-end read, e.g. SE 50, is sufficient.
k-mer-based quantification of gene expression (Salmon etc.) – benefits from paired-end reads.
Transcriptome analysis – longer paired-end reads (such as 2 x 75 bp).
Small RNA analysis – a short single-end read, e.g. SE 50; the reads will need adapter trimming.

Sources of Noise

Sources of Noise - Sampling Bias

Sources of Noise - Transcript Length

The length of the transcript affects the number of RNA fragments present in the library from that gene.
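The effect of transcript length can be illustrated with a small Python sketch. It assumes, as a simplification, that fragments are sampled in proportion to (number of molecules x transcript length); the gene names and numbers are hypothetical.

```python
def expected_fragment_share(transcripts):
    """transcripts: {name: (n_molecules, length_bp)} -> {name: share of fragments}.

    Simplifying assumption: each transcript contributes fragments in
    proportion to n_molecules * length_bp.
    """
    weights = {name: n * length for name, (n, length) in transcripts.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def per_kilobase(counts, lengths_bp):
    """Divide raw counts by transcript length in kb to remove the length effect."""
    return {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}


# Two hypothetical genes with the same number of RNA molecules.
transcripts = {"geneA": (100, 1000), "geneB": (100, 4000)}
shares = expected_fragment_share(transcripts)
print(shares)  # geneB yields four times as many fragments as geneA

# Hypothetical raw counts reflecting those shares.
counts = {"geneA": 200, "geneB": 800}
lengths = {"geneA": 1000, "geneB": 4000}
print(per_kilobase(counts, lengths))  # both genes: 200.0 per kb
```

The length effect matters when comparing different genes within a sample; it largely cancels when the same gene is compared between samples.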

Sources of Noise - Sequencing Artefacts

Capturing Variance - Replication

Biological Replication

Measures the biological variation between individuals
Accounts for sampling bias

Technical Replication

Measures the variation in response quantification due to imprecision in the technique
Accounts for technical noise

Capturing Variance - Replication

Biological Replication

Each replicate is from an independent biological individual

In Vivo:
- Patients
- Mice
In Vitro:
- Different cell lines
- Different passages

Capturing Variance - Replication

Technical Replication

Replicates are from the same individual but processed separately

Experimental protocol
Measurement platform

More Depth or More Reps?

Liu et al. (2014) Bioinformatics

Controlling batch effects

Batch effects are sub-groups of measurements that have qualitatively different behavior across conditions and are unrelated to the biological or scientific variables in a study.
Batch effects are problematic if they are confounded with the experimental variable.
Batch effects that are randomly distributed across experimental variables can be controlled for.

Randomise all technical steps in data generation so that batch effects are not confounded with the experimental variable.
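One way to randomise samples across processing batches while keeping conditions balanced is sketched below in Python. The function names, sample labels and batch count are hypothetical; this is an illustration of the idea, not a prescribed procedure.

```python
import random
from collections import defaultdict


def assign_batches(samples, n_batches, seed=0):
    """samples: list of (sample_id, condition) -> {sample_id: batch index}.

    Samples are shuffled within each condition and then dealt round-robin
    across batches, so each batch receives a mix of conditions.
    """
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)
    assignment = {}
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)
        for i, sample_id in enumerate(ids):
            assignment[sample_id] = i % n_batches
    return assignment


def fully_confounded(assignment, condition_of):
    """True if every batch contains samples from only one condition."""
    conditions_in_batch = defaultdict(set)
    for sample_id, batch in assignment.items():
        conditions_in_batch[batch].add(condition_of[sample_id])
    return all(len(conds) == 1 for conds in conditions_in_batch.values())


samples = [(f"ctrl_{i}", "control") for i in range(4)]
samples += [(f"trt_{i}", "treated") for i in range(4)]
condition_of = dict(samples)

randomised = assign_batches(samples, n_batches=2)
# A design to avoid: all controls in batch 0, all treated in batch 1.
bad_design = {s: (0 if c == "control" else 1) for s, c in samples}

print(fully_confounded(randomised, condition_of))  # False
print(fully_confounded(bad_design, condition_of))  # True
```

In the confounded design, a difference between batches cannot be separated from a difference between conditions; in the randomised design, batch can be included as a covariate in the analysis.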

Alignment

Mature mRNA does not contain introns, so reads can span exon-exon junctions.
When aligning to the genome we need to use a splice-aware aligner and provide gene definitions (GTF), e.g.:
- HISAT2 (builds on the Bowtie2 implementation)
- STAR
- Or use pseudoaligners against the transcriptome, e.g. Salmon or Kallisto
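The toy Python sketch below shows why a read that crosses an exon-exon junction cannot be placed contiguously on the genome but can be placed once it is split across the intron. The sequences and the `spliced_match` function are invented for illustration; real aligners such as HISAT2 and STAR use indexed, scoring-based algorithms, not this brute-force search.

```python
def spliced_match(read, genome, min_anchor=3):
    """Return (read_start, intron_start, intron_end) for the first placement found.

    intron_start and intron_end are None for a contiguous match.
    Returns None if the read cannot be placed, even with one split.
    """
    pos = genome.find(read)
    if pos != -1:
        return (pos, None, None)
    # Try every split point, keeping at least min_anchor bases on each side.
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        while left_pos != -1:
            # The right part must start downstream, leaving a gap (the intron).
            right_pos = genome.find(right, left_pos + split + 1)
            if right_pos != -1:
                return (left_pos, left_pos + split, right_pos)
            left_pos = genome.find(left, left_pos + 1)
    return None


# Invented toy sequences: two exons separated by one intron.
exon1, intron, exon2 = "ATGGCCAAGT", "GTAAGTCCCCTTTTAG", "CTGGAGGATC"
genome = exon1 + intron + exon2
mrna = exon1 + exon2           # the intron is spliced out
read = mrna[5:15]              # spans the exon1/exon2 junction

print(genome.find(read))            # -1: no contiguous match
print(spliced_match(read, genome))  # (5, 10, 26)
```

The split placement recovers the intron boundaries (positions 10 to 26 in the toy genome), which is the information a splice-aware aligner reports for junction-spanning reads.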

Counting/Summarisation

Genome-based features

Exon or gene boundaries
Isoform structures
Gene multireads

Transcript-based features

Transcript assembly
Novel structures
Isoform multireads

Tools: HTSeq (htseq-count) or Subread (featureCounts)
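Gene-level counting can be pictured with the simplified Python sketch below. It counts a read for a gene only when the read overlaps exactly one gene and tracks ambiguous and unassigned reads separately. The coordinates, gene names and the `count_reads` function are hypothetical, and the sketch ignores strand, exon structure and mapping quality, all of which real counting tools handle.

```python
from collections import Counter


def count_reads(reads, genes):
    """Count reads per gene using half-open [start, end) intervals.

    reads: list of (chrom, start, end)
    genes: {name: (chrom, start, end)}
    """
    counts = Counter({name: 0 for name in genes})
    for chrom, start, end in reads:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and end > g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif len(hits) > 1:
            counts["__ambiguous"] += 1    # overlaps more than one gene
        else:
            counts["__no_feature"] += 1   # overlaps no annotated gene
    return counts


genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
reads = [
    ("chr1", 120, 170),    # geneA only
    ("chr1", 460, 510),    # overlaps geneA and geneB
    ("chr1", 600, 650),    # geneB only
    ("chr1", 1000, 1050),  # intergenic
    ("chr2", 120, 170),    # different chromosome
]
counts = count_reads(reads, genes)
print(dict(counts))
```

How ambiguous reads are treated is a design choice that differs between tools and modes, so the choice made here should not be read as the behaviour of any particular program.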

Differential Gene Expression Analysis Workflow