July 2020

HTS Applications - Overview

DNA Sequencing

  • Genome Assembly

  • SNPs/SVs/CNVs

  • DNA methylation

  • DNA-protein interactions (ChIP-seq)

  • Chromatin Modification (ATAC-seq/ChIP-seq)

RNA Sequencing

  • Transcriptome Assembly

  • Differential Gene Expression

  • Fusion Genes

  • Splice variants

Single-Cell

  • RNA/DNA

  • Low-level RNA/DNA detection

  • Cell-type classification

  • Dissection of heterogeneous cell populations

RNAseq Workflow

Library Preparation

Sequencing

Bioinformatics Analysis

Image adapted from: Wang, Z., et al. (2009), Nature Reviews Genetics, 10, 57–63.

Coverage: How many reads do we need?


The coverage is defined as:

\(\frac{Read\,Length\;\times\;Number\,of\,Reads}{Length\,of\,Target\,Sequence}\)
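As a rough illustration, the formula above can be evaluated directly. This is a minimal sketch: the function names `coverage` and `reads_needed` and the example numbers are assumptions chosen for illustration, not part of the original material.

```python
import math


def coverage(read_length: int, n_reads: int, target_length: int) -> float:
    """Mean coverage = (read length x number of reads) / length of target."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return read_length * n_reads / target_length


def reads_needed(target_coverage: float, read_length: int, target_length: int) -> int:
    """Invert the formula: reads required to reach a given mean coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * target_length / read_length)


# Hypothetical example: 20 million 50 bp reads over a 100 Mb target sequence.
print(coverage(50, 20_000_000, 100_000_000))  # 10.0
print(reads_needed(10, 50, 100_000_000))      # 20000000
```

For paired-end data, each mate would count as a separate read in this calculation.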

The amount of sequencing needed for a given sample is determined by the goals of the experiment and the nature of the RNA sample.

  • For a general view of differential expression: 5–25 million reads per sample.
  • For alternative splicing and lowly expressed genes: 30–60 million reads per sample.
  • For an in-depth view of the transcriptome or to assemble new transcripts: 100–200 million reads.
  • Targeted RNA expression requires fewer reads.
  • miRNA-Seq or small RNA analysis requires even fewer reads.

Read length

Long or short reads? Paired or Single end?

The answer depends on the experiment:

  • Gene expression – a short single-end read, e.g. SE50, is sufficient.
  • k-mer-based quantification of gene expression (Salmon etc.) – benefits from paired-end (PE) reads.
  • Transcriptome analysis – longer paired-end reads (such as 2 x 75 bp).
  • Small RNA analysis – a short single-end read, e.g. SE50; reads will need adapter trimming.

Sources of Noise

Sources of Noise - Sampling Bias

Sources of Noise - Transcript Length

The length of a transcript affects the number of RNA fragments in the library from that gene: at the same expression level, a longer transcript yields more fragments and therefore more reads.
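A toy calculation can make the length effect concrete. The sketch below assumes two hypothetical genes with identical copy numbers and an assumed fixed fragment size; all names and numbers are illustrative only. Dividing by transcript length (fragments per kilobase) is one simple way to remove the effect, shown here only to make the point.

```python
# Hypothetical toy data: same number of transcript copies, different lengths.
genes = {
    "geneA": {"length_bp": 1000, "copies": 100},
    "geneB": {"length_bp": 4000, "copies": 100},
}
FRAGMENT_SIZE = 200  # assumed fragment length in bp


def expected_fragments(length_bp: int, copies: int, fragment_size: int) -> int:
    """Each transcript copy is cut into length // fragment_size fragments."""
    return copies * (length_bp // fragment_size)


# Raw fragment counts scale with transcript length.
raw = {
    name: expected_fragments(g["length_bp"], g["copies"], FRAGMENT_SIZE)
    for name, g in genes.items()
}

# Dividing by length in kilobases removes the length effect.
per_kb = {name: raw[name] / (genes[name]["length_bp"] / 1000) for name in genes}

print(raw)     # {'geneA': 500, 'geneB': 2000}
print(per_kb)  # {'geneA': 500.0, 'geneB': 500.0}
```

Both genes are expressed at the same level, yet geneB contributes four times as many fragments before length is taken into account.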

Sources of Noise - Sequencing Artefacts

Capturing Variance - Replication

Biological Replication

  • Measures the biological variation between individuals

  • Accounts for sampling bias

Technical Replication

  • Measures the variation in response quantification due to imprecision in the technique

  • Accounts for technical noise

Capturing Variance - Replication

Biological Replication

Each replicate is from an independent biological individual

  • In Vivo:

    • Patients
    • Mice
  • In Vitro:

    • Different cell lines
    • Different passages

Capturing Variance - Replication

Technical Replication

Replicates are from the same individual but processed separately

  • Experimental protocol
  • Measurement platform

More Depth or More Reps?


Liu et al. (2014) Bioinformatics

Controlling batch effects

  • Batch effects are sub-groups of measurements that have qualitatively different behavior across conditions and are unrelated to the biological or scientific variables in a study.

  • Batch effects are problematic if they are confounded with the experimental variable.

  • Batch effects that are randomly distributed across experimental variables can be controlled for.

  • Randomise all technical steps in data generation in order to avoid batch effects
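One way to randomise while keeping batches unconfounded with the experimental variable is to shuffle samples within each condition and then spread them evenly over the batches. The sketch below is only an illustration under assumed inputs: the function name `assign_batches`, the sample names, and the batch count are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Randomly assign samples to batches, balanced within each condition.

    samples: list of (sample_id, condition) tuples.
    Returns a dict mapping sample_id to a batch index (0 .. n_batches - 1).
    """
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    assignment = {}
    offset = 0
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)  # random order within the condition
        for i, sample_id in enumerate(ids):
            # Deal samples round-robin so every batch gets a mix of conditions.
            assignment[sample_id] = (i + offset) % n_batches
        offset += len(ids)
    return assignment


# Hypothetical design: 6 control and 6 treated samples, 3 processing batches.
samples = [(f"ctrl_{i}", "control") for i in range(1, 7)]
samples += [(f"trt_{i}", "treated") for i in range(1, 7)]

assignment = assign_batches(samples, n_batches=3)

# Cross-tabulate batch against condition to check for confounding.
table = Counter((assignment[sid], cond) for sid, cond in samples)
for (batch, cond), n in sorted(table.items()):
    print(f"batch {batch}  {cond:8s} {n}")
```

If any batch contained only one condition in the cross-tabulation, the batch effect would be confounded with the experimental variable and could not be separated from it afterwards.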

Alignment

  • Mature mRNA does not contain introns, so reads can span exon-exon junctions

  • When aligning to the genome we need to use a splice-aware aligner and provide gene definitions (GTF), e.g.:

    • HISAT2 (uses Bowtie2)
    • STAR
    • Or pseudoaligners, which work against the transcriptome, e.g. Salmon or kallisto
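The need for a splice-aware aligner can be seen with a toy example. The sequences below are made up, and `spliced_match` is a deliberately naive, hypothetical helper that tries every split point of the read; it is not how HISAT2 or STAR work internally, only a sketch of the problem they solve.

```python
# Made-up toy sequences: two exons separated by an intron.
exon1 = "ATGGCCAAGT"
intron = "GTAAGTCCCTTTTCAG"
exon2 = "GGATCCTTGA"

genome = exon1 + intron + exon2  # genomic DNA still contains the intron
mrna = exon1 + exon2             # the spliced transcript does not

read = mrna[5:15]  # a read spanning the exon-exon junction


def spliced_match(read, genome, min_anchor=3):
    """Naive spliced search: split the read in two and look for the left part
    followed, after a gap, by the right part.

    Returns (left_start, left_end, right_start) or None.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        start = genome.find(left)
        while start != -1:
            left_end = start + len(left)
            right_start = genome.find(right, left_end + 1)
            if right_start != -1:
                return start, left_end, right_start
            start = genome.find(left, start + 1)
    return None


print(read in mrna)     # True: the read matches the transcript contiguously
print(read in genome)   # False: a contiguous genomic match fails
print(spliced_match(read, genome))  # (5, 10, 26)
```

The gap between positions 10 and 26 in the result is exactly the intron, which is what a splice-aware aligner has to infer for junction-spanning reads.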

Counting/Summarisation

Genome-based features

  • Exon or gene boundaries

  • Isoform structures

  • Gene multireads

Transcript-based features

  • Transcript assembly

  • Novel structures

  • Isoform multireads

Tools: HTSeq (htseq-count) or Subread (featureCounts)
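The idea behind gene-level counting can be sketched in a few lines. This is a simplified illustration, not the algorithm used by HTSeq or Subread: the gene models, read coordinates, and the `__ambiguous` / `__no_feature` labels are assumptions for this example, with intervals treated as 0-based and half-open.

```python
from collections import Counter

# Hypothetical gene models: gene name -> list of (start, end) exon intervals.
genes = {
    "geneA": [(100, 200), (300, 400)],
    "geneB": [(350, 500)],
}

# Hypothetical aligned reads as (start, end) intervals on the same sequence.
reads = [(120, 170), (180, 230), (310, 340), (360, 410), (450, 500), (600, 650)]


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def count_reads(reads, genes):
    """Count reads per gene; reads hitting several genes are not assigned."""
    counts = Counter({gene: 0 for gene in genes})
    for r_start, r_end in reads:
        hits = {
            gene
            for gene, exons in genes.items()
            if any(overlaps(r_start, r_end, s, e) for s, e in exons)
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
        elif hits:
            counts["__ambiguous"] += 1   # overlaps more than one gene
        else:
            counts["__no_feature"] += 1  # overlaps no annotated exon
    return counts


print(dict(count_reads(reads, genes)))
# {'geneA': 3, 'geneB': 1, '__ambiguous': 1, '__no_feature': 1}
```

Real counting tools also handle strand, multi-mapping reads, and paired-end fragments, which this sketch ignores.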

Differential Gene Expression Analysis Workflow