July 2018

The many faces of RNA-seq

  • Different flavours:

    • mRNAseq

    • Targeted

    • Small RNA

    • Single Cell RNA-Seq

  • Discovery:

    • Transcripts

    • Isoforms

    • Splice junctions

    • Fusion genes

  • Differential expression:

    • Gene level expression changes

    • Relative isoform abundance

    • Splicing patterns

  • Variant calling

Workflow


Sources of Noise

Sources of Noise - Sampling Bias

Sources of Noise - Transcript Length

The length of a transcript affects the number of RNA fragments from that gene present in the library: at equal expression, a longer transcript yields more fragments and therefore more reads.
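
As a toy illustration of the length effect, the sketch below assumes that the number of fragments from a transcript is proportional to (number of molecules × transcript length). The function name and the example numbers are hypothetical, not taken from the slides.

```python
def expected_read_fractions(molecules, lengths):
    """Expected share of reads per transcript, assuming fragments are
    produced in proportion to molecules * transcript length."""
    weights = [m * l for m, l in zip(molecules, lengths)]
    total = sum(weights)
    return [w / total for w in weights]


# Two transcripts with the same number of molecules but different lengths.
fractions = expected_read_fractions(molecules=[10, 10], lengths=[1000, 4000])
print(fractions)  # the 4 kb transcript receives four times the reads of the 1 kb one
```

Under this assumption, raw counts cannot be compared between genes of different length without a length correction, although comparing the same gene across samples is unaffected.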

Sources of Noise - Sequencing Artefacts

The development of larger suites of unique dual-indexes should eliminate the index swapping issue.

Counting/Summarisation

Genome-based features

  • Exon or gene boundaries

  • Isoform structures

  • Gene multireads

Transcript-based features

  • Transcript assembly

  • Novel structures

  • Isoform multireads

HTSeq or Subread

Normalisation

  • Counting estimates the relative counts for each gene

  • Does this accurately represent the original population of RNAs?

  • The relationship between counts and RNA expression is not the same for all genes across all samples

Library Size

Differing sequencing depth

Gene properties

GC content, length, sequence

Library composition

Highly expressed genes overrepresented at the cost of lowly expressed genes

"Composition Bias"

Normalisation - scaling

Total Count

  • Normalise each sample by total number of reads sequenced.

  • Can also use another statistic similar to total count, e.g. the median or upper quartile

  • Does not account for composition bias
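
A minimal sketch of total count scaling, assuming a genes × samples count matrix held in a NumPy array. The function name and the toy counts are hypothetical. The example also shows the composition bias problem: three genes are unchanged between the samples, but one highly expressed gene in the second sample makes them look down-regulated after scaling.

```python
import numpy as np


def total_count_scale(counts):
    """Scale every sample (column) to the mean library size."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)          # total reads per sample
    factors = lib_sizes / lib_sizes.mean()  # relative sequencing depth
    return counts / factors, factors


# Rows are genes, columns are samples A and B (toy numbers).
counts = np.array([
    [100, 100],
    [100, 100],
    [100, 100],
    [100, 700],   # only this gene really differs between A and B
])

normalised, factors = total_count_scale(counts)
print(normalised)
```

After scaling, both samples have the same total, but the three unchanged genes now have a lower value in sample B than in sample A.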


Normalisation - Geometric mean scaling factor

  • Used by DESeq2
  1. For each gene calculate the geometric mean across all samples
  2. For each gene in each sample, normalise by dividing by the geometric mean for that gene
  3. For each sample calculate the scaling factor as the median of the normalised counts
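
The three steps above can be sketched as follows. This is an illustrative NumPy version, not DESeq2's own code: the function name is hypothetical, the calculation is done on the log scale, and genes with a zero count in any sample are dropped because their geometric mean would be zero.

```python
import numpy as np


def median_of_ratios_size_factors(counts):
    """Per-sample scaling factors from a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Keep genes with a non-zero count in every sample.
    usable = (counts > 0).all(axis=1)
    log_counts = np.log(counts[usable])
    # Step 1: log of the geometric mean of each gene across samples.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Step 2: log ratio of each count to its gene's geometric mean.
    log_ratios = log_counts - log_geo_mean
    # Step 3: median ratio per sample, back on the original scale.
    return np.exp(np.median(log_ratios, axis=0))


counts = np.array([
    [100, 100],
    [100, 100],
    [100, 100],
    [100, 700],   # one strongly changed gene
])

size_factors = median_of_ratios_size_factors(counts)
deeper = median_of_ratios_size_factors(counts * np.array([1, 2]))
print(size_factors, deeper)
```

In the first call the single changed gene does not move the median, so both samples get a factor of 1. In the second call sample B is sequenced twice as deeply and its factor is twice that of sample A. Normalised counts are obtained by dividing each column by its factor.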

Differential Expression

  • Comparing feature abundance under different conditions

  • Assumes linearity of signal

  • When feature=gene, well-established pre- and post-analysis strategies exist


Mortazavi, A. et al. (2008) Nature Methods

Differential Expression

Simple difference in means

Replication introduces variation

Differential Expression - Modelling population distributions

  • Normal Distribution - t-test

  • Two parameters - mean and sd

  • Suitable for microarray data but not for RNAseq data

Differential Expression - Modelling population distributions

  • Count data - Poisson distribution

  • One parameter - mean \((\lambda)\)

  • variance = mean
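
A quick simulation can be used to check the variance = mean property. The value of λ, the sample size and the seed below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate counts for one gene across many technical replicates.
lam = 50
poisson_counts = rng.poisson(lam=lam, size=100_000)

poisson_mean = poisson_counts.mean()
poisson_var = poisson_counts.var(ddof=1)
print(poisson_mean, poisson_var)  # both should be close to lam
```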

Differential Expression - Modelling population distributions

  • RNAseq counts for lowly expressed genes vary more, relative to their mean, than counts for highly expressed genes

  • Use the Negative Binomial distribution

  • In the NB distribution the variance is not equal to the mean

  • Two parameters - mean and dispersion
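
A sketch of the mean and dispersion parameterisation, assuming the form used by DESeq2 in which variance = μ + α·μ², with μ the mean and α the dispersion. NumPy's sampler takes a different pair of parameters (n, p), so the helper below, whose name is hypothetical, converts between the two.

```python
import numpy as np


def sample_negative_binomial(rng, mean, dispersion, size):
    """Draw NB counts given a mean and a dispersion (alpha).

    Conversion to NumPy's (n, p): n = 1 / alpha, p = n / (n + mean),
    which gives variance = mean + alpha * mean**2.
    """
    n = 1.0 / dispersion
    p = n / (n + mean)
    return rng.negative_binomial(n, p, size=size)


rng = np.random.default_rng(1)
mu, alpha = 100.0, 0.1
nb_counts = sample_negative_binomial(rng, mu, alpha, size=200_000)

expected_var = mu + alpha * mu**2   # 1100, versus 100 under a Poisson model
print(nb_counts.mean(), nb_counts.var(ddof=1), expected_var)
```

With α = 0 the variance would reduce to the mean, which is the Poisson case; any positive dispersion adds extra variance that grows with the square of the mean.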

Anders, S. & Huber, W. (2010) Genome Biology

Differential Expression - estimating dispersion

  • Estimating the dispersion parameter can be difficult with a small number of samples

  • DESeq2 models the variance as the sum of technical and biological variance

  • Estimate dispersion for each gene

  • ‘Share’ dispersion information between genes to obtain fitted estimate

  • Shrink gene-wise estimates towards the fitted estimates

Towards biological meaning - hierarchical clustering


Hamy et al. (2016) PLOS One

Towards biological meaning - Gene Set Enrichment Analysis


http://software.broadinstitute.org/gsea

Towards biological meaning - Network Analysis


Hamy et al. (2016) PLOS One

More Depth or More Reps?


Liu et al. (2014) Bioinformatics

Thank you