September 2018

Differential Gene Expression Analysis Workflow


Sources of Noise

Sources of Noise - Sampling Bias

Sources of Noise - Transcript Length

The length of the transcript affects the number of RNA fragments present in the library from that gene.
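A minimal sketch of the length effect, assuming fragments are sampled uniformly along transcripts so that each gene's share of the library is proportional to length times number of molecules. The gene names and numbers are hypothetical.

```python
# Hypothetical transcripts: gene -> (length in bases, number of RNA molecules).
transcripts = {
    "geneA": (1000, 50),
    "geneB": (4000, 50),  # same abundance as geneA, but four times longer
}


def expected_fragment_share(transcripts):
    """Expected share of library fragments per gene.

    Assumption: fragments are drawn uniformly along transcripts, so the
    share is proportional to (length x number of molecules).
    """
    weights = {gene: length * n for gene, (length, n) in transcripts.items()}
    total = sum(weights.values())
    return {gene: w / total for gene, w in weights.items()}


shares = expected_fragment_share(transcripts)
print(shares)
```

Under this assumption geneB contributes four times as many fragments as geneA even though both are present at the same number of molecules.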

Sources of Noise - Sequencing Artefacts

The development of larger suites of unique dual-indexes should eliminate the index swapping issue.

Normalisation

  • Counting estimates the relative counts for each gene

  • Does this accurately represent the original population of RNAs?

  • The relationship between counts and RNA expression is not the same for all genes across all samples

Library Size

Differing sequencing depth

Gene properties

Length, GC content, sequence

Library composition

Highly expressed genes overrepresented at the cost of lowly expressed genes

"Composition Bias"

Normalisation - scaling

Library Size

  • Normalise each sample by total number of reads sequenced.

  • Can also use another statistic similar to total count, e.g. median or upper quartile

  • Does not account for composition bias
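Library size scaling can be sketched as follows. The count matrix is hypothetical, and expressing each factor relative to the mean across samples is an illustrative choice rather than the convention of any particular package.

```python
import statistics

# Hypothetical raw counts: rows are genes, columns are samples.
# Sample 2 was sequenced to twice the depth of sample 1.
counts = [
    [10, 20],
    [100, 200],
    [0, 0],
    [890, 1780],
]


def column(matrix, j):
    return [row[j] for row in matrix]


def relative_to_mean(values):
    """Express each value relative to the mean of all values."""
    mean_value = sum(values) / len(values)
    return [v / mean_value for v in values]


def total_count_factors(matrix):
    """Scaling factor per sample from the total number of reads."""
    n_samples = len(matrix[0])
    return relative_to_mean([sum(column(matrix, j)) for j in range(n_samples)])


def upper_quartile_factors(matrix):
    """Scaling factor per sample from the upper quartile of non-zero counts."""
    n_samples = len(matrix[0])
    quartiles = []
    for j in range(n_samples):
        nonzero = [c for c in column(matrix, j) if c > 0]
        quartiles.append(statistics.quantiles(nonzero, n=4)[2])
    return relative_to_mean(quartiles)


def scale(matrix, factors):
    """Divide every count by the scaling factor of its sample."""
    return [[c / f for c, f in zip(row, factors)] for row in matrix]


tc_factors = total_count_factors(counts)
uq_factors = upper_quartile_factors(counts)
scaled = scale(counts, tc_factors)
```

In this toy case the two samples differ only in depth, so both statistics give the same factors and the scaled counts agree between samples. Neither statistic corrects for composition bias.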


Normalisation - Geometric mean scaling factor

  • Used by DESeq2
  1. For each gene calculate the geometric mean across all samples
  2. For each gene in each sample, normalise by dividing by the geometric mean for that gene
  3. For each sample calculate the scaling factor as the median of the normalised counts
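The three steps above can be sketched in plain Python. This is an illustration of the idea, not DESeq2's code: the counts are hypothetical, the arithmetic is done on the log scale, and genes with a zero count in any sample are skipped because their geometric mean is zero.

```python
import math
import statistics

# Hypothetical raw counts: rows are genes, columns are samples.
counts = [
    [100, 200],
    [50, 100],
    [30, 60],
    [0, 10],        # skipped: contains a zero
    [1000, 8000],   # one highly expressed gene that changes between samples
]


def size_factors(matrix):
    """Median-of-ratios scaling factor for each sample."""
    n_samples = len(matrix[0])
    usable = [row for row in matrix if all(c > 0 for c in row)]

    # Step 1: log of the geometric mean of each gene across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]

    factors = []
    for j in range(n_samples):
        # Step 2: ratio of each count to its gene's geometric mean (in logs).
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        # Step 3: the scaling factor is the median ratio for this sample.
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors


factors = size_factors(counts)
print(factors)
```

Most genes here differ two-fold between the samples, so the factors differ two-fold as well. Total counts (1180 versus 8360) would suggest a roughly seven-fold difference, driven by the single highly expressed gene, which is the composition bias that the median is robust to.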

Differential Expression

  • Comparing feature abundance under different conditions

  • Assumes linearity of signal

  • When feature=gene, well-established pre- and post-analysis strategies exist


Mortazavi, A. et al. (2008) Nature Methods

Differential Expression

Simple difference in means

Replication introduces variation

Differential Expression - Modelling population distributions

  • Normal (Gaussian) Distribution - t-test

  • Two parameters - \(mean\) and \(sd\) (\(sd^2 = variance\))

  • Suitable for microarray data but not for RNAseq data

Differential Expression - Modelling population distributions

  • Count data - Poisson distribution

  • One parameter - \(mean\) \((\lambda)\)

  • \(variance\) = \(mean\)
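The mean-variance equality of the Poisson distribution can be checked numerically. The sketch below sums over the probability mass function; the cut-off `k_max` is an arbitrary choice that is more than large enough for the small \(\lambda\) used here.

```python
import math


def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution, computed on the log scale."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def mean_and_variance(lam, k_max=200):
    """Mean and variance obtained by summing over k = 0..k_max."""
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, variance


mean, variance = mean_and_variance(5.0)
print(mean, variance)  # both are approximately equal to lambda
```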

Differential Expression - Modelling population distributions

  • Relative to their mean, RNAseq counts for lowly expressed genes vary more than for highly expressed genes

  • Use the Negative Binomial distribution

  • In the NB distribution the \(variance\) is not equal to the \(mean\)

  • Two parameters - \(mean\) and \(dispersion\)

  • \(dispersion\) describes how \(variance\) changes with \(mean\)
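A small sketch of the mean-variance relationship, assuming the common parameterisation \(variance = mean + dispersion \times mean^2\) (the form used by DESeq2). The means and the dispersion value are hypothetical.

```python
def nb_variance(mean, dispersion):
    """Negative Binomial variance: the Poisson part plus an extra term."""
    return mean + dispersion * mean ** 2


def squared_cv(mean, dispersion):
    """Squared coefficient of variation, equal to 1/mean + dispersion."""
    return nb_variance(mean, dispersion) / mean ** 2


low_var = nb_variance(10, 0.1)      # lowly expressed gene
high_var = nb_variance(1000, 0.1)   # highly expressed gene
low_cv2 = squared_cv(10, 0.1)
high_cv2 = squared_cv(1000, 0.1)
print(low_var, high_var, low_cv2, high_cv2)
```

With a dispersion of zero the variance reduces to the Poisson case. With a positive dispersion the absolute variance grows with the mean, while the variability relative to the mean is largest for lowly expressed genes and levels off at the dispersion for highly expressed ones.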

Anders, S. & Huber, W. (2010) Genome Biology

Differential Expression - estimating dispersion

  • Estimating the dispersion parameter can be difficult with a small number of samples

  • DESeq2 models the variance as the sum of technical and biological variance

  • Estimate dispersion for each gene

  • ‘Share’ dispersion information between genes to obtain fitted estimate

  • Shrink gene-wise estimates towards the fitted estimates
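The estimate, share and shrink idea can be illustrated with a deliberately simplified sketch. It is not DESeq2's procedure, which fits a dispersion trend against the mean and shrinks using an empirical Bayes approach. The assumptions here are a method-of-moments gene-wise estimate, the median across genes as a stand-in for the fitted value, and a hypothetical fixed shrinkage weight; the counts are also hypothetical.

```python
import statistics

# Hypothetical normalised counts: gene -> replicate counts in one condition.
counts = {
    "gene1": [90, 100, 110],
    "gene2": [50, 100, 150],
    "gene3": [80, 100, 120],
}


def genewise_dispersion(values):
    """Method-of-moments estimate from variance = mean + dispersion * mean^2."""
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # sample variance
    return max((variance - mean) / mean ** 2, 0.0)


def shrink(genewise, fitted, weight=0.5):
    """Move a gene-wise estimate towards the fitted value.

    `weight` is a hypothetical constant; real methods choose the amount of
    shrinkage from the data.
    """
    return weight * fitted + (1 - weight) * genewise


genewise = {gene: genewise_dispersion(v) for gene, v in counts.items()}
fitted = statistics.median(genewise.values())  # stand-in for a fitted trend
shrunk = {gene: shrink(d, fitted) for gene, d in genewise.items()}
print(genewise, fitted, shrunk)
```

With only three replicates the gene-wise estimates are noisy, which is why information is borrowed from other genes.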

Differential Expression - linear models

  • Calculate coefficients describing change in gene expression

  • Linear Model \(\rightarrow\) Generalized Linear Model
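To show what such a coefficient means, the sketch below fits an ordinary linear model to log2 counts for a single gene in a two-group design. This is a simplification: a generalized linear model for count data would model the counts directly rather than log-transforming them. The counts and the 0/1 `treated` indicator are hypothetical.

```python
import math

# Hypothetical normalised counts for one gene: three control, three treated.
counts = [100, 120, 80, 410, 390, 400]
treated = [0, 0, 0, 1, 1, 1]


def fit_simple_linear_model(y, x):
    """Ordinary least squares fit of y = b0 + b1 * x."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


log_counts = [math.log2(c) for c in counts]
intercept, log2_fold_change = fit_simple_linear_model(log_counts, treated)
print(intercept, log2_fold_change)
```

With a 0/1 indicator the intercept is the mean log2 expression in the control group and the slope is the difference between group means, i.e. the log2 fold change.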

Towards biological meaning - hierarchical clustering


Towards biological meaning - Gene Ontology testing

Towards biological meaning - Gene Set Enrichment Analysis


http://software.broadinstitute.org/gsea

Towards biological meaning - Pathway Analysis


More Depth or More Reps?


Liu et al. (2014) Bioinformatics

Thank you