November 2022

Differential Gene Expression Analysis Workflow


General idea behind RNAseq data analysis

General idea behind any statistical test

Sources of Noise (Variance)

Normalisation

  • Counting estimates the relative counts for each gene

  • Does this accurately represent the original population of RNAs?

  • The relationship between counts and RNA expression is not the same for all genes across all samples

Library Size

Differing sequencing depth

Gene properties

Length, GC content, sequence

Library composition

Quantification is relative - changes in relative abundance for one gene will affect the relative abundances of other genes

“Composition Bias”

General principle behind normalisation

  • Normalization has two steps
    • Scaling
      • First get size factors (normalization factors)
      • Usually one size factor per sample
      • Scale the counts by dividing the raw counts of each sample by its sample-specific size factor
    • Transformation: transform the data after scaling
      • Per million
      • log2
      • square root transformation
      • Pearson residuals (e.g. sctransform)
  • Normalization removes technical variance but not biological variance
  • Normalization helps in making two samples comparable
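
The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than any package's implementation: the count matrix is made up, the size factor is taken to be the library size in millions, and the pseudocount of 1 before the log is an assumption chosen to avoid log2(0).

```python
import numpy as np

# Hypothetical toy counts: rows are genes, columns are samples.
# Sample 2 is sample 1 sequenced at twice the depth.
counts = np.array([
    [10, 20],
    [100, 200],
    [890, 1780],
], dtype=float)

# Scaling: one size factor per sample.
# Here the size factor is simply the library size in millions (assumption).
library_size = counts.sum(axis=0)
size_factors = library_size / 1e6
cpm = counts / size_factors  # counts per million

# Transformation: log2 after scaling, with a pseudocount of 1 (assumption).
log_cpm = np.log2(cpm + 1)

print(size_factors)
print(cpm)
print(log_cpm)
```

After scaling, the two samples have identical values for every gene, which is the sense in which normalization makes samples comparable.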

Normalization toy example

DESeq2 analysis workflow


DESeq2 Normalisation

  1. The geometric mean is calculated for each gene across all samples.
  2. The counts for a gene in each sample are then divided by this mean.
  3. The median of these ratios in a sample is the size factor (normalization factor) for that sample.
  4. DESeq2 normalization corrects for library size and RNA composition bias.
  5. Composition bias arises, for example, when a small number of genes are very highly expressed in one sample but not in the other.
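
The median-of-ratios steps can be sketched as follows. This is an illustrative NumPy version, not the DESeq2 code itself; the function name `median_of_ratios_size_factors` and the toy counts are made up for the example. Genes with a zero count in any sample are skipped because their geometric mean is zero.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Sketch of median-of-ratios size factors (genes x samples matrix)."""
    counts = np.asarray(counts, dtype=float)
    # Keep only genes with a non-zero count in every sample.
    expressed = (counts > 0).all(axis=1)
    log_counts = np.log(counts[expressed])
    # Step 1: per-gene geometric mean, computed on the log scale.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Step 2: ratio of each count to the gene's geometric mean.
    log_ratios = log_counts - log_geo_mean
    # Step 3: per-sample median of the ratios is the size factor.
    return np.exp(np.median(log_ratios, axis=0))

# Hypothetical counts: four genes are unchanged, while one gene is
# very highly expressed in sample 2 only (composition bias).
counts = np.array([
    [100, 100],
    [200, 200],
    [300, 300],
    [400, 400],
    [50, 5000],
])

size_factors = median_of_ratios_size_factors(counts)
normalized = counts / size_factors

print("library sizes:", counts.sum(axis=0))
print("size factors: ", size_factors)
```

In this toy case the library sizes differ almost six-fold, yet both size factors are 1 because the median ignores the single outlying gene. Scaling by total counts instead would make the four unchanged genes look lower in sample 2.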

Differential Expression

Simple difference in means

Replication introduces variation

Differential Expression - Modelling population distributions

  • Normal (Gaussian) Distribution - t-test

  • Two parameters - \(mean\) and \(sd\) (\(sd^2 = variance\))

  • Suitable for microarray data but not for RNAseq data

Differential Expression - Modelling population distributions

  • Count data - Poisson distribution

  • One parameter - \(mean\) \((\lambda)\)

  • \(variance\) = \(mean\)

Differential Expression - Modelling population distributions

  • Use the Negative Binomial distribution

  • In the NB distribution the \(variance\) is not constrained to equal the \(mean\)

  • Two parameters - \(mean\) and \(dispersion\)

  • \(dispersion\) describes how \(variance\) changes with \(mean\)
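
A small simulation can illustrate the difference between the two count distributions. It assumes the parametrisation used by DESeq2, in which variance = mean + dispersion × mean². The helper names, the mean of 100 and the dispersion of 0.1 are arbitrary choices for the example.

```python
import numpy as np

def nb_variance(mean, dispersion):
    """Negative Binomial variance under the mean/dispersion parametrisation."""
    return mean + dispersion * mean ** 2

def simulate_nb_counts(mean, dispersion, size, rng):
    """Draw NB counts by converting mean/dispersion to NumPy's (n, p)."""
    n = 1.0 / dispersion
    p = n / (n + mean)
    return rng.negative_binomial(n, p, size=size)

rng = np.random.default_rng(0)
mean, dispersion = 100.0, 0.1

nb_counts = simulate_nb_counts(mean, dispersion, 200_000, rng)
poisson_counts = rng.poisson(mean, 200_000)

print("Poisson: mean", poisson_counts.mean(), "variance", poisson_counts.var())
print("NB:      mean", nb_counts.mean(), "variance", nb_counts.var())
print("NB variance expected from the formula:", nb_variance(mean, dispersion))
```

Both simulated samples share the same mean, but the Negative Binomial counts are far more variable, which is the extra spread between biological replicates that a Poisson model cannot capture.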

Anders, S. & Huber, W. (2010) Genome Biology

Differential Expression - estimating dispersion

  • Estimating the dispersion parameter can be difficult with a small number of samples

  • DESeq2 models the variance as the sum of technical and biological variance

  • Estimate dispersion for each gene

  • ‘Share’ dispersion information between genes to obtain a fitted estimate

  • Shrink gene-wise estimates towards the fitted estimates

Differential Expression - worrying dispersion plot examples


Bad dispersion plots from: https://github.com/hbctraining/DGE_workshop

Differential Expression - linear models

  • Calculate coefficients describing change in gene expression

  • Linear Model \(\rightarrow\) Generalized Linear Model
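
To show what "coefficients describing change in gene expression" means, here is a minimal sketch for one gene using ordinary least squares on log2 expression values. It is a deliberate simplification: DESeq2 fits a Negative Binomial model with a log link on counts rather than least squares on log values. The expression values and the two-group design are hypothetical.

```python
import numpy as np

# Hypothetical log2 normalized expression for one gene:
# three control samples followed by three treated samples.
log_expr = np.array([5.0, 5.2, 4.8, 7.1, 6.9, 7.0])
treated = np.array([0, 0, 0, 1, 1, 1])

# Design matrix: an intercept column plus a treatment indicator column.
design = np.column_stack([np.ones(len(treated)), treated]).astype(float)

# Least-squares fit: the first coefficient is the control mean,
# the second is the treated-minus-control difference (log2 fold change).
coef, *_ = np.linalg.lstsq(design, log_expr, rcond=None)
intercept, log2_fold_change = coef

print("intercept (control mean):", intercept)
print("log2 fold change:", log2_fold_change)
```

With a single indicator column the treatment coefficient is just the difference in group means. The same design-matrix idea extends to additional columns for batches or other factors.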

Towards biological meaning - hierarchical clustering


Towards biological meaning - Gene Ontology testing

Towards biological meaning - Gene Set Enrichment Analysis


http://software.broadinstitute.org/gsea

Towards biological meaning - Pathway Analysis


More Depth or More Reps?


Liu et al. (2014) Bioinformatics

Thank you