October 2024
This session covers:
Before trusting your results, make sure your data makes sense.
Exploratory Data Analysis (EDA) serves two complementary purposes:
- Quality control
- Early biological insight
Exploratory analyses - boxplots, PCA, clustering - help catch issues quickly.
RNA-seq produces count data: after alignment and quantification, each transcript or gene gets a count - the number of reads assigned to it.
Key properties of count data:
Raw counts have a very wide range - from zero to tens of thousands - and are heavily skewed.
A simple log transformation compresses the range and makes the shape of the distribution easier to see. We often add a pseudocount of 1 before taking the log, because log(0) = -Inf.
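As a minimal sketch of the log transformation, the snippet below applies log2(count + 1) to a handful of counts. The counts are made up for illustration, and base 2 is an assumption (the choice of base only rescales the values).

```python
import math

# Hypothetical raw counts for one sample, spanning zero to tens of thousands.
raw_counts = [0, 3, 25, 410, 52000]

# log2(count + 1): the pseudocount of 1 maps a zero count to 0 instead of -Inf.
log_counts = [math.log2(c + 1) for c in raw_counts]

for raw, logged in zip(raw_counts, log_counts):
    print(f"{raw:>6} -> {logged:5.2f}")
```

The five-orders-of-magnitude spread in the raw counts collapses to a range of roughly 0 to 16 on the log scale.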
In RNA-seq data, genes with higher average expression also tend to show higher variability between samples.
Clustering and PCA are distance-based, so the most variable genes will dominate the result unless this mean-variance relationship is accounted for.
DESeq2 provides two transformations that simultaneously:
| | Regularised log | Variance Stabilising Transformation |
|---|---|---|
| Function | rlog() | vst() |
| Speed | Slower | Faster |
| Best for | Small datasets, unequal library sizes | Large datasets (>30-50 samples) |
After transformation, genes across the expression spectrum contribute more “equally” to downstream analyses.
TPM and FPKM correct for gene length and sequencing depth - but have a critical flaw for comparing samples.
Normalising to the total count means a massively up-regulated gene makes all others appear artificially lower.
See:
Evans C, Hardin J, Stoebel DM (2018). Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bbx008
Zhao S, Ye Z, Stanton R (2020). Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols. RNA. https://doi.org/10.1261/rna.074922.120
Zhao Y, Li MC, Konaté M, Chen L, Das B, Karlovich C, Williams PM, Evrard YA, Doroshow JH, McShane LM (2021). TPM, FPKM, or Normalized Counts? A Comparative Study of Quantification Measures for the Analysis of RNA-seq Data from the NCI Patient-Derived Models Repository. Journal of Translational Medicine. https://doi.org/10.1186/s12967-021-02936-w
Fig. 1 in Evans et al. illustrates the problem particularly well.
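The effect of normalising to the total can be sketched with a toy example. Everything here is hypothetical: the `tpm` helper, the gene lengths, and the counts are invented for illustration, and the two samples are constructed so that only the fourth gene changes.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million: divide by gene length, then scale the sample to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Hypothetical gene lengths in kilobases.
lengths_kb = [1.0, 2.0, 1.5, 1.0]

# Hypothetical counts: genes 1-3 are identical in both samples,
# gene 4 is massively up-regulated in the treated sample.
control = [100, 200, 150, 100]
treated = [100, 200, 150, 10000]

tpm_control = tpm(control, lengths_kb)
tpm_treated = tpm(treated, lengths_kb)

for i, (c, t) in enumerate(zip(tpm_control, tpm_treated), start=1):
    print(f"gene {i}: control TPM = {c:10.1f}, treated TPM = {t:10.1f}")
```

Genes 1-3 have the same counts in both samples, yet their TPM values fall sharply in the treated sample, because each sample is forced to sum to one million and gene 4 takes up most of that total.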
DESeq2 calculates size factors using the median-of-ratios method, which is robust to highly expressed or differentially expressed (DE) genes.
How it works:
The key assumption: most genes are not differentially expressed, so the median ratio captures genuine library size differences.
| Gene | Sample A | Sample B | Geometric mean | Ratio A | Ratio B |
|---|---|---|---|---|---|
| G1 | 20 | 40 | 28.3 | 0.71 | 1.41 |
| G2 | 40 | 80 | 56.6 | 0.71 | 1.41 |
| G3 | 10 | 20 | 14.1 | 0.71 | 1.41 |
| G4 | 230 | 160 | 191.8 | 1.20 | 0.83 |
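Notice how the highly expressed gene G4 has a different ratio from the other genes. Because each size factor is the median of a sample's ratios (0.71 for Sample A, 1.41 for Sample B), G4 does not affect it.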
After dividing each sample’s counts by its size factor, samples are comparable - without assuming total counts should be equal.
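The table above can be reproduced with a short sketch of the median-of-ratios calculation: take each gene's geometric mean across samples, divide each count by it, and use the median ratio per sample as the size factor. This is an illustrative re-implementation in plain Python using the table's counts, not DESeq2's own code, and the variable names are assumptions.

```python
import math
from statistics import median

genes = ["G1", "G2", "G3", "G4"]
counts = {"A": [20, 40, 10, 230], "B": [40, 80, 20, 160]}
samples = list(counts)

# Step 1: per-gene geometric mean across samples (a pseudo-reference sample).
# Genes with a zero count in any sample would have to be excluded here;
# this toy table has none.
geo_means = [
    math.prod(counts[s][i] for s in samples) ** (1 / len(samples))
    for i in range(len(genes))
]

# Steps 2 and 3: ratio of each count to the geometric mean,
# then the median ratio per sample is that sample's size factor.
size_factors = {
    s: median(c / g for c, g in zip(counts[s], geo_means)) for s in samples
}

# Normalised counts: raw counts divided by the sample's size factor.
normalised = {s: [c / size_factors[s] for c in counts[s]] for s in samples}

for s in samples:
    print(s, round(size_factors[s], 2), [round(x, 1) for x in normalised[s]])
```

After normalisation G1-G3 take the same value in both samples, while G4 remains different, which is the behaviour one would want if G4 is the only gene that genuinely changes.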
VST/rlog can be used for:
Differential expression uses raw counts.
For cross-sample comparisons, avoid TPM/FPKM.
PCA is an unsupervised dimensionality reduction method - it finds the directions of greatest variation in high-dimensional data.
PCA considers all genes simultaneously and summarises the major patterns in a small number of axes.
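To make "directions of greatest variation" concrete, here is a minimal sketch that finds the first principal component of a tiny matrix by power iteration on the gene covariance matrix. The expression values and sample names are hypothetical, and the power-iteration approach is chosen only to keep the example dependency-free; in practice a dedicated PCA routine would be used.

```python
import math

# Hypothetical transformed expression values: rows are samples, columns are genes.
samples = ["ctrl1", "ctrl2", "trt1", "trt2"]
expr = [
    [5.0, 8.0, 3.0],
    [5.2, 8.1, 3.1],
    [7.9, 8.0, 1.0],
    [8.1, 7.9, 1.2],
]

def centre_columns(matrix):
    """Subtract each gene's mean so PCA describes variation around the average."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def covariance(centred):
    """Gene-by-gene sample covariance matrix of the centred data."""
    n = len(centred)
    cols = list(zip(*centred))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols] for ci in cols]

def leading_component(cov, iterations=200):
    """Power iteration: the direction of greatest variance and the variance along it."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(c * x for c, x in zip(row, v)) for row in cov]
    return v, sum(a * b for a, b in zip(v, cv))

centred = centre_columns(expr)
cov = covariance(centred)
pc1, pc1_variance = leading_component(cov)

# Project each sample onto PC1 and compute the fraction of variance it explains.
pc1_scores = [sum(x * w for x, w in zip(row, pc1)) for row in centred]
explained = pc1_variance / sum(cov[i][i] for i in range(len(cov)))

for name, score in zip(samples, pc1_scores):
    print(f"{name}: PC1 = {score:+.2f}")
print(f"PC1 explains {explained:.1%} of the variance")
```

In this toy data the two control samples land on one side of PC1 and the two treated samples on the other, with almost all of the variance captured by that single axis.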
A well-behaved experiment:
Warning signs to look for:
Hierarchical clustering and sample-sample correlation heatmaps are also commonly used to explore sample relationships.
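A sample-sample correlation matrix, the input to such a heatmap, can be sketched as follows. The expression values and sample names are hypothetical, and Pearson correlation is an assumption here; other correlation or distance measures are also used.

```python
import math

# Hypothetical transformed expression values per sample (one value per gene).
expr = {
    "ctrl1": [5.0, 8.0, 3.0, 6.0],
    "ctrl2": [5.2, 8.1, 3.1, 6.1],
    "trt1": [7.9, 8.0, 1.0, 6.2],
    "trt2": [8.1, 7.9, 1.2, 5.9],
}

def pearson(x, y):
    """Pearson correlation between two equally long lists of values."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

# Pairwise correlation between every pair of samples.
corr = {a: {b: pearson(expr[a], expr[b]) for b in expr} for a in expr}

print("       " + "  ".join(f"{name:>6}" for name in expr))
for a in expr:
    print(f"{a:>6} " + "  ".join(f"{corr[a][b]:6.3f}" for b in expr))
```

Samples from the same group correlate more strongly with each other than with samples from the other group, which is the block structure a correlation heatmap would display.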
Samples often do not cluster clearly by treatment group in PCA or hierarchical clustering.
This can be due to: