Data Exploration

October 2024

Differential Gene Expression Analysis Workflow

Overview

This session covers:

Why explore your data first?
Properties of RNA-seq count data
Normalisation: why TPM/FPKM aren’t enough
Data transformations (rlog / VST)
Principal Component Analysis (PCA)
Hierarchical clustering
Interpreting (or not) clusters

EDA: Quality Control and Biological Insight

Before trusting your results, make sure your data makes sense.

Exploratory Data Analysis (EDA) serves two complementary purposes:

Quality control

Are the samples what we think they are?
Are there outliers, swapped labels, or batch effects?
Does the overall signal look as expected?

Early biological insight

Which experimental factors drive the largest variation?
Do replicates behave consistently?
Are there unexpected groupings that suggest confounders?

What Can Go Wrong?

Exploratory analysis - boxplots, PCA, clustering - help catch issues quickly.

Properties of Count Data

RNA-seq produces count data: After alignment/quantification, each transcript/gene gets a count - the number of reads assigned to it.

Key properties of count data:

Counts are non-negative integers
Distribution is right-skewed - a few genes with very high counts
Variance is not constant - it scales with the mean

Distribution Skew

Raw counts have a very wide range - from zero to tens of thousands - and are heavily skewed.

A simple log transformation helps to compress the range and make distribution pattern more visible. We often add a pseudocount of 1 to avoid the issue of log(0) = -Inf.

The Mean-Variance Problem

In RNA-seq data, genes with higher average expression also tend to show higher variability between samples.

Clustering and PCA are distance-based - highly variable genes will dominate if not taken into account.

Stabilising the Variance: rlog and VST

DESeq2 provides two transformations that simultaneously:

log-transform the data (compresses the range)
Normalise for library size (using size factors)
Stabilise variance of lowly expressed genes

	Regularised log	Variance Stabilising Transformation
Function	`rlog()`	`vst()`
Speed	Slower	Faster
Best for	Small datasets, unequal library sizes	Large datasets (>30-50 samples)

Before and After Transformation

After transformation, genes across the expression spectrum contribute more “equally” to downstream analyses.

Why Not Just Use TPM or FPKM?

TPM and FPKM correct for gene length and sequencing depth - but have a critical flaw for comparing samples.

Normalising to the total count means a massively up-regulated gene makes all others appear artificially lower.

See:

Evans C, Hardin J, Stoebel DM (2018). Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bbx008
Zhao S, Ye Z, Stanton R (2020). Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols. RNA. https://doi.org/10.1261/rna.074922.120
Zhao Y, Li MC, Konaté M, Chen L, Das B, Karlovich C, Williams PM, Evrard YA, Doroshow JH, McShane LM (2021). TPM, FPKM, or Normalized Counts? A Comparative Study of Quantification Measures for the Analysis of RNA-seq Data from the NCI Patient-Derived Models Repository. Journal of Translational Medicine. https://doi.org/10.1186/s12967-021-02936-w

Fig. 1 in Evans et al. particularly illustrates the problem nicely.

DESeq2’s Solution: Size Factors

DESeq2 calculates size factors using the median-of-ratios method - robust to highly expressed or DE genes.

How it works:

For each gene, compute the geometric mean count across all samples (a “pseudo-reference” sample)
For each sample, divide each gene’s count by its reference value (a ratio per gene per sample)
Take the median of all those ratios for each sample (the median resists the few genes that are truly DE)
That median is the size factor - divide all counts by it

The key assumption: most genes are not differentially expressed, so the median ratio captures genuine library size differences.

Size Factor Example

Gene	Sample A	Sample B	Geometric mean	Ratio A	Ratio B
G1	20	40	28.3	0.71	1.41
G2	40	80	56.6	0.71	1.41
G3	10	20	14.1	0.71	1.41
G4	230	160	191.8	1.20	0.83

Size factor A = median(Ratio A) = 0.71
Size factor B = median(Ratio B) = 1.41

Notice how the highly expressed gene G4 has a different ratio, but it does not affect the size factor.

After dividing each sample’s counts by its size factor, samples are comparable - without assuming total counts should be equal.

When to Use Transformed vs Raw Counts

VST/rlog can be used for:

PCA
Hierarchical clustering
Heatmaps
Other machine learning or multivariate methods

Differential expression uses raw counts.

For cross-sample comparisons avoid TPM/FPKM.

Principal component analysis (PCA)

PCA is an unsupervised dimensionality reduction method - it finds the directions of greatest variation in high-dimensional data.

PCA considers all genes simultaneously to capture major patterns into a small number of axes.

What PCA Can Reveal

A well-behaved experiment:

Replicates cluster tightly together
Groups are well separated along PC1 or PC2
PC axes correspond to the expected biology (e.g. treatment, time point)

Warning signs to look for:

A replicate far from its group - possible outlier or labelling error
Samples group by batch / processing date, not biology - batch effect

Alternative methods

Hierarchical clustering and sample-sample correlation heatmaps are also commonly used to explore sample relationships.

What if samples don’t cluster?

It is not uncommon for samples not to show clear clustering by treatment group in PCA or hierarchical clustering.

This can be due to:

The treatment has a subtle effect, affecting only a small number of genes.
There is high biological variability (e.g. very common in non-controlled experiments, such as human samples).
There are technical issues (e.g. batch effects, sample swaps, outliers) that are obscuring the biological signal.
There is no real biological signal in the data (e.g. the treatment has no effect).