January 2026
The list of differentially expressed genes is sometimes:
so long that interpreting it becomes cumbersome and time-consuming,
or very short, while some genes have a low p-value that is still above the chosen threshold.
There are many approaches to searching for biological meaning in the results of differential expression analysis.
Commonly we assess whether the differentially expressed genes (as a set) tend to relate to specific pathways or ontological groups of genes.
We will look at two methods:
Over Representation Analysis (ORA)
Gene Set Enrichment Analysis (GSEA)
Common sources of gene sets:
Manually curated gene lists
ORA tests whether genes in a specific pathway are present in a subset of genes of interest in our data more often than expected by chance.
The genes of interest could be e.g. statistically significant differentially expressed genes or a cluster of genes from hierarchical or k-means clustering.
We calculate the overlap between our genes of interest and the genes in the pathway, then use Fisher’s exact test (aka hypergeometric test) to assess whether an intersection of that size could be due to chance.
Genes in the experiment (the gene universe or background) are split in two ways:
Contingency table:
Calculate the probability of seeing such a contingency table, or one with an even larger overlap, given the gene universe size (=20,000), the number of genes of interest (=100) and the number of genes in the pathway (=320), using the hypergeometric/Fisher’s exact test.
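The calculation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `ora_p_value` is made up for this example, and the overlap of 10 genes is a hypothetical value, since the actual count from the contingency table is not given here. It computes the one-sided probability of an overlap at least as large as the observed one.

```python
from math import comb


def ora_p_value(universe: int, pathway: int, interest: int, overlap: int) -> float:
    """One-sided hypergeometric p-value: P(overlap >= observed).

    universe: number of genes in the background
    pathway:  number of background genes annotated to the pathway
    interest: number of genes of interest
    overlap:  observed number of genes of interest that are in the pathway
    """
    max_overlap = min(pathway, interest)
    # Count the ways to draw `interest` genes with i of them in the pathway,
    # for every i at least as large as the observed overlap.
    favourable = sum(
        comb(pathway, i) * comb(universe - pathway, interest - i)
        for i in range(overlap, max_overlap + 1)
    )
    return favourable / comb(universe, interest)


universe, interest, pathway = 20_000, 100, 320  # sizes quoted in the text
overlap = 10  # hypothetical observed overlap, for illustration only

expected = interest * pathway / universe  # overlap expected by chance
p_value = ora_p_value(universe, pathway, interest, overlap)
print(f"expected overlap: {expected:.1f}, observed: {overlap}, p-value: {p_value:.2e}")
```

With these sizes only 1.6 pathway genes are expected among the genes of interest by chance, so an overlap of 10 would give a very small p-value.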
For details see this blog post.
GSEA was first proposed by Subramanian et al. in 2005 (original publication here).
This method ranks all our genes in the experiment by their association with the studied phenotype/treatment, thus not restricting the analysis to significant genes only (with p-value < \(\alpha\)).
The genes in a pathway are then located in this ranking, which shows how strongly they are associated with the phenotype/treatment.
Based on the ranking and where the pathway genes are within that ranking, an enrichment score and a corresponding p-value are computed to assess whether the pathway is positively or negatively affected in the experiment.
First rank all genes in our dataset by their decreasing association with the studied phenotype/treatment.
Example: ranking by decreasing signed log-fold change (\(logFC\)).
Identify the position of genes in the pathway in the ranking:
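As a small illustration of the ranking and of locating the pathway genes in it, here is a sketch in Python. The gene names, log-fold changes and pathway membership are all hypothetical values chosen for the example.

```python
# Hypothetical signed log-fold changes for six genes (illustrative values only).
log_fc = {
    "geneA": 2.1,
    "geneB": -1.7,
    "geneC": 0.4,
    "geneD": -0.2,
    "geneE": 1.3,
    "geneF": -2.5,
}

# Hypothetical pathway membership.
pathway = {"geneA", "geneD", "geneE"}

# Rank all genes by decreasing signed logFC.
ranked = sorted(log_fc, key=log_fc.get, reverse=True)

# 1-based positions of the pathway genes within the ranking.
positions = [rank for rank, gene in enumerate(ranked, start=1) if gene in pathway]

print(ranked)
print(positions)
```

Here two of the three pathway genes sit at the very top of the ranking, which is the kind of pattern the enrichment score is designed to pick up.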
GSEA calculates an Enrichment Score based on the ranking:
The step size for a gene in the pathway is proportional to the absolute value of its \(logFC\) (or whatever the ranking metric was), as we want to capture not only whether the pathway gene is at the top/bottom but also to what extent it is affected. The gene’s step size is normalized by the sum of the absolute metric over all genes in the pathway, to account for the pathway size and the scale of the metric used for ranking.
Genes not in the pathway all contribute the same decrease, normalized by the number of such genes.
The final enrichment score for the pathway is the value of the running sum that deviates the furthest from zero.
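The running sum described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference GSEA implementation: the function name `enrichment_score` and the toy data are hypothetical, and the metric enters in absolute value with weight 1.

```python
def enrichment_score(ranked_genes, metric, pathway):
    """Return (enrichment score, running sum) for one pathway.

    ranked_genes: gene names sorted by decreasing ranking metric
    metric:       dict mapping gene name to its ranking metric (e.g. logFC)
    pathway:      set of gene names in the pathway
    """
    hit_total = sum(abs(metric[g]) for g in ranked_genes if g in pathway)
    n_miss = sum(g not in pathway for g in ranked_genes)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need at least one pathway gene with non-zero metric "
                         "and at least one gene outside the pathway")

    running, walk = 0.0, []
    for gene in ranked_genes:
        if gene in pathway:
            # Step up, proportional to the gene's absolute metric.
            running += abs(metric[gene]) / hit_total
        else:
            # Step down by the same amount for every non-pathway gene.
            running -= 1.0 / n_miss
        walk.append(running)

    # The score is the running-sum value furthest from zero (sign kept).
    return max(walk, key=abs), walk


# Hypothetical data, for illustration only.
log_fc = {"geneA": 2.1, "geneB": -1.7, "geneC": 0.4,
          "geneD": -0.2, "geneE": 1.3, "geneF": -2.5}
pathway = {"geneA", "geneD", "geneE"}
ranked = sorted(log_fc, key=log_fc.get, reverse=True)

es, walk = enrichment_score(ranked, log_fc, pathway)
print(f"enrichment score: {es:.3f}")
```

Because the upward steps sum to 1 and the downward steps sum to 1, the running sum always returns to zero at the end of the ranking; only its largest excursion matters.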
A different gene set enriched at the bottom of the gene ranking:
A random gene set:
The p-value of the enrichment score is computed by either:
In each case compute new enrichment scores. Do this many times to generate a “null” distribution, i.e. a distribution of enrichment scores for situations that have no biological meaning.
The p-value is the proportion of such scores that are equal to or greater than the observed score (or equal to or smaller, when the observed score is negative).
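A permutation p-value of this kind can be sketched as follows. The sketch assumes one particular way of building the null distribution, namely drawing random gene sets of the same size as the pathway; other permutation schemes are handled the same way once the null scores are in hand. The function names and the toy data are hypothetical, and the score function is the simplified running sum with the metric in absolute value.

```python
import random


def enrichment_score(ranked_genes, metric, pathway):
    """Return the running-sum value that deviates furthest from zero."""
    hit_total = sum(abs(metric[g]) for g in ranked_genes if g in pathway)
    miss_step = 1.0 / sum(g not in pathway for g in ranked_genes)
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += abs(metric[gene]) / hit_total if gene in pathway else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, metric, pathway, n_perm=1000, seed=0):
    """Empirical p-value from random gene sets of the same size (an assumption)."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, metric, pathway)
    null_scores = [
        enrichment_score(ranked_genes, metric,
                         set(rng.sample(ranked_genes, len(pathway))))
        for _ in range(n_perm)
    ]
    if observed >= 0:
        extreme = sum(score >= observed for score in null_scores)
    else:
        extreme = sum(score <= observed for score in null_scores)
    return observed, extreme / n_perm


# Hypothetical data: 200 genes with evenly spaced metrics from 3.0 down to -3.0.
genes = [f"gene{i}" for i in range(200)]
metric = {g: 3.0 - 6.0 * i / 199 for i, g in enumerate(genes)}
ranked = sorted(genes, key=metric.get, reverse=True)

# Hypothetical pathway concentrated near the top of the ranking.
pathway = set(ranked[0:30:2])

observed, p_value = permutation_p_value(ranked, metric, pathway)
print(f"observed ES: {observed:.3f}, permutation p-value: {p_value:.3f}")
```

The resolution of the p-value is limited by the number of permutations: with 1000 permutations the smallest non-zero value is 0.001.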
Question: Do the differentially expressed genes tend to relate to specific pathways or ontological groups of genes?
For a given contrast and a given gene set.
Two methods:
Over Representation Analysis (ORA)
Gene Set Enrichment Analysis (GSEA)
Both methods are applicable to a series of gene sets.