Introduction to Gene Set Testing in R

January 2026

Differential Gene Expression Analysis Workflow

Gene Set Testing - Overview

The list of differentially expressed genes is sometimes:

so long that its interpretation becomes cumbersome and time consuming,
or very short, while some genes have a low p-value that still falls above the chosen threshold.

There are many approaches to searching for biological meaning in the results of differential expression analysis.

Commonly we assess whether the differentially expressed genes (as a set) tend to relate to specific pathways or ontological groups of genes.

We will look at two methods:

Over Representation Analysis (ORA)
Gene Set Enrichment Analysis (GSEA)

Gene Set Testing - Resources

Common sources of gene sets:

KEGG pathways
Gene Ontologies
Reactome
Molecular Signature Database, MSigDB (GSEA)
Manually curated gene lists

Over Representation Analysis - Method

This method tests whether genes in a specific pathway are present in a subset of genes of interest in our data more often than expected by chance.
The genes of interest could be e.g. statistically significant differentially expressed genes or a cluster of genes from hierarchical or k-means clustering.
We calculate the overlap between our genes of interest and the genes in the pathway and assess whether an intersection of that size could be due to chance, using Fisher’s exact test (also known as the hypergeometric test).

Over Representation Analysis - Example

Genes in the experiment (the gene universe or background) are split in two ways:

annotated to the pathway or not
differentially expressed or not

Contingency table:

Using the hypergeometric/Fisher’s exact test, calculate the probability of seeing an overlap at least as large as the one in this contingency table, given the gene universe size (=20,000), the number of genes of interest (=100) and the number of genes in the pathway (=320).

For details see this blog post.
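As a rough illustration of the test above, the following sketch computes the upper tail of the hypergeometric distribution in plain Python (standard library only). The universe, pathway and gene-of-interest counts come from the example; the overlap of 10 genes is a made-up value, and the function name `ora_p_value` is hypothetical.

```python
from math import comb

def ora_p_value(universe, pathway, interest, overlap):
    """P(X >= overlap) when `interest` genes are drawn at random from `universe`,
    of which `pathway` are annotated to the pathway (hypergeometric upper tail)."""
    max_overlap = min(pathway, interest)
    # Count the ways to draw each possible overlap i >= the observed overlap.
    favourable = sum(
        comb(pathway, i) * comb(universe - pathway, interest - i)
        for i in range(overlap, max_overlap + 1)
    )
    return favourable / comb(universe, interest)

# Counts from the example; the overlap of 10 is an assumed, illustrative value.
universe, pathway, interest = 20_000, 320, 100
hypothetical_overlap = 10

expected_overlap = interest * pathway / universe  # overlap expected by chance
p = ora_p_value(universe, pathway, interest, hypothetical_overlap)
print(f"expected overlap: {expected_overlap:.1f}, p-value: {p:.2e}")
```

With these counts only 1.6 pathway genes are expected among the genes of interest by chance, so a much larger overlap yields a small p-value.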

Gene Set Enrichment Analysis (GSEA)

First proposed by Subramanian et al. in 2005 (original publication here).
This method ranks all our genes in the experiment by their association with the studied phenotype/treatment, thus not restricting the analysis to significant genes only (with p-value < \(\alpha\)).
The genes in a pathway are then located within this ranking, which shows how strongly each of them is associated with the phenotype/treatment.
Based on the ranking and where the pathway genes are within that ranking, an enrichment score and a corresponding p-value are computed to assess whether the pathway is positively or negatively affected in the experiment.

Gene Set Enrichment Analysis (GSEA) - Method

First rank all genes in our dataset by their decreasing association with the studied phenotype/treatment.

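The ranking metric varies, but the p-value and the signed log-fold change are common choices, as is a combination of the two (e.g. the sign of the log-fold change multiplied by \(-\log_{10}\) of the p-value).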

Example: ranking by decreasing signed log-fold change (\(logFC\)).
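The ranking step can be sketched in a few lines of plain Python. The gene names, log-fold changes and p-values below are invented for illustration, and the second metric (sign of the log-fold change times \(-\log_{10}\) of the p-value) is one possible way of combining the two, not necessarily the one used in a given analysis.

```python
import math

# Hypothetical differential expression results: gene -> (log2 fold change, p-value).
results = {
    "geneA": (2.1, 0.001),
    "geneB": (-1.8, 0.0005),
    "geneC": (0.3, 0.40),
    "geneD": (-0.2, 0.75),
    "geneE": (1.2, 0.0001),
}

# Option 1: rank by decreasing signed log-fold change.
by_logfc = sorted(results, key=lambda g: results[g][0], reverse=True)

# Option 2: rank by sign(logFC) * -log10(p-value), combining direction and significance.
def signed_significance(gene):
    logfc, p_value = results[gene]
    return math.copysign(1.0, logfc) * -math.log10(p_value)

by_signed_p = sorted(results, key=signed_significance, reverse=True)

print(by_logfc)
print(by_signed_p)
```

The two rankings need not agree: here geneE has a smaller fold change than geneA but a smaller p-value, so it moves to the top under the second metric.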

Gene Set Enrichment Analysis (GSEA) - Method

Identify the position of genes in the pathway in the ranking:

if the pathway genes are up-regulated then they appear at the top (left) of the ranking
if they are down-regulated, they appear at the bottom (right).

Gene Set Enrichment Analysis (GSEA) - Method

GSEA calculates an Enrichment Score based on the ranking:

The ES starts at 0
Walk through the ranking and for each gene:
- Increase the ES by 1 step if the gene is in the pathway.
- Decrease the ES by 1 step if not.

Gene Set Enrichment Analysis (GSEA) - Method

The step size for a gene in the pathway is proportional to its \(logFC\) (or whichever ranking metric was used), as we want to capture not only whether the pathway gene is at the top/bottom but also to what extent it is affected. The gene’s step size is normalized by the sum of the absolute values of the metric over all genes in the pathway, to account for the pathway size and the scale of the metric used for ranking.
The step size for genes not in the pathway is the same decrease for all, normalized by the number of genes outside the pathway.

Gene Set Enrichment Analysis (GSEA) - Method

The final enrichment score for the pathway is the value of the running sum that deviates furthest from zero.
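The running sum described above can be sketched as follows. This is a minimal illustration in plain Python, not the reference GSEA implementation: the function name `enrichment_score` and the ten example genes are hypothetical, and the sketch assumes the pathway has at least one gene in the ranking with a non-zero metric and that at least one gene lies outside the pathway.

```python
def enrichment_score(ranked_genes, metric, pathway):
    """Running-sum enrichment score for `pathway` over `ranked_genes`.

    ranked_genes: gene names sorted by decreasing ranking metric
    metric:       dict gene -> ranking metric (e.g. logFC)
    pathway:      set of gene names in the gene set
    """
    hits = [g for g in ranked_genes if g in pathway]
    n_miss = len(ranked_genes) - len(hits)
    hit_total = sum(abs(metric[g]) for g in hits)  # normalizes the upward steps
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        if gene in pathway:
            running += abs(metric[gene]) / hit_total  # step up, weighted by the metric
        else:
            running -= 1.0 / n_miss                   # equal step down for other genes
        if abs(running) > abs(best):
            best = running                            # furthest deviation from zero so far
    return best

# Hypothetical ranking of ten genes by decreasing logFC.
logfc = [2.5, 2.0, 1.5, 1.0, 0.5, -0.5, -1.0, -1.5, -2.0, -2.5]
ranked = [f"g{i}" for i in range(1, 11)]
metric = dict(zip(ranked, logfc))

es_top = enrichment_score(ranked, metric, {"g1", "g2", "g3"})      # set at the top
es_bottom = enrichment_score(ranked, metric, {"g8", "g9", "g10"})  # set at the bottom
print(es_top, es_bottom)
```

A gene set concentrated at the top of the ranking gives a positive score, while one concentrated at the bottom gives a negative score.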

GSEA - Calculate the enrichment score

A different gene set enriched at the bottom of the gene ranking:

GSEA - Calculate the enrichment score

A random gene set:

Gene Set Enrichment Analysis (GSEA) - Method

The p-value of the enrichment score is computed by either:

permuting the sample phenotype/treatment labels to obtain new gene logFCs
drawing random gene sets of the same size as the pathway

In each case compute new enrichment scores. Do this many times to generate a “null” distribution, i.e. a distribution of enrichment scores for situations that have no biological meaning.

The p-value is the proportion of such null scores that are at least as extreme as the observed score (equal to or greater than it for a positive score).
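The second option (drawing random gene sets of the same size) can be sketched as below, again in plain Python. All names and data are hypothetical, the score function is a simplified version of the running sum described earlier, and counting only null scores in the same direction as the observed score is an assumption of this sketch.

```python
import random

def enrichment_score(ranked_genes, metric, pathway):
    """Simplified running-sum score: weighted steps up for pathway genes, equal steps down otherwise."""
    hit_total = sum(abs(metric[g]) for g in ranked_genes if g in pathway)
    n_miss = sum(1 for g in ranked_genes if g not in pathway)
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += abs(metric[gene]) / hit_total if gene in pathway else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def gene_set_p_value(ranked_genes, metric, pathway, n_perm=1000, seed=1):
    """Empirical p-value from random gene sets of the same size as the pathway."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = enrichment_score(ranked_genes, metric, pathway)
    size = sum(1 for g in ranked_genes if g in pathway)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        null_score = enrichment_score(ranked_genes, metric, random_set)
        # Count null scores at least as extreme as the observed one, in the same direction.
        if (observed >= 0 and null_score >= observed) or (observed < 0 and null_score <= observed):
            extreme += 1
    return observed, extreme / n_perm

# Hypothetical data: 100 genes with a metric decreasing from 2.0 to -2.0.
ranked = [f"g{i}" for i in range(100)]
metric = {g: 2.0 - 4.0 * i / 99 for i, g in enumerate(ranked)}
pathway = set(ranked[:10])  # a made-up pathway sitting at the top of the ranking

observed, p_value = gene_set_p_value(ranked, metric, pathway)
print(observed, p_value)
```

With a finite number of random sets, an empirical p-value of 0 only means that no null score reached the observed one, i.e. the p-value is below 1/`n_perm`.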

Recap

Question: Do the differentially expressed genes tend to relate to specific pathways or ontological groups of genes?

For a given contrast and a given gene set.

Two methods:

Over Representation Analysis (ORA)
- split genes: in pathway or not, of interest or not
- Fisher’s exact test for the ratio of ‘pathway’ odds between the two ‘interest’ classes
Gene Set Enrichment Analysis (GSEA)
- rank all genes using significance and/or log2FoldChange
- compute enrichment score
- compute its significance

Both methods can be applied to a whole series of gene sets.