March 2023
The list of differentially expressed genes is sometimes:
so long that its interpretation becomes cumbersome and time-consuming,
or very short, even though some genes have a low p-value that falls just above the chosen threshold.
There are many approaches to searching for biological meaning in the results of differential expression analysis.
Commonly we assess whether the differentially expressed genes tend to relate to specific pathways or ontological groups of genes.
We will look at two methods:
Over Representation Analysis (ORA)
Gene Set Enrichment Analysis (GSEA)
Common sources of gene sets:
Manually curated gene lists
This method tests whether genes in a specific pathway are present in a subset of genes of interest in our data more than expected.
The genes of interest could be, e.g., the statistically significant genes or a cluster of genes from hierarchical or k-means clustering.
Given the ratio of genes in the pathway to genes not in the pathway, is the number of pathway genes that also fall in our subset higher than we would expect by chance?
Genes in the experiment are split in two ways: by whether they are in the pathway, and by whether they are in our subset of genes of interest.
Contingency table:
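The over-representation test described above can be sketched with a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test on the 2x2 table). This is a minimal illustration using only the Python standard library, not the implementation of any particular enrichment package. The function name `ora_p_value` and all counts are hypothetical and chosen only for the example.

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_interest, n_overlap):
    """One-sided hypergeometric p-value: P(overlap >= n_overlap).

    n_universe: all genes in the experiment
    n_pathway:  genes in the experiment that belong to the pathway
    n_interest: genes in our subset of interest
    n_overlap:  genes that are both in the pathway and in the subset
    """
    max_overlap = min(n_pathway, n_interest)
    total = comb(n_universe, n_interest)
    # Sum the probabilities of seeing n_overlap or more pathway genes
    # when drawing n_interest genes at random from the universe.
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_interest - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only.
n_universe = 20000  # genes tested in the experiment
n_pathway = 100     # of which are in the pathway
n_interest = 500    # genes of interest (e.g. significant genes)
n_overlap = 10      # genes of interest that are in the pathway

# 2x2 contingency table: rows = in subset / not in subset,
# columns = in pathway / not in pathway.
table = [
    [n_overlap, n_interest - n_overlap],
    [n_pathway - n_overlap,
     n_universe - n_pathway - n_interest + n_overlap],
]

p = ora_p_value(n_universe, n_pathway, n_interest, n_overlap)
print(table)
print(f"p = {p:.3g}")
```

With these counts we would expect about 2.5 pathway genes in the subset by chance (500 x 100 / 20000), so an overlap of 10 gives a small p-value. In practice the p-values from many gene sets would also need multiple testing correction.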
This method is based on ranking all genes in our dataset.
If the gene set is significantly affected in our experiment, then the genes in the set should tend to be at one end or the other of our ranking.
The choice of ranking metric is left to the analyst, but p-value and fold change are common choices.
GSEA calculates an enrichment score based on the ranking, and then uses permutation to calculate a p-value for how significant the enrichment score is.
Randomly permute the ranking and recalculate the enrichment score; repeat many times.
From the distribution of the permuted enrichment scores, determine how likely our observed enrichment score (ES) would be by chance.
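The permutation procedure above can be sketched as follows. This is a deliberately simplified version, assuming an unweighted running-sum enrichment score (Kolmogorov-Smirnov-like) and a two-sided comparison of absolute scores; published GSEA implementations weight the running sum by the ranking metric and normalise the score, which is not done here. The function names and the toy gene identifiers are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score over a ranked gene list.

    Walk down the ranking, stepping up for genes in the set and down
    for genes outside it; return the running sum that deviates most
    from zero.
    """
    n_hits = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranking only partly")
    running, best = 0.0, 0.0
    for g in ranked_genes:
        running += 1 / n_hits if g in gene_set else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Permute the ranking, recompute the ES, and compare with the observed ES."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, gene_set)) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Toy data: 100 genes already sorted by some ranking metric,
# with a gene set concentrated at the top of the ranking.
ranked = [f"g{i}" for i in range(100)]
gene_set = {f"g{i}" for i in range(10)}

es, p = permutation_p_value(ranked, gene_set)
print(f"ES = {es:.3f}, p = {p:.4f}")
```

A gene set clustered at either end of the ranking gives a score far from zero that random rankings rarely match, whereas a set scattered evenly through the ranking gives a score close to zero.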
Question: Do the differentially expressed genes tend to relate to specific pathways or ontological groups of genes?
For a given contrast and a given gene set.
Two methods:
Over Representation Analysis (ORA)
Gene Set Enrichment Analysis (GSEA)
Both methods can be applied to a series of gene sets.