July 2020
There are many approaches to searching for biological meaning in the results of differential expression analysis.
Commonly we look to see if the differentially expressed genes tend to relate to specific pathways or ontological groups of genes.
We will look at two methods of doing this:
Over Representation Analysis (ORA)
Gene Set Enrichment Analysis (GSEA)
Common sources of Gene Sets:
KEGG pathways
Gene Ontologies
Reactome
MSigDB (GSEA)
Manually curated gene lists
Over Representation Analysis tests whether genes in a specific pathway are present in a subset of genes of interest in our data more often than expected by chance.
The genes of interest could be e.g. statistically significant genes or a cluster of genes from hierarchical or k-means clustering.
The question is: given the ratio of genes in the pathway to genes not in the pathway, is the number of pathway genes that also appear in our subset larger than we would expect by chance?
Genes in the experiment are split in two ways: by whether they are in the pathway, and by whether they are in our subset of genes of interest.
Contingency table:
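As a minimal sketch of the test behind this table, the Python below builds the 2x2 counts and computes a one-sided hypergeometric p-value (equivalent to a one-sided Fisher's exact test) using only the standard library. The function names and the example counts are made up for illustration; they do not come from any particular package or dataset.

```python
from math import comb


def contingency_table(n_universe, n_pathway, n_subset, n_overlap):
    """Return the 2x2 table as [[in subset], [not in subset]],
    with columns (in pathway, not in pathway)."""
    return [
        [n_overlap, n_subset - n_overlap],
        [n_pathway - n_overlap,
         n_universe - n_pathway - n_subset + n_overlap],
    ]


def ora_p_value(n_universe, n_pathway, n_subset, n_overlap):
    """Probability of seeing at least n_overlap pathway genes in a subset
    of n_subset genes drawn at random from n_universe genes."""
    max_overlap = min(n_pathway, n_subset)
    total = comb(n_universe, n_subset)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_subset - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20 genes measured, 5 in the pathway,
# 4 genes of interest, 3 of which are in the pathway.
table = contingency_table(20, 5, 4, 3)
p_value = ora_p_value(20, 5, 4, 3)
print(table)
print(p_value)
```

In practice this test is repeated for every gene set, so the resulting p-values would normally be adjusted for multiple testing.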
Gene Set Enrichment Analysis is based on a ranking of all the genes in our dataset.
If the gene set is significantly affected in our experiment, then the genes in the set should tend to be at one end or the other of our ranking.
The choice of ranking statistic is left to the analyst, but p-value and fold change are common choices.
GSEA calculates an enrichment score based on the ranking, and then uses permutation to calculate a p-value for how significant the enrichment score is.
Randomly permute the ranking and recalculate the Enrichment Score.
From the distribution of permuted Enrichment Scores, determine how likely it is to see a score as extreme as our observed ES by chance.
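The permutation steps can be sketched in Python as follows. This is a simplified illustration, not the published GSEA algorithm: it uses an unweighted running-sum score (the published method weights hits by the ranking statistic and normalises the score), and it shuffles the gene order to build the null distribution. All function names and the toy gene list are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list, stepping up at genes in the set and
    down otherwise; return the largest deviation from zero."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Estimate how often a shuffled ranking gives a score at least as
    extreme as the observed one."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, gene_set)) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Toy example: ten genes ranked g1..g10, with the set at the top.
ranked = [f"g{i}" for i in range(1, 11)]
es, p_value = permutation_p_value(ranked, {"g1", "g2", "g3"})
print(es, p_value)
```

A gene set concentrated at either end of the ranking gives a large absolute score, which few shuffled rankings match, so its estimated p-value is small.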