The list of differentially expressed genes is sometimes so long that its interpretation becomes cumbersome and time consuming. A common downstream procedure is gene set testing. It aims to understand which pathways or gene networks the differentially expressed genes are implicated in.

Various ways exist to test for enrichment of biological pathways.

GSEA analysis

Gene Set Enrichment Analysis GSEA was tests whether a set of genes of interest, e.g. genes (Subramanian et al. 2005). The software is distributed by the Broad Institute and is freely available for use by academic and non-profit organisations.

In addition to the GSEA software the Broad also provide a number of very well curated gene sets for testing against your data - the Molecular Signatures Database (MSigDB). Unfortunately, these are collections of human genes, however, these lists have been translated to mouse equivalents by the Walter+Eliza Hall Institutes Bioinformatics service and made avaialble for download.

The analysis is performed by:

commentary on GSEA. The article describing the original software is available here.

fgsea

The fgsea package (Sergushichev 2016) implements the same algorithm in R vignette “fast preranked gene set enrichment analysis (GSEA)”.

library(fgsea)
Loading required package: Rcpp
load("Robjects/Annotated_Results_LvV.RData")

Create ranks

Rank all genes based on their fold change. We need to exclude genes for which we do not have Entrez IDs. Also, we should use the shrunk LFC values.

gseaDat <- filter(shrinkLvV, !is.na(Entrez))
ranks <- gseaDat$logFC
names(ranks) <- gseaDat$Entrez
head(ranks)
    497097      19888      20671      27395      18777      21399 
 0.5851914 -1.0498457 -0.5893634  0.4313540  0.6478714 -0.1110589 

Plot the ranked fold changes.

barplot(sort(ranks, decreasing = T))

Load pathways

load("Robjects/mouse_H_v5.RData")
pathwaysH <- Mm.H

Conduct analysis

fgseaRes <- fgsea(pathwaysH, ranks, minSize=15, maxSize = 500, nperm=1000)
There are ties in the preranked stats (0.05% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

The warning produced indicates that there are few genes that have the same fold change and so are ranked equally. fgsea with arbitrarily order determine which comes first in the ranked list. As long as this number is small it shouldn’t significantly effect the results. If the number is large something is suspicious about the fold change results.

Lets look at the top 10 results.

head(fgseaRes[order(padj, -abs(NES)), ], n=10)

Enrichment score plot

plotEnrichment(pathwaysH[["HALLMARK_ESTROGEN_RESPONSE_EARLY"]], ranks)

Remember to check the GSEA article for the complete explanation.

GSEA table plot

The function plotGseaTable allows us to plot a summary figue showing the results for multiple pathways.

topUp <- fgseaRes %>% 
    filter(ES > 0) %>% 
    top_n(10, wt=-padj)
topDown <- fgseaRes %>% 
    filter(ES < 0) %>% 
    top_n(10, wt=-padj)
topPathways <- bind_rows(topUp, topDown) %>% 
    arrange(-ES)
plotGseaTable(pathwaysH[topPathways$pathway], 
              ranks, 
              fgseaRes, 
              gseaParam = 0.5)