The list of differentially expressed genes is sometimes so long that its interpretation becomes cumbersome and time consuming. A common downstream procedure is gene set testing. It aims to understand which pathways or gene networks the differentially expressed genes are implicated in.

Various ways exist to test for enrichment of biological pathways.

GSEA analysis

Gene Set Enrichment Analysis GSEA was tests whether a set of genes of interest, e.g. genes (Subramanian et al. 2005). The software is distributed by the Broad Institute and is freely available for use by academic and non-profit organisations.

In addition to the GSEA software the Broad also provide a number of very well curated gene sets for testing against your data - the Molecular Signatures Database (MSigDB). Unfortunately, these are collections of human genes, however, these lists have been translated to mouse equivalents by the Walter+Eliza Hall Institutes Bioinformatics service and made avaialble for download.

The analysis is performed by:

commentary on GSEA. The article describing the original software is available here.

fgsea

The fgsea package (Sergushichev 2016) implements the same algorithm in R vignette “fast preranked gene set enrichment analysis (GSEA)”.

library(fgsea)
## Loading required package: Rcpp
load("Robjects/Annotated_Results_LvV.RData")

Ranking Data

We need to provide fgsea a vector containing numeric data by which it should rank the genes. To start with we will simply use a rank based on their fold change. We do not need to rank the genes ourselves, fgsea will do this based on the data we provide.

We need to exclude genes for which we do not have Entrez IDs. Also, we should use the shrunk LFC values.

gseaDat <- filter(shrinkLvV, !is.na(Entrez))

rankData <- gseaDat$logFC
names(rankData) <- gseaDat$Entrez
head(rankData)
##     497097      19888      20671      27395      18777      21399 
##  0.5959404 -1.2525813 -0.5856117  0.4398234  0.6631350 -0.1187075

Load pathways

load("Robjects/mouse_H_v5.RData")
pathwaysH <- Mm.H

Conduct analysis

fgseaRes <- fgsea(pathwaysH, 
                  rankData, 
                  minSize=15, 
                  maxSize = 500, 
                  nperm=1000)
## Warning in fgsea(pathwaysH, rankData, minSize = 15, maxSize = 500, nperm = 1000): There are ties in the preranked stats (0.09% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

The warning produced indicates that there are few genes that have the same fold change and so are ranked equally. fgsea with arbitrarily order determine which comes first in the ranked list. As long as this number is small it shouldn’t significantly effect the results. If the number is large something is suspicious about the fold change results.

Lets look at the top 10 results.

fgseaRes %>% 
    arrange(desc(abs(NES))) %>% 
    top_n(10, -padj)
## # A tibble: 14 x 8
##    pathway         pval    padj     ES   NES nMoreExtreme  size leadingEdge
##    <chr>          <dbl>   <dbl>  <dbl> <dbl>        <dbl> <int> <list>     
##  1 HALLMARK_CH… 0.00220 0.00787  0.609  2.31            0    73 <chr [33]> 
##  2 HALLMARK_OX… 0.00204 0.00787  0.514  2.30            0   196 <chr [127]>
##  3 HALLMARK_TG… 0.00189 0.00787 -0.591 -2.09            0    54 <chr [17]> 
##  4 HALLMARK_AD… 0.00206 0.00787  0.445  1.99            0   199 <chr [86]> 
##  5 HALLMARK_FA… 0.00203 0.00787  0.438  1.91            0   155 <chr [56]> 
##  6 HALLMARK_MT… 0.00207 0.00787  0.410  1.84            0   203 <chr [79]> 
##  7 HALLMARK_KR… 0.00196 0.00787 -0.415 -1.80            0   190 <chr [58]> 
##  8 HALLMARK_ES… 0.00194 0.00787 -0.411 -1.79            0   193 <chr [64]> 
##  9 HALLMARK_IN… 0.00196 0.00787 -0.398 -1.72            0   188 <chr [56]> 
## 10 HALLMARK_PE… 0.00215 0.00787  0.426  1.70            0    95 <chr [35]> 
## 11 HALLMARK_TN… 0.00194 0.00787 -0.388 -1.69            0   199 <chr [69]> 
## 12 HALLMARK_CO… 0.00190 0.00787 -0.406 -1.67            0   129 <chr [39]> 
## 13 HALLMARK_MY… 0.00194 0.00787 -0.364 -1.59            0   201 <chr [69]> 
## 14 HALLMARK_ES… 0.00195 0.00787 -0.349 -1.52            0   197 <chr [52]>

Enrichment score plot

plotEnrichment(pathwaysH[["HALLMARK_ESTROGEN_RESPONSE_EARLY"]], rankData)

Remember to check the GSEA article for the complete explanation.

GSEA table plot

The function plotGseaTable allows us to plot a summary figue showing the results for multiple pathways.

topPathways <- fgseaRes %>% 
    top_n(20, wt=-padj) %>% 
    arrange(-NES) %>% 
    pull(pathway)

plotGseaTable(pathwaysH[topPathways], 
              rankData, 
              fgseaRes, 
              gseaParam = 0.5)