library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.3     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

The list of differentially expressed genes is sometimes so long that its interpretation becomes cumbersome and time consuming. A common downstream procedure is gene set testing. It aims to understand which pathways or gene networks the differentially expressed genes are implicated in.

Various ways exist to test for enrichment of biological pathways.

GSEA analysis

Gene Set Enrichment Analysis (GSEA) identifies gene sets that are related to the difference of interest between samples (Subramanian et al. 2005). A gene set comprises genes that share a biological function, chromosomal location, or any other relevant criterion.

The software is distributed by the Broad Institute and is freely available for use by academic and non-profit organisations. The Broad also provide a number of very well curated gene sets for testing against your data - the Molecular Signatures Database (MSigDB). These are collections of human genes, however. Fortunately, these lists have been translated to mouse equivalents by the Walter+Eliza Hall Institutes Bioinformatics service and made available for download.

The analysis is performed by:

The article describing the original software is available here, while this commentary on GSEA provides a shorter description.

fgsea

The fgsea package (Sergushichev 2016) implements the same algorithm in R vignette “fast preranked gene set enrichment analysis (GSEA)”.

library(fgsea)
load("Robjects/Annotated_Results_LvV.RData")

Ranking Data

We need to provide fgsea a vector containing numeric data by which it should rank the genes. To start with we will simply use a rank based on their fold change. We do not need to rank the genes ourselves, fgsea will do this based on the data we provide.

We need to exclude genes for which we do not have Entrez IDs. Also, we should use the shrunk LFC values.

gseaDat <- filter(shrinkLvV, !is.na(Entrez))

rankData <- gseaDat$logFC
names(rankData) <- gseaDat$Entrez
head(rankData)
##     497097      19888      20671      27395      18777      21399 
##  0.5959404 -1.2525813 -0.5856117  0.4398234  0.6631350 -0.1187075

Load pathways

load("Robjects/mouse_H_v5.RData")
pathwaysH <- Mm.H

Conduct analysis

fgseaRes <- fgsea(pathwaysH, 
                  rankData, 
                  minSize=15, 
                  maxSize = 500, 
                  nperm=1000)
## Warning in fgsea(pathwaysH, rankData, minSize = 15, maxSize = 500, nperm
## = 1000): You are trying to run fgseaSimple. It is recommended to use
## fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument
## in the fgsea function call.
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.09% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

The warning produced indicates that there are few genes that have the same fold change and so are ranked equally. fgsea arbitrarily determines which comes first in the ranked list. As long as this number is small it shouldn’t significantly affect the results. If the number is large something is suspicious about the fold change results.

Lets look at the top 10 results.

fgseaRes %>% 
    arrange(desc(abs(NES))) %>% 
    top_n(10, -padj)
## # A tibble: 15 x 8
##    pathway              pval    padj     ES   NES nMoreExtreme  size leadingEdge
##    <chr>               <dbl>   <dbl>  <dbl> <dbl>        <dbl> <int> <list>     
##  1 HALLMARK_CHOLEST… 0.00202 0.00714  0.609  2.37            0    73 <chr [33]> 
##  2 HALLMARK_OXIDATI… 0.00203 0.00714  0.514  2.33            0   196 <chr [127]>
##  3 HALLMARK_TGF_BET… 0.00196 0.00714 -0.591 -2.13            0    54 <chr [17]> 
##  4 HALLMARK_ADIPOGE… 0.00206 0.00714  0.445  2.02            0   199 <chr [86]> 
##  5 HALLMARK_NOTCH_S… 0.00191 0.00714 -0.605 -1.95            0    31 <chr [15]> 
##  6 HALLMARK_FATTY_A… 0.00209 0.00714  0.438  1.93            0   155 <chr [56]> 
##  7 HALLMARK_MTORC1_… 0.00209 0.00714  0.410  1.86            0   203 <chr [79]> 
##  8 HALLMARK_KRAS_SI… 0.00195 0.00714 -0.415 -1.83            0   190 <chr [58]> 
##  9 HALLMARK_ESTROGE… 0.00196 0.00714 -0.411 -1.81            0   193 <chr [64]> 
## 10 HALLMARK_MYC_TAR… 0.00193 0.00714 -0.493 -1.80            0    58 <chr [27]> 
## 11 HALLMARK_INFLAMM… 0.00195 0.00714 -0.398 -1.75            0   188 <chr [56]> 
## 12 HALLMARK_PEROXIS… 0.00214 0.00714  0.426  1.74            0    95 <chr [35]> 
## 13 HALLMARK_TNFA_SI… 0.00193 0.00714 -0.388 -1.72            0   199 <chr [69]> 
## 14 HALLMARK_COAGULA… 0.00190 0.00714 -0.406 -1.70            0   129 <chr [39]> 
## 15 HALLMARK_HEME_ME… 0.00205 0.00714  0.333  1.49            0   182 <chr [62]>

Enrichment score plot

plotEnrichment(pathwaysH[["HALLMARK_ESTROGEN_RESPONSE_EARLY"]], rankData)

Remember to check the GSEA article for the complete explanation.

GSEA table plot

The function plotGseaTable allows us to plot a summary figure showing the results for multiple pathways.

topPathways <- fgseaRes %>% 
    top_n(20, wt=-padj) %>% 
    arrange(-NES) %>% 
    pull(pathway)

plotGseaTable(pathwaysH[topPathways], 
              rankData, 
              fgseaRes, 
              gseaParam = 0.5)

Challenge 1

Another common way to rank the genes is to order by pvalue, but also, sorting so that upregulated genes are at start and downregulated at the other - you can do this combining the sign of the fold change and the pvalue.
1. Rank the genes by statisical significance - you will need to create a new ranking value using -log10({p value}) * sign({Fold Change})
2. Load the “C2” pathways from the the Robjects/mouse_c2_v5.RData file
3. Run fgsea using the new ranked genes and the C2 pathways
4. Run fgsea using the new ranked genes and the H pathways. How do these results differ from the ones we got when ranking by the fold change alone?

GO enrichment analysis

goseq

GOseq is a method to conduct Gene Ontology (GO) analysis suitable for RNA-seq data as it accounts for the gene length bias in detection of over-representation (Young et al. 2010).

From the GOseq vignette:

  • GOseq first needs to quantify the length bias present in the dataset under consideration.
  • This is done by calculating a Probability Weighting Function or PWF which can be thought of as a function which gives the probability that a gene will be differentially expressed (DE), based on its length alone.
  • The PWF is calculated by fitting a monotonic spline to the binary data series of differential expression (1=DE, 0=Not DE) as a function of gene length.
  • The PWF is used to weight the chance of selecting each gene when forming a null distribution for GO category membership.
  • The fact that the PWF is calculated directly from the dataset under consideration makes this approach robust, only correcting for the length bias present in the data.
##    Genome         Id  Id Description Lengths in geneLeneDataBase
## 1    mm10                                                  FALSE
## 2     mm7  knownGene  Entrez Gene ID                        TRUE
## 3     mm7    ensGene Ensembl gene ID                        TRUE
## 4     mm7 geneSymbol     Gene Symbol                        TRUE
## 5     mm8  knownGene  Entrez Gene ID                        TRUE
## 6     mm8    ensGene Ensembl gene ID                        TRUE
## 7     mm8 geneSymbol     Gene Symbol                        TRUE
## 8     mm9  knownGene  Entrez Gene ID                        TRUE
## 9     mm9    ensGene Ensembl gene ID                        TRUE
## 10    mm9 geneSymbol     Gene Symbol                        TRUE
##    GO Annotation Available
## 1                     TRUE
## 2                     TRUE
## 3                     TRUE
## 4                     TRUE
## 5                     TRUE
## 6                     TRUE
## 7                     TRUE
## 8                     TRUE
## 9                     TRUE
## 10                    TRUE
## # A tibble: 10 x 5
##    Genome Id        `Id Description` `Lengths in geneLeneD… `GO Annotation Avai…
##    <chr>  <chr>     <chr>            <lgl>                  <lgl>               
##  1 mm10   ""        ""               FALSE                  TRUE                
##  2 mm7    "knownGe… "Entrez Gene ID" TRUE                   TRUE                
##  3 mm7    "ensGene" "Ensembl gene I… TRUE                   TRUE                
##  4 mm7    "geneSym… "Gene Symbol"    TRUE                   TRUE                
##  5 mm8    "knownGe… "Entrez Gene ID" TRUE                   TRUE                
##  6 mm8    "ensGene" "Ensembl gene I… TRUE                   TRUE                
##  7 mm8    "geneSym… "Gene Symbol"    TRUE                   TRUE                
##  8 mm9    "knownGe… "Entrez Gene ID" TRUE                   TRUE                
##  9 mm9    "ensGene" "Ensembl gene I… TRUE                   TRUE                
## 10 mm9    "geneSym… "Gene Symbol"    TRUE                   TRUE

Create a list of differentially expressed genes

The input for goseq is a vector that indicates, for each gene, whether or not it is significantly differentially expressed. This should be a named vector, where the names are the gene ids and the values are 1 if the gene is significant and 0 if it is not.

In this case we can use the Ensembl gene IDs.

sigData <- as.integer(!is.na(shrinkLvV$FDR) & shrinkLvV$FDR < 0.01)
names(sigData) <- shrinkLvV$GeneID

Fit the Probability Weighting Function (PWF)

pwf <- nullp(sigData, "mm10", "ensGene", bias.data = shrinkLvV$medianTxLength)
## Warning in pcls(G): initial point very close to some inequality constraints

Conduct GO enrichment analysis

goResults <- goseq(pwf, "mm10","ensGene", test.cats=c("GO:BP"))

Plot the top 10

goResults %>% 
    top_n(10, wt=-over_represented_pvalue) %>% 
    mutate(hitsPerc=numDEInCat*100/numInCat) %>% 
    ggplot(aes(x=hitsPerc, 
               y=term, 
               colour=over_represented_pvalue, 
               size=numDEInCat)) +
        geom_point() +
        expand_limits(x=0) +
        labs(x="Hits (%)", y="GO term", colour="p value", size="Count")

Get the GO information for the GO accessions

library(GO.db)
GOTERM[[goResults$category[1]]]
## GOID: GO:0044281
## Term: small molecule metabolic process
## Ontology: BP
## Definition: The chemical reactions and pathways involving small
##     molecules, any low molecular weight, monomeric, non-encoded
##     molecule.
## Synonym: small molecule metabolism

Challenge 2

  1. Create a vector showing genes that are statistically significant at FDR < 0.01 and that are up-regulated by at least 4x (logFC>2) in lactating mice
  2. Run a goseq analysis on this gene list
  3. Plot the results
  4. How is this result different to the previous GO analysis?

KEGG pathway enrichment analysis

clusterProfiler

We can analyse for enrichment of KEGG pathways in much the same way as for GO terms. We could also use goseq for this, but this time we’re going to use clusterProfiler (Yu et al. 2012). clusterprofiler supports direct online access of the current KEGG database, rather than relying on R annotation packages, it also provides some nice visualisation options.

library(clusterProfiler)
search_kegg_organism('mmu', by='kegg_code')
##    kegg_code scientific_name common_name
## 14       mmu    Mus musculus       mouse

KEGG enrichment analysis

The input for the KEGG enrichment is list of gene IDs for significant genes.

This time we’ll only use genes with an absolute fold change greater than 2.

For this tool we need to use Entrez IDs, so once again we will have to eliminate the missing values.

sigGenes <- shrinkLvV %>% 
    filter(FDR < 0.05 & !is.na(FDR) & 
               abs(logFC) > 1 & 
               !is.na(Entrez)) %>% 
    pull(Entrez)

kk <- enrichKEGG(gene = sigGenes, organism = 'mmu')
head(kk, n=10)
## Reading KEGG annotation online:
## 
## Reading KEGG annotation online:
## # A tibble: 10 x 9
##    ID     Description  GeneRatio BgRatio  pvalue p.adjust  qvalue geneID   Count
##    <chr>  <chr>        <chr>     <chr>     <dbl>    <dbl>   <dbl> <chr>    <int>
##  1 mmu01… Fatty acid … 24/1026   61/8776 2.70e-8  8.45e-6 6.51e-6 11363/7…    24
##  2 mmu00… Steroid bio… 12/1026   20/8776 3.15e-7  4.93e-5 3.80e-5 15490/1…    12
##  3 mmu01… Biosynthesi… 15/1026   32/8776 7.73e-7  8.07e-5 6.22e-5 30963/1…    15
##  4 mmu03… PPAR signal… 27/1026   88/8776 1.39e-6  1.09e-4 8.37e-5 11363/7…    27
##  5 mmu00… Valine, leu… 18/1026   56/8776 3.95e-5  2.07e-3 1.60e-3 227095/…    18
##  6 mmu00… Glycerolipi… 19/1026   61/8776 3.97e-5  2.07e-3 1.60e-3 68393/5…    19
##  7 mmu04… Cell adhesi… 38/1026   172/87… 6.78e-5  2.87e-3 2.21e-3 12487/2…    38
##  8 mmu04… Calcium sig… 41/1026   192/87… 8.05e-5  2.87e-3 2.21e-3 12767/2…    41
##  9 mmu04… Cytokine-cy… 57/1026   296/87… 8.26e-5  2.87e-3 2.21e-3 16178/1…    57
## 10 mmu05… Rheumatoid … 23/1026   86/8776 9.41e-5  2.94e-3 2.27e-3 12487/2…    23

Visualise a pathway

In a browser

clusterProfile has a function browseKegg that allows you to view the Kegg pathway with the genes colours in in your browser.

browseKEGG(kk, 'mmu03320')

As a file

The package pathview (Luo et al. 2013) can be used to generate figures of KEGG pathways.

One advantage over clusterProfiles browser method is that the genes are coloured according to their fold change levels in our data. To do this we need to pass pathview a named vector of fold change data (actually you could colour by any numeric vector, e.g. p-value).

The package plots the KEGG pathway to a png file in the working directory.

mmu03320.pathview.png:

mmu03320 - PPAR signaling pathway

mmu03320 - PPAR signaling pathway

Challenge 3

  1. Use pathview to export a figure for “mmu04060”, but this time only use genes that are statistically significant at FDR < 0.01

References

Luo, Weijun, Brouwer, and Cory. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14): 1830–1. https://doi.org/10.1093/bioinformatics/btt285.

Sergushichev, Alexey. 2016. “An Algorithm for Fast Preranked Gene Set Enrichment Analysis Using Cumulative Statistic Calculation.” bioRxiv. Cold Spring Harbor Labs Journals. https://doi.org/10.1101/060012.

Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43). National Academy of Sciences: 15545–50. https://doi.org/10.1073/pnas.0506580102.

Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia Oshlack. 2010. “Gene Ontology Analysis for Rna-Seq: Accounting for Selection Bias.” Genome Biology 11: R14.

Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “ClusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.