The list of differentially expressed genes is sometimes so long that its interpretation becomes cumbersome and time consuming. It may also be very short while some genes have low p-value yet higher than the given threshold.

A common downstream procedure to combine information across genes is gene set testing. It aims at finding pathways or gene networks the differentially expressed genes play a role in.

Various ways exist to test for enrichment of biological pathways. We will look into over representation and gene set enrichment analysis.

A gene set comprises genes that share a biological function, chromosomal location, or any other relevant criterion.

Over-representation

Method

This method tests whether genes in a pathway are present in a subset of our data more than expected (explanations derived from the clusterProfiler manual).

Genes in the experiment are split in two ways:

annotated to the pathway or not
differentially expressed or not

We can then create a contingency table with:

rows: genes in pathway or not
columns: genes differentially expressed or not

A toy example

Let’s start with a experiment with 20,000 genes of which 100 are differentially expressed, and a pathway with 2000 genes.

contingencyTable <- data.frame(
  diffExpNo=c(1980, 17920),
  diffExpYes=c(20, 80))
row.names(contingencyTable) <- c("pathwayYes", "pathwayNo")

rowSums(contingencyTable)

## pathwayYes  pathwayNo 
##       2000      18000

colSums(contingencyTable)

##  diffExpNo diffExpYes 
##      19900        100

contingencyTable

##            diffExpNo diffExpYes
## pathwayYes      1980         20
## pathwayNo      17920         80

For a given pathway:

N: total number of genes in the background set, e.g. all genes tested - 20,000 in our example. This background set is sometimes referred to as the universe.
M: number of genes within that background that are annotated to the pathway - 2000 in our example
n: number of differentially expressed genes - 100 in our example
k: number of differentially expressed genes that are annotated to the pathway
20 in our example

Significance can then be assessed with a the hypergeometric distribution:

The test above is identical to the one-tailed Fisher’s exact test.

fisher.test(contingencyTable, alternative = "greater")

## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingencyTable
## p-value = 0.9992
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.2874878       Inf
## sample estimates:
## odds ratio 
##  0.4419959

To save time and effort there are a number of packages that make applying this test to a large number of gene sets simpler, and which will import gene lists for testing from various sources.

Today we will use clusterProfiler.

`clusterProfiler`

clusterprofiler (Yu et al. 2012) supports direct online access of the current KEGG database, rather than relying on R annotation packages, it also provides some nice visualisation options (KEGG: Kyoto Encyclopedia of Genes and Genomes).

library(clusterProfiler)
library(tidyverse)

search_kegg_organism('mouse', by='common_name')

##    kegg_code scientific_name common_name
## 14       mmu    Mus musculus       mouse

KEGG enrichment analysis

The input for the KEGG enrichment is list of gene IDs for significant genes.

We now load the R object keeping the outcome of the differential expression analysis for the LvV contrast.

load("Robjects/Annotated_Results_LvV.RData")

We will only use genes that have an adjusted p-value (FDR) of less than 0.05 and an absolute fold change greater than 2. We need to remember to eliminate genes with missing values in the FDR as a result of the independent filtering by DESeq2.

For this tool we need to use Entrez IDs, so we will need to eliminate genes with missing values in ‘Entrez’.

sigGenes <- shrinkLvV %>% 
    drop_na(Entrez, FDR) %>% 
    filter(FDR < 0.05 & abs(logFC) > 1) %>% 
    pull(Entrez)

kk <- enrichKEGG(gene = sigGenes, organism = 'mmu')
head(kk, n=10)

## Reading KEGG annotation online:
## 
## Reading KEGG annotation online:

## # A tibble: 10 x 9
##    ID     Description  GeneRatio BgRatio  pvalue p.adjust  qvalue geneID   Count
##    <chr>  <chr>        <chr>     <chr>     <dbl>    <dbl>   <dbl> <chr>    <int>
##  1 mmu01… Fatty acid … 25/1037   62/8911 6.96e-9  2.19e-6 1.68e-6 11363/7…    25
##  2 mmu00… Steroid bio… 12/1037   20/8911 2.99e-7  4.72e-5 3.62e-5 15490/1…    12
##  3 mmu03… PPAR signal… 27/1037   89/8911 1.63e-6  1.50e-4 1.15e-4 11363/7…    27
##  4 mmu01… Biosynthesi… 15/1037   34/8911 1.91e-6  1.50e-4 1.15e-4 30963/1…    15
##  5 mmu00… Valine, leu… 19/1037   57/8911 1.26e-5  7.94e-4 6.10e-4 227095/…    19
##  6 mmu04… Calcium sig… 44/1037   200/89… 1.85e-5  9.72e-4 7.47e-4 12767/2…    44
##  7 mmu00… Glycerolipi… 19/1037   62/8911 4.80e-5  2.16e-3 1.66e-3 68393/5…    19
##  8 mmu00… Propanoate … 13/1037   34/8911 5.89e-5  2.32e-3 1.78e-3 227095/…    13
##  9 mmu04… Cytokine-cy… 58/1037   302/89… 6.83e-5  2.39e-3 1.84e-3 16178/1…    58
## 10 mmu04… PI3K-Akt si… 66/1037   359/89… 8.85e-5  2.61e-3 2.00e-3 12043/2…    66

Visualise a pathway in a browser

clusterProfile has a function browseKegg that allows you to view the KEGG pathway in in your browser with the genes that are in our gene highlighted.

browseKEGG(kk, 'mmu03320')

Visualise a pathway as a file

The package pathview (Luo et al. 2013) can be used to generate figures of KEGG pathways.

One advantage over the clusterProfiler browser method is that the genes can be coloured according to their fold change levels in our data. To do this we need to pass pathview a named vector of fold change data (actually you could colour by any numeric vector, e.g. p-value).

The package plots the KEGG pathway to a png file in the working directory.

library(pathview)
logFC <- shrinkLvV$logFC
names(logFC) <- shrinkLvV$Entrez
pathview(gene.data = logFC, 
         pathway.id = "mmu03320", 
         species = "mmu", 
         limit = list(gene=5, cpd=1))

mmu03320.pathview.png:

mmu03320 - PPAR signaling pathway

Challenge 1

Use pathview to export a figure for “mmu04060”, but this time only use genes that are statistically significant at FDR < 0.01

Challenge 2 - GO term enrichment analysis

clusterProfiler can also perform over-representation analysis on GO terms. using the commmand enrichGO. Look at the help page for the command enrichGO (?enrichGO) and have a look at the instructions in the clusterProfiler book.

Run the over-representation analysis for GO terms

Use genes that have an adjusted p-value (FDR) of less than 0.01 and an absolute fold change greater than 2.

For this analysis you can use Ensembl IDs rather then Entrez

You’ll need to provide the background (universe) genes, this should be all the genes in our analysis.

The mouse database package is called org.Mm.eg.db. You’ll need to load it using library before running the analysis.

As we are using Ensembl IDs, you’ll need to set the keyType parameter in the enrichGO command to indicate this.

Only test terms in the “Molecular Function” ontology

Use the dotplot function to visualise the results.

GSEA analysis

Gene Set Enrichment Analysis (GSEA) identifies gene sets that are related to the difference of interest between samples (Subramanian et al. 2005).

The software is distributed by the Broad Institute and is freely available for use by academic and non-profit organisations. The Broad also provide a number of very well curated gene sets for testing against your data - the Molecular Signatures Database (MSigDB). These are collections of human genes, however. Fortunately, these lists have been translated to mouse equivalents by the Walter+Eliza Hall Institute Bioinformatics service and made available for download.

Method

The analysis is performed by:

ranking all genes in the data set
identifying the rank positions of all members of the gene set in the ranked data set
calculating an enrichment score (ES) that represents the difference between the observed rankings and that which would be expected assuming a random rank distribution.

The article describing the original software is available here, while this commentary on GSEA provides a shorter description.

We will use the fgsea package (Sergushichev 2016) that implements the same algorithm in R. ‘fgsea’ stands for’ “fast preranked gene set enrichment analysis (GSEA)”.

library(fgsea)

Ranking Data

We need to provide fgsea with a vector containing numeric data by which it should rank the genes. To start with we will simply use a rank based on their fold change. We do not need to rank the genes ourselves, fgsea will do this based on the data we provide.

We must exclude genes for which we do not have Entrez IDs. Also, we should use the shrunk LFC values.

gseaDat <- filter(shrinkLvV, !is.na(Entrez))

rankData <- gseaDat$logFC
names(rankData) <- gseaDat$Entrez
head(rankData)

##     497097      19888      20671      27395      18777      21399 
##  0.5959404 -1.2525813 -0.5856117  0.4398234  0.6631350 -0.1187075

Load pathways

We will load the MSigDB Hallmark gene set. The pathways RData file loads a new object call Mm.H (H for Hallmark gene set), which is a list of gene sets. The names of the vectors in list are the names of gene sets and each vector contains Entrez IDs for the genes in the gene set.

load("Robjects/mouse_H_v5.RData")
head(names(Mm.H))

## [1] "HALLMARK_TNFA_SIGNALING_VIA_NFKB"    "HALLMARK_HYPOXIA"                   
## [3] "HALLMARK_CHOLESTEROL_HOMEOSTASIS"    "HALLMARK_MITOTIC_SPINDLE"           
## [5] "HALLMARK_WNT_BETA_CATENIN_SIGNALING" "HALLMARK_TGF_BETA_SIGNALING"

head(Mm.H[[1]])

## [1] "11303" "11796" "11797" "11839" "11852" "11910"

Conduct analysis

fgseaRes <- fgsea(pathways = Mm.H, 
                  stats = rankData, 
                  minSize = 15, 
                  maxSize = 500)

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.09% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## Warning in fgseaMultilevel(...): For some pathways, in reality P-values are less
## than 1e-10. You can set the `eps` argument to zero for better estimation.

The warning produced indicates that there are few genes that have the same fold change and so are ranked equally. fgsea arbitrarily determines which comes first in the ranked list. As long as this number is small it shouldn’t significantly affect the results. If the number is large something is suspicious about the fold change results.

Lets look at the top 10 results.

fgseaRes %>% 
    arrange(desc(abs(NES))) %>% 
    top_n(10, -padj)

## # A tibble: 10 x 8
##    pathway                  pval     padj log2err     ES   NES  size leadingEdge
##    <chr>                   <dbl>    <dbl>   <dbl>  <dbl> <dbl> <int> <list>     
##  1 HALLMARK_CHOLESTE…   1.32e- 8  3.30e-7   0.748  0.609  2.36    73 <chr [33]> 
##  2 HALLMARK_OXIDATIV…   1.00e-10  5.00e-9  NA      0.514  2.29   196 <chr [127]>
##  3 HALLMARK_TGF_BETA…   7.45e- 6  4.65e-5   0.611 -0.591 -2.09    54 <chr [17]> 
##  4 HALLMARK_ADIPOGEN…   6.45e- 8  1.08e-6   0.705  0.445  1.99   199 <chr [86]> 
##  5 HALLMARK_FATTY_AC…   3.80e- 6  3.57e-5   0.627  0.438  1.91   155 <chr [56]> 
##  6 HALLMARK_MTORC1_S…   1.60e- 6  2.00e-5   0.644  0.410  1.84   203 <chr [79]> 
##  7 HALLMARK_KRAS_SIG…   5.00e- 6  3.57e-5   0.611 -0.415 -1.82   190 <chr [58]> 
##  8 HALLMARK_ESTROGEN…   4.62e- 6  3.57e-5   0.611 -0.411 -1.81   193 <chr [64]> 
##  9 HALLMARK_INFLAMMA…   4.69e- 5  2.58e-4   0.557 -0.398 -1.75   188 <chr [56]> 
## 10 HALLMARK_TNFA_SIG…   5.16e- 5  2.58e-4   0.557 -0.388 -1.71   199 <chr [69]>

Enrichment score plot

plotEnrichment(Mm.H[["HALLMARK_ESTROGEN_RESPONSE_EARLY"]], rankData)

Remember to check the GSEA article for the complete explanation.

Challenge 3

Another common way to rank the genes is to order by pvalue, but also, sorting so that upregulated genes are at start and downregulated at the other - you can do this combining the sign of the fold change and the pvalue.

Rank the genes by statisical significance - you will need to create a new ranking value using -log10({p value}) * sign({Fold Change})

Load the “C2” pathways from the the Robjects/mouse_c2_v5.RData file

Run fgsea using the new ranked genes and the C2 pathways

Run fgsea using the new ranked genes and the H pathways. How do these results differ from the ones we got when ranking by the fold change alone?

References

Luo, Weijun, Brouwer, and Cory. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14): 1830–1. https://doi.org/10.1093/bioinformatics/btt285.

Sergushichev, Alexey. 2016. “An Algorithm for Fast Preranked Gene Set Enrichment Analysis Using Cumulative Statistic Calculation.” bioRxiv. https://doi.org/10.1101/060012.

Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.

Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “ClusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.

RNA-seq analysis in R

Gene Set Testing for RNA-seq

Stephane Ballereau, Dominique-Laurent Couturier, Mark Dunning, Abbi Edwards, Ashley Sawle

Last modified: 20 Nov 2020

Over-representation

Method

A toy example

`clusterProfiler`

KEGG enrichment analysis

Visualise a pathway in a browser

Visualise a pathway as a file

Challenge 1

Challenge 2 - GO term enrichment analysis

GSEA analysis

Method

Ranking Data

Load pathways

Conduct analysis

Enrichment score plot

Challenge 3

References