The list of differentially expressed genes is sometimes so long that its interpretation becomes cumbersome and time consuming. It may also be very short while some genes have low p-value yet higher than the given threshold.
A common downstream procedure to combine information across genes is gene set testing. It aims at finding pathways or gene networks the differentially expressed genes play a role in.
Various ways exist to test for enrichment of biological pathways. We will look into over representation and gene set enrichment analysis.
A gene set comprises genes that share a biological function, chromosomal location, or any other relevant criterion.
This method tests whether genes in a pathway are present in a subset of our data more than expected (explanations derived from the clusterProfiler manual).
Genes in the experiment are split in two ways:
We can then create a contingency table with:
Let’s start with a experiment with 20,000 genes of which 100 are differentially expressed, and a pathway with 2000 genes.
contingencyTable <- data.frame(
diffExpNo=c(1980, 17920),
diffExpYes=c(20, 80))
row.names(contingencyTable) <- c("pathwayYes", "pathwayNo")
rowSums(contingencyTable)
## pathwayYes pathwayNo
## 2000 18000
colSums(contingencyTable)
## diffExpNo diffExpYes
## 19900 100
contingencyTable
## diffExpNo diffExpYes
## pathwayYes 1980 20
## pathwayNo 17920 80
For a given pathway:
Significance can then be assessed with a the hypergeometric distribution:
The test above is identical to the one-tailed Fisher’s exact test.
fisher.test(contingencyTable, alternative = "greater")
##
## Fisher's Exact Test for Count Data
##
## data: contingencyTable
## p-value = 0.9992
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.2874878 Inf
## sample estimates:
## odds ratio
## 0.4419959
To save time and effort there are a number of packages that make applying this test to a large number of gene sets simpler, and which will import gene lists for testing from various sources.
Today we will use clusterProfiler
.
clusterProfiler
clusterprofiler
(Yu et al. 2012) supports direct online access of the current KEGG database, rather than relying on R annotation packages, it also provides some nice visualisation options (KEGG: Kyoto Encyclopedia of Genes and Genomes).
library(clusterProfiler)
library(tidyverse)
search_kegg_organism('mouse', by='common_name')
## kegg_code scientific_name common_name
## 14 mmu Mus musculus mouse
The input for the KEGG enrichment is list of gene IDs for significant genes.
We now load the R object keeping the outcome of the differential expression analysis for the LvV contrast.
load("Robjects/Annotated_Results_LvV.RData")
We will only use genes that have an adjusted p-value (FDR) of less than 0.05 and an absolute fold change greater than 2. We need to remember to eliminate genes with missing values in the FDR as a result of the independent filtering by DESeq2.
For this tool we need to use Entrez IDs, so we will need to eliminate genes with missing values in ‘Entrez’.
sigGenes <- shrinkLvV %>%
drop_na(Entrez, FDR) %>%
filter(FDR < 0.05 & abs(logFC) > 1) %>%
pull(Entrez)
kk <- enrichKEGG(gene = sigGenes, organism = 'mmu')
head(kk, n=10)
## Reading KEGG annotation online:
##
## Reading KEGG annotation online:
## # A tibble: 10 x 9
## ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <int>
## 1 mmu01… Fatty acid … 25/1037 62/8911 6.96e-9 2.19e-6 1.68e-6 11363/7… 25
## 2 mmu00… Steroid bio… 12/1037 20/8911 2.99e-7 4.72e-5 3.62e-5 15490/1… 12
## 3 mmu03… PPAR signal… 27/1037 89/8911 1.63e-6 1.50e-4 1.15e-4 11363/7… 27
## 4 mmu01… Biosynthesi… 15/1037 34/8911 1.91e-6 1.50e-4 1.15e-4 30963/1… 15
## 5 mmu00… Valine, leu… 19/1037 57/8911 1.26e-5 7.94e-4 6.10e-4 227095/… 19
## 6 mmu04… Calcium sig… 44/1037 200/89… 1.85e-5 9.72e-4 7.47e-4 12767/2… 44
## 7 mmu00… Glycerolipi… 19/1037 62/8911 4.80e-5 2.16e-3 1.66e-3 68393/5… 19
## 8 mmu00… Propanoate … 13/1037 34/8911 5.89e-5 2.32e-3 1.78e-3 227095/… 13
## 9 mmu04… Cytokine-cy… 58/1037 302/89… 6.83e-5 2.39e-3 1.84e-3 16178/1… 58
## 10 mmu04… PI3K-Akt si… 66/1037 359/89… 8.85e-5 2.61e-3 2.00e-3 12043/2… 66
clusterProfile
has a function browseKegg
that allows you to view the KEGG pathway in in your browser with the genes that are in our gene highlighted.
browseKEGG(kk, 'mmu03320')
The package pathview
(Luo et al. 2013) can be used to generate figures of KEGG pathways.
One advantage over the clusterProfiler
browser method is that the genes can be coloured according to their fold change levels in our data. To do this we need to pass pathview
a named vector of fold change data (actually you could colour by any numeric vector, e.g. p-value).
The package plots the KEGG pathway to a png
file in the working directory.
library(pathview)
logFC <- shrinkLvV$logFC
names(logFC) <- shrinkLvV$Entrez
pathview(gene.data = logFC,
pathway.id = "mmu03320",
species = "mmu",
limit = list(gene=5, cpd=1))
mmu03320.pathview.png:
Challenge 1
- Use
pathview
to export a figure for “mmu04060”, but this time only use genes that are statistically significant at FDR < 0.01
Challenge 2 - GO term enrichment analysis
clusterProfiler
can also perform over-representation analysis on GO terms. using the commmandenrichGO
. Look at the help page for the commandenrichGO
(?enrichGO
) and have a look at the instructions in the clusterProfiler book.
- Run the over-representation analysis for GO terms
- Use genes that have an adjusted p-value (FDR) of less than 0.01 and an absolute fold change greater than 2.
- For this analysis you can use Ensembl IDs rather then Entrez
- You’ll need to provide the background (
universe
) genes, this should be all the genes in our analysis.- The mouse database package is called
org.Mm.eg.db
. You’ll need to load it usinglibrary
before running the analysis.
- As we are using Ensembl IDs, you’ll need to set the
keyType
parameter in theenrichGO
command to indicate this.- Only test terms in the “Molecular Function” ontology
- Use the
dotplot
function to visualise the results.
Gene Set Enrichment Analysis (GSEA) identifies gene sets that are related to the difference of interest between samples (Subramanian et al. 2005).
The software is distributed by the Broad Institute and is freely available for use by academic and non-profit organisations. The Broad also provide a number of very well curated gene sets for testing against your data - the Molecular Signatures Database (MSigDB). These are collections of human genes, however. Fortunately, these lists have been translated to mouse equivalents by the Walter+Eliza Hall Institute Bioinformatics service and made available for download.
The analysis is performed by:
The article describing the original software is available here, while this commentary on GSEA provides a shorter description.
We will use the fgsea
package (Sergushichev 2016) that implements the same algorithm in R. ‘fgsea’ stands for’ “fast preranked gene set enrichment analysis (GSEA)”.
library(fgsea)
We need to provide fgsea
with a vector containing numeric data by which it should rank the genes. To start with we will simply use a rank based on their fold change. We do not need to rank the genes ourselves, fgsea
will do this based on the data we provide.
We must exclude genes for which we do not have Entrez IDs. Also, we should use the shrunk LFC values.
gseaDat <- filter(shrinkLvV, !is.na(Entrez))
rankData <- gseaDat$logFC
names(rankData) <- gseaDat$Entrez
head(rankData)
## 497097 19888 20671 27395 18777 21399
## 0.5959404 -1.2525813 -0.5856117 0.4398234 0.6631350 -0.1187075
We will load the MSigDB Hallmark gene set. The pathways RData file loads a new object call Mm.H (H
for Hallmark gene set), which is a list of gene sets. The names of the vectors in list are the names of gene sets and each vector contains Entrez IDs for the genes in the gene set.
load("Robjects/mouse_H_v5.RData")
head(names(Mm.H))
## [1] "HALLMARK_TNFA_SIGNALING_VIA_NFKB" "HALLMARK_HYPOXIA"
## [3] "HALLMARK_CHOLESTEROL_HOMEOSTASIS" "HALLMARK_MITOTIC_SPINDLE"
## [5] "HALLMARK_WNT_BETA_CATENIN_SIGNALING" "HALLMARK_TGF_BETA_SIGNALING"
head(Mm.H[[1]])
## [1] "11303" "11796" "11797" "11839" "11852" "11910"
fgseaRes <- fgsea(pathways = Mm.H,
stats = rankData,
minSize = 15,
maxSize = 500)
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.09% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## Warning in fgseaMultilevel(...): For some pathways, in reality P-values are less
## than 1e-10. You can set the `eps` argument to zero for better estimation.
The warning produced indicates that there are few genes that have the same fold change and so are ranked equally. fgsea
arbitrarily determines which comes first in the ranked list. As long as this number is small it shouldn’t significantly affect the results. If the number is large something is suspicious about the fold change results.
Lets look at the top 10 results.
fgseaRes %>%
arrange(desc(abs(NES))) %>%
top_n(10, -padj)
## # A tibble: 10 x 8
## pathway pval padj log2err ES NES size leadingEdge
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <list>
## 1 HALLMARK_CHOLESTE… 1.32e- 8 3.30e-7 0.748 0.609 2.36 73 <chr [33]>
## 2 HALLMARK_OXIDATIV… 1.00e-10 5.00e-9 NA 0.514 2.29 196 <chr [127]>
## 3 HALLMARK_TGF_BETA… 7.45e- 6 4.65e-5 0.611 -0.591 -2.09 54 <chr [17]>
## 4 HALLMARK_ADIPOGEN… 6.45e- 8 1.08e-6 0.705 0.445 1.99 199 <chr [86]>
## 5 HALLMARK_FATTY_AC… 3.80e- 6 3.57e-5 0.627 0.438 1.91 155 <chr [56]>
## 6 HALLMARK_MTORC1_S… 1.60e- 6 2.00e-5 0.644 0.410 1.84 203 <chr [79]>
## 7 HALLMARK_KRAS_SIG… 5.00e- 6 3.57e-5 0.611 -0.415 -1.82 190 <chr [58]>
## 8 HALLMARK_ESTROGEN… 4.62e- 6 3.57e-5 0.611 -0.411 -1.81 193 <chr [64]>
## 9 HALLMARK_INFLAMMA… 4.69e- 5 2.58e-4 0.557 -0.398 -1.75 188 <chr [56]>
## 10 HALLMARK_TNFA_SIG… 5.16e- 5 2.58e-4 0.557 -0.388 -1.71 199 <chr [69]>
plotEnrichment(Mm.H[["HALLMARK_ESTROGEN_RESPONSE_EARLY"]], rankData)
Remember to check the GSEA article for the complete explanation.
Challenge 3
Another common way to rank the genes is to order by pvalue, but also, sorting so that upregulated genes are at start and downregulated at the other - you can do this combining the sign of the fold change and the pvalue.
- Rank the genes by statisical significance - you will need to create a new ranking value using
-log10({p value}) * sign({Fold Change})
- Load the “C2” pathways from the the
Robjects/mouse_c2_v5.RData
file
- Run
fgsea
using the new ranked genes and the C2 pathways
- Run
fgsea
using the new ranked genes and the H pathways. How do these results differ from the ones we got when ranking by the fold change alone?
Luo, Weijun, Brouwer, and Cory. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14): 1830–1. https://doi.org/10.1093/bioinformatics/btt285.
Sergushichev, Alexey. 2016. “An Algorithm for Fast Preranked Gene Set Enrichment Analysis Using Cumulative Statistic Calculation.” bioRxiv. https://doi.org/10.1101/060012.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.
Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “ClusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.