The list of differentially expressed genes is sometimes so long that its interpretation becomes cumbersome and time consuming. It may also be very short while some genes have low p-value yet higher than the given threshold.
A common downstream procedure to combine information across genes is gene set testing. It aims at finding pathways or gene networks the differentially expressed genes play a role in.
Various ways exist to test for enrichment of biological pathways. We will look into over representation and gene set enrichment analysis.
A gene set comprises genes that share a biological function, chromosomal location, or any other relevant criterion.
This method tests whether genes in a pathway are present in a subset of our data more than expected (explanations derived from the clusterProfiler manual).
Genes in the experiment are split in two ways:
We can then create a contingency table with:
Example:
d <- data.frame(
diffExpNo=c(1980, 17920),
diffExpYes=c(20, 80))
row.names(d) <- c("pathwayYes", "pathwayNo")
d
## diffExpNo diffExpYes
## pathwayYes 1980 20
## pathwayNo 17920 80
For a given pathway:
Significance can then be assessed with a the hypergeometric distribution:
The test above is identical to the one-tailed Fisher's exact test.
fisher.test(d, alternative = "greater")
##
## Fisher's Exact Test for Count Data
##
## data: d
## p-value = 0.9992
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.2874878 Inf
## sample estimates:
## odds ratio
## 0.4419959
To save time and effort there are a number of packages that make applying this test to a large number of gene sets simpler, and which will import gene lists for testing from various sources.
Today we will use clusterProfiler
.
clusterProfiler
clusterprofiler
(Yu et al. 2012) supports direct online access of the current KEGG database, rather than relying on R annotation packages, it also provides some nice visualisation options (KEGG: Kyoto Encyclopedia of Genes and Genomes).
library(clusterProfiler)
search_kegg_organism('mmu', by='kegg_code')
## kegg_code scientific_name common_name
## 14 mmu Mus musculus mouse
The input for the KEGG enrichment is list of gene IDs for significant genes.
We now load the R object keeping the outcome of the differential expression analysis for the LvV contrast.
load("Robjects/Annotated_Results_LvV.RData")
We will only use genes that have an adjusted p-value (FDR) of less than 0.04 and an absolute fold change greater than 2. We need to remember to eliminate genes with missing values in the FDR as a result of the independent filtering by DESeq2.
For this tool we need to use Entrez IDs, so we will need to eliminate genes with missing values in 'Entrez'.
sigGenes <- shrinkLvV %>%
drop_na(Entrez, FDR) %>%
filter(FDR < 0.05 & abs(logFC) > 1) %>%
pull(Entrez)
kk <- enrichKEGG(gene = sigGenes, organism = 'mmu')
head(kk, n=10)
## Reading KEGG annotation online:
##
## Reading KEGG annotation online:
## # A tibble: 10 x 9
## ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <int>
## 1 mmu01… Fatty acid … 24/1031 61/8802 2.81e-8 8.78e-6 6.82e-6 11363/7… 24
## 2 mmu00… Steroid bio… 12/1031 20/8802 3.22e-7 5.03e-5 3.91e-5 15490/1… 12
## 3 mmu01… Biosynthesi… 15/1031 33/8802 1.30e-6 1.35e-4 1.05e-4 30963/1… 15
## 4 mmu03… PPAR signal… 27/1031 89/8802 1.84e-6 1.44e-4 1.12e-4 11363/7… 27
## 5 mmu04… Calcium sig… 44/1031 200/88… 2.17e-5 1.36e-3 1.06e-3 12767/2… 44
## 6 mmu00… Valine, leu… 18/1031 56/8802 4.05e-5 2.11e-3 1.64e-3 227095/… 18
## 7 mmu00… Glycerolipi… 19/1031 62/8802 5.25e-5 2.35e-3 1.82e-3 68393/5… 19
## 8 mmu04… Cell adhesi… 38/1031 173/88… 8.06e-5 3.03e-3 2.35e-3 12487/2… 38
## 9 mmu04… Cytokine-cy… 57/1031 296/88… 8.71e-5 3.03e-3 2.35e-3 16178/1… 57
## 10 mmu04… PI3K-Akt si… 65/1031 356/88… 1.49e-4 4.65e-3 3.61e-3 12043/2… 65
clusterProfile
has a function browseKegg
that allows you to view the KEGG pathway in in your browser with the genes that are in our gene highlighted.
browseKEGG(kk, 'mmu03320')
The package pathview
(Luo et al. 2013) can be used to generate figures of KEGG pathways.
One advantage over the clusterProfiler
browser method is that the genes can be coloured according to their fold change levels in our data. To do this we need to pass pathview
a named vector of fold change data (actually you could colour by any numeric vector, e.g. p-value).
The package plots the KEGG pathway to a png
file in the working directory.
library(pathview)
logFC <- annotLvV$logFC
names(logFC) <- annotLvV$Entrez
pathview(gene.data = logFC,
pathway.id = "mmu03320",
species = "mmu",
limit = list(gene=5, cpd=1))
mmu03320.pathview.png:
Challenge 1
- Use
pathview
to export a figure for "mmu04060", but this time only use genes that are statistically significant at FDR < 0.01
Gene Set Enrichment Analysis (GSEA) identifies gene sets that are related to the difference of interest between samples (Subramanian et al. 2005).
The software is distributed by the Broad Institute and is freely available for use by academic and non-profit organisations. The Broad also provide a number of very well curated gene sets for testing against your data - the Molecular Signatures Database (MSigDB). These are collections of human genes, however. Fortunately, these lists have been translated to mouse equivalents by the Walter+Eliza Hall Institute Bioinformatics service and made available for download.
The analysis is performed by:
The article describing the original software is available here, while this commentary on GSEA provides a shorter description.
We will use the fgsea
package (Sergushichev 2016) that implements the same algorithm in R. 'fgsea' stands for' "fast preranked gene set enrichment analysis (GSEA)".
library(fgsea)
We need to provide fgsea
with a vector containing numeric data by which it should rank the genes. To start with we will simply use a rank based on their fold change. We do not need to rank the genes ourselves, fgsea
will do this based on the data we provide.
We must exclude genes for which we do not have Entrez IDs. Also, we should use the shrunk LFC values.
gseaDat <- filter(shrinkLvV, !is.na(Entrez))
rankData <- gseaDat$logFC
names(rankData) <- gseaDat$Entrez
head(rankData)
## 497097 19888 20671 27395 18777 21399
## 0.5959404 -1.2525813 -0.5856117 0.4398234 0.6631350 -0.1187075
We will load the MSigDB Hallmark gene set. The pathways RData file loads a new object call Mm.H (H
for Hallmark gene set), which is a list of gene sets. The names of the vectors in list are the names of gene sets and each vector contains Entrez IDs for the genes in the gene set.
load("Robjects/mouse_H_v5.RData")
head(names(Mm.H))
## [1] "HALLMARK_TNFA_SIGNALING_VIA_NFKB" "HALLMARK_HYPOXIA"
## [3] "HALLMARK_CHOLESTEROL_HOMEOSTASIS" "HALLMARK_MITOTIC_SPINDLE"
## [5] "HALLMARK_WNT_BETA_CATENIN_SIGNALING" "HALLMARK_TGF_BETA_SIGNALING"
head(Mm.H[[1]])
## [1] "11303" "11796" "11797" "11839" "11852" "11910"
fgseaRes <- fgsea(Mm.H,
rankData,
minSize = 15,
maxSize = 500)
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.09% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## Warning in fgseaMultilevel(...): For some pathways, in reality P-values are less
## than 1e-10. You can set the `eps` argument to zero for better estimation.
The warning produced indicates that there are few genes that have the same fold change and so are ranked equally. fgsea
arbitrarily determines which comes first in the ranked list. As long as this number is small it shouldn't significantly affect the results. If the number is large something is suspicious about the fold change results.
Lets look at the top 10 results.
fgseaRes %>%
arrange(desc(abs(NES))) %>%
top_n(10, -padj)
## # A tibble: 10 x 8
## pathway pval padj log2err ES NES size leadingEdge
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <list>
## 1 HALLMARK_CHOLESTE… 7.30e- 9 1.83e-7 0.761 0.609 2.32 73 <chr [33]>
## 2 HALLMARK_OXIDATIV… 1.00e-10 5.00e-9 NA 0.514 2.29 196 <chr [127]>
## 3 HALLMARK_TGF_BETA… 4.11e- 6 2.94e-5 0.611 -0.591 -2.14 54 <chr [17]>
## 4 HALLMARK_ADIPOGEN… 3.91e- 8 6.52e-7 0.720 0.445 1.99 199 <chr [86]>
## 5 HALLMARK_FATTY_AC… 2.23e- 6 2.78e-5 0.627 0.438 1.89 155 <chr [56]>
## 6 HALLMARK_MTORC1_S… 3.73e- 6 2.94e-5 0.627 0.410 1.84 203 <chr [79]>
## 7 HALLMARK_KRAS_SIG… 3.17e- 6 2.94e-5 0.627 -0.415 -1.84 190 <chr [58]>
## 8 HALLMARK_ESTROGEN… 5.26e- 6 3.29e-5 0.611 -0.411 -1.83 193 <chr [64]>
## 9 HALLMARK_INFLAMMA… 3.45e- 5 1.73e-4 0.557 -0.398 -1.75 188 <chr [56]>
## 10 HALLMARK_TNFA_SIG… 2.53e- 5 1.40e-4 0.576 -0.388 -1.72 199 <chr [69]>
plotEnrichment(Mm.H[["HALLMARK_ESTROGEN_RESPONSE_EARLY"]], rankData)
Remember to check the GSEA article for the complete explanation.
Challenge 2
Another common way to rank the genes is to order by pvalue, but also, sorting so that upregulated genes are at start and downregulated at the other - you can do this combining the sign of the fold change and the pvalue.
1. Rank the genes by statisical significance - you will need to create a new ranking value using-log10({p value}) * sign({Fold Change})
2. Load the "C2" pathways from the theRobjects/mouse_c2_v5.RData
file
3. Runfgsea
using the new ranked genes and the C2 pathways
4. Runfgsea
using the new ranked genes and the H pathways. How do these results differ from the ones we got when ranking by the fold change alone?
Luo, Weijun, Brouwer, and Cory. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14): 1830–1. doi:10.1093/bioinformatics/btt285.
Sergushichev, Alexey. 2016. “An Algorithm for Fast Preranked Gene Set Enrichment Analysis Using Cumulative Statistic Calculation.” bioRxiv. Cold Spring Harbor Labs Journals. doi:10.1101/060012.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43). National Academy of Sciences: 15545–50. doi:10.1073/pnas.0506580102.
Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “ClusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. doi:10.1089/omi.2011.0118.