The list of differentially expressed genes is sometimes so long that its interpretation becomes cumbersome and time consuming. A common downstream procedure is gene set testing. It aims to understand which pathways or gene networks the differentially expressed genes are implicated in.
Various ways exist to test for enrichment of biological pathways.
Gene Set Enrichment Analysis GSEA was tests whether a set of genes of interest, e.g. genes (Subramanian et al. 2005). The software is distributed by the Broad Institute and is freely available for use by academic and non-profit organisations.
In addition to the GSEA software the Broad also provide a number of very well curated gene set
s for testing against your data - the Molecular Signatures Database (MSigDB). Unfortunately, these are collections of human genes, however, these lists have been translated to mouse equivalents by the Walter+Eliza Hall Institutes Bioinformatics service and made avaialble for download.
The analysis is performed by:
data set
gene set
in the ranked data set
commentary on GSEA. The article describing the original software is available here.
fgsea
The fgsea
package (Sergushichev 2016) implements the same algorithm in R vignette “fast preranked gene set enrichment analysis (GSEA)”.
library(fgsea)
## Loading required package: Rcpp
load("Robjects/Annotated_Results_LvV.RData")
We need to provide fgsea
a vector containing numeric data by which it should rank the genes. To start with we will simply use a rank based on their fold change. We do not need to rank the genes ourselves, fgsea
will do this based on the data we provide.
We need to exclude genes for which we do not have Entrez IDs. Also, we should use the shrunk LFC values.
gseaDat <- filter(shrinkLvV, !is.na(Entrez))
rankData <- gseaDat$logFC
names(rankData) <- gseaDat$Entrez
head(rankData)
## 497097 19888 20671 27395 18777 21399
## 0.5959404 -1.2525813 -0.5856117 0.4398234 0.6631350 -0.1187075
load("Robjects/mouse_H_v5.RData")
pathwaysH <- Mm.H
fgseaRes <- fgsea(pathwaysH,
rankData,
minSize=15,
maxSize = 500,
nperm=1000)
## Warning in fgsea(pathwaysH, rankData, minSize = 15, maxSize = 500, nperm = 1000): There are ties in the preranked stats (0.09% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
The warning produced indicates that there are few genes that have the same fold change and so are ranked equally. fgsea
with arbitrarily order determine which comes first in the ranked list. As long as this number is small it shouldn’t significantly effect the results. If the number is large something is suspicious about the fold change results.
Lets look at the top 10 results.
fgseaRes %>%
arrange(desc(abs(NES))) %>%
top_n(10, -padj)
## # A tibble: 14 x 8
## pathway pval padj ES NES nMoreExtreme size leadingEdge
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <list>
## 1 HALLMARK_CH… 0.00220 0.00787 0.609 2.31 0 73 <chr [33]>
## 2 HALLMARK_OX… 0.00204 0.00787 0.514 2.30 0 196 <chr [127]>
## 3 HALLMARK_TG… 0.00189 0.00787 -0.591 -2.09 0 54 <chr [17]>
## 4 HALLMARK_AD… 0.00206 0.00787 0.445 1.99 0 199 <chr [86]>
## 5 HALLMARK_FA… 0.00203 0.00787 0.438 1.91 0 155 <chr [56]>
## 6 HALLMARK_MT… 0.00207 0.00787 0.410 1.84 0 203 <chr [79]>
## 7 HALLMARK_KR… 0.00196 0.00787 -0.415 -1.80 0 190 <chr [58]>
## 8 HALLMARK_ES… 0.00194 0.00787 -0.411 -1.79 0 193 <chr [64]>
## 9 HALLMARK_IN… 0.00196 0.00787 -0.398 -1.72 0 188 <chr [56]>
## 10 HALLMARK_PE… 0.00215 0.00787 0.426 1.70 0 95 <chr [35]>
## 11 HALLMARK_TN… 0.00194 0.00787 -0.388 -1.69 0 199 <chr [69]>
## 12 HALLMARK_CO… 0.00190 0.00787 -0.406 -1.67 0 129 <chr [39]>
## 13 HALLMARK_MY… 0.00194 0.00787 -0.364 -1.59 0 201 <chr [69]>
## 14 HALLMARK_ES… 0.00195 0.00787 -0.349 -1.52 0 197 <chr [52]>
plotEnrichment(pathwaysH[["HALLMARK_ESTROGEN_RESPONSE_EARLY"]], rankData)
Remember to check the GSEA article for the complete explanation.
The function plotGseaTable
allows us to plot a summary figue showing the results for multiple pathways.
topPathways <- fgseaRes %>%
top_n(20, wt=-padj) %>%
arrange(-NES) %>%
pull(pathway)
plotGseaTable(pathwaysH[topPathways],
rankData,
fgseaRes,
gseaParam = 0.5)
Challenge 1
Another common way to rank the genes is to order by pvalue, but also, sorting so that upregulated genes are at start and downregulated at the other - you can do this combining the sign of the fold change and the pvalue.
1. Rank the genes by statisical significance - you will need to create a new ranking value using-log10({p value}) * sign({Fold Change})
2. Load the “C2” pathways from the thedata/mouse_c2_v5.RData
file
3. Runfgsea
using the new ranked genes and the C2 pathways
4. Runfgsea
using the new ranked genes and the H pathways. How do these results differ from the ones we got when ranking by the fold change alone?
goseq
GOseq is a method to conduct Gene Ontology (GO) analysis suitable for RNA-seq data as it accounts for the gene length bias in detection of over-representation (Young et al. 2010).
From the GOseq vignette:
“GO analysis of RNA-seq data requires the use of random sampling in order to generate a suitable null distribution for GO category membership and calculate each category’s significance for over representation amongst DE genes. … In most cases, the Wallenius distribution can be used to approximate the true null distribution, without any significant loss in accuracy. The goseq package implements this approximation as its default option.”
library(goseq)
supportedOrganisms() %>% filter(str_detect(Genome, "mm"))
## # A tibble: 10 x 5
## Genome Id `Id Description` `Lengths in geneLen… `GO Annotation Ava…
## <fct> <fct> <fct> <lgl> <lgl>
## 1 mm7 knownG… Entrez Gene ID TRUE TRUE
## 2 mm7 ensGene Ensembl gene ID TRUE TRUE
## 3 mm7 geneSy… Gene Symbol TRUE TRUE
## 4 mm8 knownG… Entrez Gene ID TRUE TRUE
## 5 mm8 ensGene Ensembl gene ID TRUE TRUE
## 6 mm8 geneSy… Gene Symbol TRUE TRUE
## 7 mm9 knownG… Entrez Gene ID TRUE TRUE
## 8 mm9 ensGene Ensembl gene ID TRUE TRUE
## 9 mm9 geneSy… Gene Symbol TRUE TRUE
## 10 mm10 "" "" FALSE TRUE
The input for goseq
is a vector that indicates, for each gene, whether or not it is significantly differentially expressed. This should be a named vector, where the names are the gene ids and the values are 1
if the gene is significant and a 0
if it is not.
In this case we can use the Ensembl gene IDs.
sigData <- as.integer( shrinkLvV$FDR < 0.01 & !is.na(shrinkLvV$FDR) )
names(sigData) <- shrinkLvV$GeneID
pwf <- nullp(sigData, "mm10", "ensGene", bias.data = shrinkLvV$medianTxLength)
## Warning in pcls(G): initial point very close to some inequality constraints
goResults <- goseq(pwf, "mm10","ensGene", test.cats=c("GO:BP"))
goResults %>%
top_n(10, wt=-over_represented_pvalue) %>%
mutate(hitsPerc=numDEInCat*100/numInCat) %>%
ggplot(aes(x=hitsPerc,
y=term,
colour=over_represented_pvalue,
size=numDEInCat)) +
geom_point() +
expand_limits(x=0) +
labs(x="Hits (%)", y="GO term", colour="p value", size="Count")
library(GO.db)
GOTERM[[goResults$category[1]]]
## GOID: GO:0044281
## Term: small molecule metabolic process
## Ontology: BP
## Definition: The chemical reactions and pathways involving small
## molecules, any low molecular weight, monomeric, non-encoded
## molecule.
## Synonym: small molecule metabolism
Challenge 2
- Create a vector showing genes that are statistically significant at FDR < 0.01 and that are up-regulated by at least 4x (logFC>2) in lactating mice
- Run a
goseq
analysis on this gene list- Plot the results
- How is this result different to the previous GO analysis?
clusterProfiler
We can analyse for enrichment of KEGG pathways in much the same way as for GO terms. We could also use goseq
for this, but this time we’re going to use clusterProfiler
(Yu et al. 2012). clusterprofiler
supports direct online access of the current KEGG database, rather than relying on R annotation packages, it also provides some nice visualisation options.
library(clusterProfiler)
search_kegg_organism('mmu', by='kegg_code')
## kegg_code scientific_name common_name
## 14 mmu Mus musculus mouse
The input for the KEGG enrichment is list of gene IDs for significant genes.
This time we’ll only use gene with an absolute fold change greater than 2.
For this tool we need to use Entrez IDs, so once again we will have to eliminate the missing values.
sigGenes <- shrinkLvV %>%
filter(FDR < 0.05 & !is.na(FDR) &
abs(logFC) > 1 &
!is.na(Entrez)) %>%
pull(Entrez)
kk <- enrichKEGG(gene = sigGenes, organism = 'mmu')
head(kk, n=10)
## # A tibble: 10 x 9
## ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 mmu0… Fatty acid… 24/1018 61/8629 3.22e-8 9.94e-6 7.72e-6 11363…
## 2 mmu0… Steroid bi… 12/1018 19/8629 1.56e-7 2.41e-5 1.87e-5 15490…
## 3 mmu0… Biosynthes… 15/1018 32/8629 8.69e-7 8.95e-5 6.96e-5 30963…
## 4 mmu0… PPAR signa… 27/1018 87/8629 1.29e-6 9.97e-5 7.74e-5 11363…
## 5 mmu0… Valine, le… 18/1018 56/8629 4.45e-5 2.32e-3 1.80e-3 22709…
## 6 mmu0… Glycerolip… 19/1018 61/8629 4.50e-5 2.32e-3 1.80e-3 68393…
## 7 mmu0… Cell adhes… 38/1018 171/86… 7.20e-5 3.18e-3 2.47e-3 12487…
## 8 mmu0… Cytokine-c… 57/1018 296/86… 1.05e-4 4.07e-3 3.16e-3 16178…
## 9 mmu0… Calcium si… 41/1018 194/86… 1.25e-4 4.30e-3 3.34e-3 12767…
## 10 mmu0… Histidine … 10/1018 24/8629 2.06e-4 5.37e-3 4.17e-3 56752…
## # … with 1 more variable: Count <int>
clusterProfile
has a function browseKegg
that allows you to view the Kegg pathway with the genes colours in in your browser.
browseKEGG(kk, 'mmu03320')
The package pathview
(Luo et al. 2013) can be used to generate figures of KEGG pathways.
One advantage over clusterProfiles
browser method is that the genes are coloured according to their fold change levels in our data. To do this we need to pass pathview
a named vector of fold change data (actually you could colour by any numeric vector, e.g. p-value).
The package plots the KEGG pathway to a png
file in the working directory.
library(pathview)
logFC <- annotLvV$logFC
names(logFC) <- annotLvV$Entrez
pathview(gene.data = logFC,
pathway.id = "mmu03320",
species = "mmu",
limit = list(gene=5, cpd=1))
mmu03320.pathview.png:
Challenge 3
- Use
pathview
to export a figure for “mmu04060”, but this time only use genes that are statistically significant at FDR < 0.01
Luo, Weijun, Brouwer, and Cory. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14): 1830–1. https://doi.org/10.1093/bioinformatics/btt285.
Sergushichev, Alexey. 2016. “An Algorithm for Fast Preranked Gene Set Enrichment Analysis Using Cumulative Statistic Calculation.” bioRxiv. Cold Spring Harbor Labs Journals. https://doi.org/10.1101/060012.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43). National Academy of Sciences: 15545–50. https://doi.org/10.1073/pnas.0506580102.
Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia Oshlack. 2010. “Gene Ontology Analysis for Rna-Seq: Accounting for Selection Bias.” Genome Biology 11: R14.
Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “ClusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.