The list of differentially expressed genes is sometimes so long that interpreting it becomes cumbersome and time-consuming. It may also be very short, while some genes have a low p-value that is nonetheless above the chosen threshold.
A common downstream procedure to combine information across genes is gene set testing. It aims at finding pathways or gene networks the differentially expressed genes play a role in.
Various ways exist to test for enrichment of biological pathways. We will look into over-representation analysis and gene set enrichment analysis.
A gene set comprises genes that share a biological function, chromosomal location, or any other relevant criterion.
To save time and effort there are a number of packages that make applying these tests to a large number of gene sets simpler, and which will import gene lists for testing from various sources.
Today we will use clusterProfiler.
Over-representation analysis tests whether genes in a pathway are present in a subset of our data in a higher number than expected by chance (explanations derived from the clusterProfiler manual).
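Genes in the experiment are split in two ways:
- annotated to the pathway or not
- differentially expressed or not
We can then create a contingency table with:
- rows: genes in the pathway or not
- columns: genes differentially expressed or not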
And test for independence of the two variables with the Fisher exact test.
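As a minimal sketch of this test, the base R code below builds a 2 x 2 contingency table and runs a one-sided Fisher exact test. All counts are made up for illustration and the variable names are our own; they are not taken from our data or from clusterProfiler.

```r
# Hypothetical counts, for illustration only
n_universe   <- 10000  # all genes tested in the experiment
n_pathway    <- 100    # genes annotated to the pathway
n_de         <- 400    # differentially expressed (DE) genes
n_de_pathway <- 20     # DE genes that are also in the pathway

# 2 x 2 table: pathway membership (rows) by DE status (columns)
contingency <- matrix(
  c(n_de_pathway,        n_pathway - n_de_pathway,
    n_de - n_de_pathway, n_universe - n_pathway - (n_de - n_de_pathway)),
  nrow = 2, byrow = TRUE,
  dimnames = list(pathway = c("in", "out"), DE = c("yes", "no"))
)

# One-sided test: are DE genes over-represented in the pathway?
ora_test <- fisher.test(contingency, alternative = "greater")
ora_test$p.value

# The same p-value, written as a hypergeometric tail probability
p_hyper <- phyper(n_de_pathway - 1, n_pathway, n_universe - n_pathway,
                  n_de, lower.tail = FALSE)
p_hyper
```

With these counts we would expect about 4 DE genes in the pathway by chance (100 x 400 / 10000), so observing 20 gives a very small p-value.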
clusterProfiler (Yu et al. 2012) supports direct online access of the current KEGG database (KEGG: Kyoto Encyclopedia of Genes and Genomes), rather than relying on R annotation packages. It also provides some nice visualisation options.
We first search the resource for mouse data:
library(tidyverse)
library(clusterProfiler)
search_kegg_organism('mouse', by='common_name')
## kegg_code scientific_name
## 20 mmur Microcebus murinus
## 22 mmu Mus musculus
## 23 mcal Mus caroli
## 24 mpah Mus pahari
## 26 mcoc Mastomys coucha
## 29 pleu Peromyscus leucopus
## 85 mmyo Myotis myotis
## 5095 asf Candidatus Arthromitus sp. SFB-mouse-Japan
## 5096 asm Candidatus Arthromitus sp. SFB-mouse-Yit
## 5097 aso Candidatus Arthromitus sp. SFB-mouse-NL
## common_name
## 20 gray mouse lemur
## 22 house mouse
## 23 Ryukyu mouse
## 24 shrew mouse
## 26 southern multimammate mouse
## 29 white-footed mouse
## 85 greater mouse-eared bat
## 5095 Candidatus Arthromitus sp. SFB-mouse-Japan
## 5096 Candidatus Arthromitus sp. SFB-mouse-Yit
## 5097 Candidatus Arthromitus sp. SFB-mouse-NL
We will use the ‘mmu’ ‘kegg_code’.
The input for the KEGG enrichment analysis is the list of gene IDs of significant genes.
We now load the R object keeping the outcome of the differential expression analysis for the d11 contrast.
shrink.d11 <- readRDS("RObjects/Shrunk_Results.d11.rds")
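We will only use genes that have:
- an adjusted p-value (FDR) of less than 0.05
- an absolute log2 fold change greater than 1, i.e. at least a two-fold change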
We need to remember to eliminate genes with missing values in the FDR as a result of the independent filtering by DESeq2.
For this tool we need to use Entrez IDs, so we will also need to eliminate genes with a missing Entrez ID (NA values in the ‘Entrez’ column).
# Entrez IDs of significant genes, dropping rows with a missing ID or FDR
sigGenes <- shrink.d11 %>%
  drop_na(Entrez, FDR) %>%
  filter(FDR < 0.05 & abs(logFC) > 1) %>%
  pull(Entrez)
keggRes <- enrichKEGG(gene = sigGenes, organism = 'mmu')
## Reading KEGG annotation online: "https://rest.kegg.jp/link/mmu/pathway"...
## Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/mmu"...
as_tibble(keggRes)
## # A tibble: 72 × 9
## ID Description GeneR…¹ BgRatio pvalue p.adjust qvalue geneID Count
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <int>
## 1 mmu04612 Antigen pro… 40/338 90/9070 8.15e-34 1.91e-31 1.36e-31 14991… 40
## 2 mmu05169 Epstein-Bar… 56/338 231/90… 8.46e-31 9.90e-29 7.04e-29 12502… 56
## 3 mmu05332 Graft-versu… 32/338 63/9070 1.44e-29 1.13e-27 8.01e-28 14939… 32
## 4 mmu04940 Type I diab… 33/338 70/9070 4.39e-29 2.57e-27 1.82e-27 16160… 33
## 5 mmu04145 Phagosome -… 48/338 182/90… 3.22e-28 1.51e-26 1.07e-26 16414… 48
## 6 mmu05330 Allograft r… 31/338 63/9070 4.10e-28 1.60e-26 1.14e-26 16160… 31
## 7 mmu05164 Influenza A… 45/338 173/90… 3.33e-26 1.11e-24 7.92e-25 21706… 45
## 8 mmu04514 Cell adhesi… 45/338 182/90… 3.43e-25 1.00e-23 7.12e-24 16414… 45
## 9 mmu05416 Viral myoca… 32/338 88/9070 6.77e-24 1.76e-22 1.25e-22 16414… 32
## 10 mmu05140 Leishmanias… 28/338 70/9070 2.22e-22 5.20e-21 3.70e-21 16414… 28
## # … with 62 more rows, and abbreviated variable name ¹GeneRatio
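In this table, GeneRatio is the number of our significant genes found in the pathway over the number of significant genes with any KEGG annotation, and BgRatio is the pathway size over the number of annotated background genes. As a small sketch, the two can be combined into a fold enrichment. The values are copied from the first row above; the helper name `ratio_to_numeric` is our own, not a clusterProfiler function.

```r
# Ratios copied from the first row of the table above (mmu04612)
gene_ratio <- "40/338"
bg_ratio   <- "90/9070"

# Hypothetical helper: turn a "k/n" string into the number k / n
ratio_to_numeric <- function(x) {
  parts <- as.numeric(strsplit(x, "/", fixed = TRUE)[[1]])
  parts[1] / parts[2]
}

# How many times more frequent the pathway is among our significant genes
fold_enrichment <- ratio_to_numeric(gene_ratio) / ratio_to_numeric(bg_ratio)
fold_enrichment
```

For this pathway the fold enrichment is close to 12, which is in keeping with the very small p-value.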
clusterProfiler has a function browseKEGG to view the KEGG pathway in a browser, highlighting the genes we selected as differentially expressed.
We will show one of the top hits: pathway ‘mmu04612’ for ‘Antigen processing and presentation’.
browseKEGG(keggRes, 'mmu04612')
The package pathview (Luo et al. 2013) can be used to generate figures of KEGG pathways.
One advantage over the clusterProfiler browser method browseKEGG is that genes can be coloured according to fold change levels in our data. To do this we need to pass pathview a named vector of fold change values (one could in fact colour by any numeric vector, e.g. p-value).
The package plots the KEGG pathway to a png file in the working directory.
library(pathview)
logFC <- shrink.d11$logFC
names(logFC) <- shrink.d11$Entrez
pathview(gene.data = logFC,
         pathway.id = "mmu04612",
         species = "mmu",
         limit = list(gene=20, cpd=1))
mmu04612.pathview.png:
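Naming the vector straight from the Entrez column leaves NA names for genes without an Entrez ID and can repeat an ID. pathview may well cope with both; the sketch below is an optional tidy-up on a toy table (made-up values, hypothetical object names) that keeps the first row for each ID.

```r
# Toy stand-in for the results table (values are made up)
toy_results <- data.frame(
  Entrez = c("11", "22", NA, "22", "33"),
  logFC  = c(1.5, -0.4, 2.1, 0.9, -3.2)
)

# Keep rows that have an Entrez ID, and only the first row for each ID
keep <- !is.na(toy_results$Entrez) & !duplicated(toy_results$Entrez)
toy_clean <- toy_results[keep, ]

# Named vector of fold changes, one entry per Entrez ID
toy_logFC <- setNames(toy_clean$logFC, toy_clean$Entrez)
toy_logFC
```

Keeping the first row per ID is an arbitrary choice made for the sketch; averaging the fold changes per ID would be another option.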
Exercise 1
- Use pathview to export a figure for “mmu04659” or “mmu04658”, but this time only use genes that are statistically significant at FDR < 0.01
clusterProfiler can also perform over-representation analysis on GO terms using the command enrichGO. For this analysis we will use Ensembl gene IDs instead of Entrez IDs, and in order to do this we need to load another package, org.Mm.eg.db, which contains the mouse annotation database.
To run the GO enrichment analysis, we also need a couple of extra things. Firstly, we should provide the ‘universe’: a list of all the genes in our DE analysis, not just the ones we have selected as significant.
Gene Ontology terms are divided into 3 categories:
- Molecular Function
- Biological Process
- Cellular Component
For this analysis we will narrow our search to terms in the ‘Biological Process’ ontology by setting the ‘ont’ argument to “BP” (the default is “MF”, Molecular Function).
library(org.Mm.eg.db)
# Ensembl IDs of the significant genes (stricter thresholds than for KEGG)
sigGenes_GO <- shrink.d11 %>%
  drop_na(FDR) %>%
  filter(FDR < 0.01 & abs(logFC) > 2) %>%
  pull(GeneID)
# All genes in the DE analysis form the background
universe <- shrink.d11$GeneID
ego <- enrichGO(gene = sigGenes_GO,
                universe = universe,
                OrgDb = org.Mm.eg.db,
                keyType = "ENSEMBL",
                ont = "BP",
                pvalueCutoff = 0.01,
                readable = TRUE)
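To see why the choice of universe matters, here is a small sketch with made-up counts for a single GO term. It is a simplification: only the universe size changes, whereas in practice the number of genes annotated to the term would change too. The helper `ora_p` is hypothetical.

```r
# Made-up counts for a single GO term (illustration only)
n_sig         <- 300   # significant genes
n_term        <- 80    # genes annotated to the term
n_sig_in_term <- 12    # significant genes annotated to the term

# Hypothetical helper: over-representation p-value for a given universe size
ora_p <- function(n_universe) {
  phyper(n_sig_in_term - 1, n_term, n_universe - n_term, n_sig,
         lower.tail = FALSE)
}

p_tested_genes <- ora_p(12000)  # universe = genes tested in our experiment
p_whole_genome <- ora_p(22000)  # universe = a larger, genome-wide background
c(tested = p_tested_genes, genome = p_whole_genome)
```

The larger background makes the same overlap look more surprising, so using genes that were never tested as background tends to overstate the enrichment.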
We can use the barplot function to visualise the results. Count is the number of differentially expressed genes in each gene ontology term.
barplot(ego, showCategory=20)
Or perhaps the dotplot version is more informative. Gene ratio is Count divided by the number of genes in our significant list that have a GO annotation.
dotplot(ego, font.size = 14)
Another visualisation that can be nice to try is the emapplot, which shows the overlap between genes in the different GO terms.
library(enrichplot)
ego_pt <- pairwise_termsim(ego)
emapplot(ego_pt, cex_label_category = 0.25)
## Warning in emapplot.enrichResult(x, showCategory = showCategory, ...): Use 'cex.params = list(category_label = your_value)' instead of 'cex_label_category'.
## The cex_label_category parameter will be removed in the next version.
Gene Set Enrichment Analysis (GSEA) identifies gene sets that are enriched in the dataset between samples (Subramanian et al. 2005).
The software is distributed by the Broad Institute and is freely available for use by academic and non-profit organisations. The Broad also provide a number of very well curated gene sets for testing against your data: the Molecular Signatures Database (MSigDB). These are collections of human genes. Fortunately, these lists have been translated to mouse equivalents by the Walter and Eliza Hall Institute Bioinformatics service and made available for download. They are now also available from a recent R package, msigdbr, which we will use.
Let’s load msigdbr now.
library(msigdbr)
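The analysis is performed by:
- ranking all genes in the data set
- identifying the rank positions of all members of the gene set in the ranked data set
- calculating an enrichment score (ES) that represents the difference between the observed rankings and those expected under a random rank distribution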
The article describing the original software is available here, while this commentary on GSEA provides a shorter description.
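The running-sum statistic behind the enrichment score can be sketched in a few lines of base R. This is a simplified illustration on a toy ranked list (made-up values, hypothetical gene names and function name), not the code used by clusterProfiler, and it leaves out the permutation step used to assess significance.

```r
# Toy ranked list (made-up values), already sorted in decreasing order
ranked <- c(g1 = 3.0, g2 = 2.2, g3 = 1.5, g4 = 0.7, g5 = 0.1,
            g6 = -0.3, g7 = -1.1, g8 = -1.8, g9 = -2.5, g10 = -3.1)
gene_set <- c("g1", "g2", "g4")  # hypothetical gene set

# Hypothetical helper: running sum along the ranked list
running_es <- function(ranked, gene_set, p = 1) {
  in_set <- names(ranked) %in% gene_set
  # A gene in the set steps the sum up in proportion to |metric|^p
  hit_step <- abs(ranked)^p * in_set
  hit_step <- hit_step / sum(hit_step)
  # A gene outside the set steps the sum down by a constant amount
  miss_step <- (!in_set) / sum(!in_set)
  cumsum(hit_step - miss_step)
}

es_curve <- running_es(ranked, gene_set)
# Enrichment score: the largest deviation of the running sum from zero
es <- es_curve[which.max(abs(es_curve))]
es
```

Here the gene set sits near the top of the ranking, so the running sum peaks early (at g2) and the enrichment score is positive.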
We will use clusterProfiler’s GSEA function (Yu et al. 2012), which implements the same algorithm in R.
We need to provide GSEA with a vector containing values for a given gene metric, e.g. log(fold change), sorted in decreasing order.
To start with we will simply rank the genes based on their fold change.
We must exclude genes with no Ensembl ID.
Also, we should use the shrunk LFC values.
# Named vector of shrunk LFC values, sorted from most up- to most down-regulated
rankedGenes <- shrink.d11 %>%
  filter(!is.na(GeneID)) %>%
  mutate(rank = logFC) %>%
  arrange(desc(rank)) %>%
  pull(rank, GeneID)
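Before running GSEA it can be worth checking that the ranked vector has the expected shape. The sketch below does this on a toy vector with made-up values. The helper `is_valid_ranking` is hypothetical, and the requirement for unique gene IDs is our assumption rather than something stated above.

```r
# Toy ranked vector (made-up values), named by gene ID
toy_ranked <- c(geneC = 2.4, geneA = 0.8, geneB = -1.3)

# Hypothetical helper: TRUE only if the vector meets all the expectations
is_valid_ranking <- function(x) {
  !is.null(names(x)) &&               # every value is labelled with a gene ID
    !anyNA(x) && !anyNA(names(x)) &&  # no missing metric values or IDs
    anyDuplicated(names(x)) == 0 &&   # each gene appears only once (assumption)
    !is.unsorted(rev(x))              # values are in decreasing order
}

is_valid_ranking(toy_ranked)       # correctly ordered vector
is_valid_ranking(rev(toy_ranked))  # same values in increasing order
```

The same helper could be called on rankedGenes once it has been created.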
We will load the MSigDB Hallmark gene set with msigdbr, setting the category parameter to ‘H’ for Hallmark gene set. The object created is a tibble with information on each {gene set; gene} pair (one per row). We will only keep the gene set name and the gene Ensembl ID.
# Gene set membership: one row per {gene set; gene} pair
term2gene <- msigdbr(species = "Mus musculus", category = "H") %>%
  dplyr::select(gs_name, ensembl_gene)
# Gene set descriptions: one row per gene set
term2name <- msigdbr(species = "Mus musculus", category = "H") %>%
  dplyr::select(gs_name, gs_description) %>%
  distinct()
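Arguments passed to GSEA include:
- the ranked genes
- TERM2GENE: the pathway definitions, i.e. the term2gene object created above
- TERM2NAME: the gene set descriptions, i.e. the term2name object created above
- pvalueCutoff: the p-value threshold for reporting gene sets; setting it to 1.00 keeps all of them
- minGSSize and maxGSSize: the minimum and maximum gene set sizes to test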
gseaRes <- GSEA(rankedGenes,
                TERM2GENE = term2gene,
                TERM2NAME = term2name,
                pvalueCutoff = 1.00,
                minGSSize = 15,
                maxGSSize = 500)
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
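The size limits decide which gene sets are tested at all. The sketch below shows the idea on toy data with made-up set names and gene IDs. It assumes, as we believe is the case but have not verified here, that a set's size is counted after restricting it to genes present in the ranked list.

```r
# Toy gene set table, same two columns as term2gene (made-up content)
toy_term2gene <- data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_A", "SET_B", "SET_B", "SET_C"),
  ensembl_gene = c("g1", "g2", "g3", "g2", "g9", "g4")
)
toy_ranked_ids <- c("g1", "g2", "g3", "g4", "g5")  # IDs in the ranked list

# Assumption: only genes present in the ranked list count towards set size
in_data   <- toy_term2gene[toy_term2gene$ensembl_gene %in% toy_ranked_ids, ]
set_sizes <- table(in_data$gs_name)
set_sizes

# Toy thresholds playing the role of minGSSize and maxGSSize
min_size <- 2
max_size <- 500
kept <- names(set_sizes)[set_sizes >= min_size & set_sizes <= max_size]
kept
```

Only SET_A passes the toy thresholds here: SET_B loses g9, which is absent from the ranked list, and falls below the minimum along with SET_C.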
Let’s look at the top 10 results.
as_tibble(gseaRes) %>%
  arrange(desc(abs(NES))) %>%
  top_n(10, wt = -p.adjust) %>%
  dplyr::select(-core_enrichment) %>%
  # round the scores and show the p-values in scientific notation
  mutate(across(c("enrichmentScore", "NES"), ~ round(.x, digits = 3))) %>%
  mutate(across(c("pvalue", "p.adjust", "qvalue"), scales::scientific))
The enrichment score plot displays, along an x-axis that represents the decreasing gene rank:
- the genes in the gene set, as vertical lines
- the running enrichment score
gseaplot(gseaRes,
         geneSetID = "HALLMARK_INFLAMMATORY_RESPONSE",
         title = "HALLMARK_INFLAMMATORY_RESPONSE")
Remember to check the GSEA article for the complete explanation.
Exercise 2
Another common way to rank the genes is to order by p-value while sorting so that upregulated genes are at the start and downregulated genes at the end. You can do this by combining the sign of the fold change and the p-value.
- Rank the genes by statistical significance - you will need to create a new ranking value using -log10({p value}) * sign({Fold Change}).
- Run GSEA using the new ranked genes and the H pathways.
- Conduct the same analysis for the day 33 Infected vs Uninfected contrast.