The list of differentially expressed genes is sometimes so long that its interpretation becomes cumbersome and time consuming. It may also be very short while some genes have low p-value yet higher than the given threshold.

A common downstream procedure to combine information across genes is gene set testing. It aims at finding pathways or gene networks the differentially expressed genes play a role in.

Various ways exist to test for enrichment of biological pathways. We will look into over representation and gene set enrichment analysis.

A gene set comprises genes that share a biological function, chromosomal location, or any other relevant criterion.

Over-representation

Method

This method tests whether genes in a pathway are present in a subset of our data more than expected (explanations derived from the clusterProfiler manual).

Genes is the experiment are split in two ways:

annotated to the pathway or not
differentially expressed or not

Contingency table with:

rows: in pathway or not
columns: differentially expressed or not

Example:

d <- data.frame(
  diffExpNo=c(1980, 17920),
  diffExpYes=c(20, 80))
row.names(d) <- c("pathwayYes", "pathwayNo")
d

##            diffExpNo diffExpYes
## pathwayYes      1980         20
## pathwayNo      17920         80

For a given pathway:

N: total number of genes in the background set, e.g. all genes tested
M: number of genes within that background distribution that are annotated to the pathway
n: number of differentially expressed genes
k: number of differentially expressed genes that are annotated to the pathway

Significance can be assessed with a the hypergeometric distribution:

The test above is identical to the one-tailed Fisher’s exact test.

fisher.test(d, alternative = "greater")

## 
##  Fisher's Exact Test for Count Data
## 
## data:  d
## p-value = 0.9992
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.2874878       Inf
## sample estimates:
## odds ratio 
##  0.4419959

`clusterProfiler`

clusterprofiler (Yu et al. 2012) supports direct online access of the current KEGG database, rather than relying on R annotation packages, it also provides some nice visualisation options (KEGG: Kyoto Encyclopedia of Genes and Genomes).

library(clusterProfiler)
search_kegg_organism('mmu', by='kegg_code')

##    kegg_code scientific_name common_name
## 14       mmu    Mus musculus       mouse

KEGG enrichment analysis

The input for the KEGG enrichment is list of gene IDs for significant genes.

We now load the R object keeping the outcome of the differential expression analysis for the LvV contrast.

load("Robjects/Annotated_Results_LvV.RData")

We will only use genes with an absolute fold change greater than 2.

For this tool we need to use Entrez IDs, so we will eliminate genes with no such ID by filtering on missing values in ‘Entrez’.

sigGenes <- shrinkLvV %>% 
    filter(FDR < 0.05 & !is.na(FDR) & 
               abs(logFC) > 1 & 
               !is.na(Entrez)) %>% 
    pull(Entrez)

kk <- enrichKEGG(gene = sigGenes, organism = 'mmu')

## Reading KEGG annotation online:
## 
## Reading KEGG annotation online:

head(kk, n=10)

##                ID                                Description GeneRatio  BgRatio
## mmu01212 mmu01212                      Fatty acid metabolism   24/1028  61/8786
## mmu00100 mmu00100                       Steroid biosynthesis   12/1028  20/8786
## mmu01040 mmu01040    Biosynthesis of unsaturated fatty acids   15/1028  33/8786
## mmu03320 mmu03320                     PPAR signaling pathway   27/1028  89/8786
## mmu00280 mmu00280 Valine, leucine and isoleucine degradation   18/1028  56/8786
## mmu00561 mmu00561                    Glycerolipid metabolism   19/1028  62/8786
## mmu04020 mmu04020                  Calcium signaling pathway   42/1028 195/8786
## mmu04514 mmu04514             Cell adhesion molecules (CAMs)   38/1028 173/8786
## mmu04060 mmu04060     Cytokine-cytokine receptor interaction   57/1028 296/8786
## mmu04151 mmu04151                 PI3K-Akt signaling pathway   65/1028 356/8786
##                pvalue     p.adjust       qvalue
## mmu01212 2.744654e-08 8.590767e-06 6.644952e-06
## mmu00100 3.177791e-07 4.973243e-05 3.846800e-05
## mmu01040 1.279131e-06 1.334560e-04 1.032281e-04
## mmu03320 1.804478e-06 1.412004e-04 1.092184e-04
## mmu00280 3.989195e-05 2.444575e-03 1.890877e-03
## mmu00561 5.166462e-05 2.444575e-03 1.890877e-03
## mmu04020 5.467102e-05 2.444575e-03 1.890877e-03
## mmu04514 7.872988e-05 2.937966e-03 2.272514e-03
## mmu04060 8.447825e-05 2.937966e-03 2.272514e-03
## mmu04151 1.437911e-04 4.500660e-03 3.481257e-03
##                                                                                                                                                                                                                                                                                                                                                                                                               geneID
## mmu01212                                                                                                                                                                                                                                                         11363/74205/30963/170439/66775/12896/54325/66885/14081/71147/94180/68801/107476/14104/54326/74147/110460/12894/56473/76267/30049/20250/329065/20249
## mmu00100                                                                                                                                                                                                                                                                                                                                    15490/12613/18194/74754/13121/13360/16987/66234/235293/20775/73166/16889
## mmu01040                                                                                                                                                                                                                                                                                                                30963/170439/66775/54325/70025/68801/54326/26897/171281/56473/76267/30049/20250/329065/20249
## mmu03320                                                                                                                                                                                                                                       11363/74205/13167/20181/26569/14933/15360/11520/12896/13117/14077/108078/14081/16956/26457/94180/17436/208715/19013/74147/225579/12894/56473/30049/20250/329065/20249
## mmu00280                                                                                                                                                                                                                                                                                            227095/11761/71724/56752/56357/15360/13171/100470/78894/12035/12039/66885/208715/104776/67041/74147/110460/17850
## mmu00561                                                                                                                                                                                                                                                                                     68393/56752/12613/331374/14933/231510/14187/320127/67800/381925/16956/15450/218121/14245/217480/68262/55979/16891/14732
## mmu04020                                                                                                                                        12767/215303/20192/18796/11550/18795/228785/58861/12373/108058/21338/68279/18127/18595/109305/269717/18430/11941/15465/20190/233979/215999/21390/14676/54140/13867/11556/11513/104110/19229/12323/53313/18436/94045/12291/13866/18750/11607/12062/107589/12322/15562
## mmu04514                                                                                                                                                                      12487/21367/20339/20344/94332/12490/16728/12738/22329/20345/12740/12739/192198/66797/17123/71908/12550/58187/69524/18613/18260/20969/319504/18191/14663/12737/239931/54420/54419/56863/15001/15002/15006/15015/15007/14963/14997/18190
## mmu04060                                                 16178/12778/21934/16324/12767/93672/215257/16181/12156/14563/16164/245527/22035/12977/242316/12804/230405/16847/17480/21938/21936/16193/330122/20309/16846/16156/21803/83430/23886/16168/76527/14560/12159/29820/21943/12160/252837/13051/16878/57916/20292/16323/57266/56838/14600/18414/19116/18383/12984/12983/11482/21950/21948/53603/12978/56708/77125
## mmu04151 12043/21960/20181/27219/16776/12841/14696/14173/24088/12977/13645/65086/21687/108079/12840/19699/18127/16193/14184/18595/16590/11839/20750/14254/21802/22371/14709/14936/56636/16001/17863/12834/12833/12567/13867/13685/22341/12845/16453/14180/67168/319480/235542/665033/14172/320207/22059/14710/13866/18750/20963/21828/18708/109700/30955/14600/18414/19116/11600/18591/12824/13640/16772/12978/83490
##          Count
## mmu01212    24
## mmu00100    12
## mmu01040    15
## mmu03320    27
## mmu00280    18
## mmu00561    19
## mmu04020    42
## mmu04514    38
## mmu04060    57
## mmu04151    65

head(kk, n=10) %>% as_tibble()

## # A tibble: 10 x 9
##    ID     Description  GeneRatio BgRatio  pvalue p.adjust  qvalue geneID   Count
##    <chr>  <chr>        <chr>     <chr>     <dbl>    <dbl>   <dbl> <chr>    <int>
##  1 mmu01… Fatty acid … 24/1028   61/8786 2.74e-8  8.59e-6 6.64e-6 11363/7…    24
##  2 mmu00… Steroid bio… 12/1028   20/8786 3.18e-7  4.97e-5 3.85e-5 15490/1…    12
##  3 mmu01… Biosynthesi… 15/1028   33/8786 1.28e-6  1.33e-4 1.03e-4 30963/1…    15
##  4 mmu03… PPAR signal… 27/1028   89/8786 1.80e-6  1.41e-4 1.09e-4 11363/7…    27
##  5 mmu00… Valine, leu… 18/1028   56/8786 3.99e-5  2.44e-3 1.89e-3 227095/…    18
##  6 mmu00… Glycerolipi… 19/1028   62/8786 5.17e-5  2.44e-3 1.89e-3 68393/5…    19
##  7 mmu04… Calcium sig… 42/1028   195/87… 5.47e-5  2.44e-3 1.89e-3 12767/2…    42
##  8 mmu04… Cell adhesi… 38/1028   173/87… 7.87e-5  2.94e-3 2.27e-3 12487/2…    38
##  9 mmu04… Cytokine-cy… 57/1028   296/87… 8.45e-5  2.94e-3 2.27e-3 16178/1…    57
## 10 mmu04… PI3K-Akt si… 65/1028   356/87… 1.44e-4  4.50e-3 3.48e-3 12043/2…    65

Visualise a pathway

In a browser

clusterProfile has a function browseKegg that allows you to view the KEGG pathway with the genes colours in in your browser.

browseKEGG(kk, 'mmu03320')

As a file

The package pathview (Luo et al. 2013) can be used to generate figures of KEGG pathways.

One advantage over clusterProfiles browser method is that the genes are coloured according to their fold change levels in our data. To do this we need to pass pathview a named vector of fold change data (actually you could colour by any numeric vector, e.g. p-value).

The package plots the KEGG pathway to a png file in the working directory.

library(pathview)
logFC <- annotLvV$logFC
names(logFC) <- annotLvV$Entrez
pathview(gene.data = logFC, 
         pathway.id = "mmu03320", 
         species = "mmu", 
         limit = list(gene=5, cpd=1))

mmu03320.pathview.png:

mmu03320 - PPAR signaling pathway

Challenge 1

Use pathview to export a figure for “mmu04060”, but this time only use genes that are statistically significant at FDR < 0.01

GSEA analysis

Gene Set Enrichment Analysis (GSEA) identifies gene sets that are related to the difference of interest between samples (Subramanian et al. 2005).

The software is distributed by the Broad Institute and is freely available for use by academic and non-profit organisations. The Broad also provide a number of very well curated gene sets for testing against your data - the Molecular Signatures Database (MSigDB). These are collections of human genes, however. Fortunately, these lists have been translated to mouse equivalents by the Walter+Eliza Hall Institute Bioinformatics service and made available for download.

Concept

We will look into random walk, distributions and the Kolomogorov-Smirnoff test whose concept GSEA uses (inspired from this course).

Random walks

A random walk is a succession of random steps along a mathematical path. of the distance from 0.

# a path 
plot.new()
plot.window(xlim=c(-10,10), ylim=c(-5,5))
axis(side=1)

Starting at 0, we will take a step randomly in either direction and keep track of the distance from 0.

A short walk:

# random walk
rn <- rnorm(n=10,m=0,sd=0.5)
rwalk <- cumsum(sign(rn))
plot(rwalk, type="l", ylim=c(-5,5))
abline(h=0)

A longer walk:

rn <- rnorm(n=1000,m=0,sd=0.5)
rwalk <- cumsum(sign(rn))
plot(rwalk, type="l")
abline(h=0)

We can force the walk to end at 0, by having the same number of steps in both directions.

# draw observations
rn <- rnorm(n=1000,m=0,sd=0.5)
# find indices of positive values (steps to the right)
whichPos <- which(rn >= 0)
# find indices of negative values (steps to the left)
whichNeg <- which(rn < 0)
# check number of steps in each direction
length(whichPos)

## [1] 482

length(whichNeg)

## [1] 518

# concat indices for the same number of steps in each direction, eg 400 steps each
rn2 <- c(rn[whichNeg][1:400], rn[whichPos][1:400])
# re-shuffle
rn2 <- sample(rn2)
# compute walk
rwalk <- cumsum(sign(rn2))
# plot
plot(rwalk, type="l")
abline(h=0)

The probability of observing the maximum distance observed here can be computed.

Distributions

A brief description of probability distributions. We will use a example of a variable following a Normal distribution of mean 0 and standard deviation 0.5.

par(mfrow = c(1, 2))
    
# probability density function

xx <- seq(-2, 2, length=100)
plot(xx, dnorm(xx, mean=0, sd=0.5), type='l')

# cumulative distribution function

cumulative <- pnorm(xx, 0, 1)
plot(xx, cumulative, type='l')

Kolmogorov-Smirnov test

The Kolmogorov-Smirnov (KS) test of goodness-of-fit tests whether the observed data is consistent with a given cumulative density function (CDF).

To illustrate the method, we will keep the random variable following the Normal distribution. We will make some observations and plot both the observed and expected CDFs.

Because the observed data was obtained with a finite sample, the observed CDF will differ from the theoretical CDF by some random scatter.

The KS test considers that if there is no difference between the two CDFs then that difference should be a random walk, with both ends fixed at 0. For which we can compute the probability of observing a given distance.

Example:

rn <- rnorm(n=20, m=0, sd=0.5)
X <- rn
X <- sort(X)
e_cdf <- 1:length(X) / length(X)
plot(X, e_cdf, type = "s", xlim=c(-1,1), ylim=c(0,1))

xx <- seq(min(X), max(X), length=20)
cumulative <- pnorm(xx, 0, 0.5)
lines(xx, cumulative)

Another example, where the two distributions differ. The random variable now follows a Normal distribution of mean -0.5 (and standard deviation still at 0.5).

rn <- rnorm(n=20, m=-0.3, sd=0.5)
X <- rn
X <- sort(X)
e_cdf <- 1:length(X) / length(X)
plot(X, e_cdf, type = "s", xlim=c(-1,1), ylim=c(0,1))

xx <- seq(min(X), max(X), length=20)
cumulative <- pnorm(xx, 0, 0.5)
lines(xx, cumulative)

GSEA

The analysis is performed by:

1. ranking all genes in the data set
1. identifying the rank positions of all members of the gene set in the ranked data set
1. calculating an enrichment score (ES) that represents the difference between the observed rankings and that which would be expected assuming a random rank distribution.

The article describing the original software is available here, while this commentary on GSEA provides a shorter description.

We will use the fgsea package (Sergushichev 2016) that implements the same algorithm in R. ‘fgsea’ stands for’ “fast preranked gene set enrichment analysis (GSEA)”.

library(fgsea)

Ranking Data

We need to provide fgsea a vector containing numeric data by which it should rank the genes. To start with we will simply use a rank based on their fold change. We do not need to rank the genes ourselves, fgsea will do this based on the data we provide.

We must exclude genes for which we do not have Entrez IDs. Also, we should use the shrunk LFC values.

gseaDat <- filter(shrinkLvV, !is.na(Entrez))

rankData <- gseaDat$logFC
names(rankData) <- gseaDat$Entrez
head(rankData)

##     497097      19888      20671      27395      18777      21399 
##  0.5959404 -1.2525813 -0.5856117  0.4398234  0.6631350 -0.1187075

Load pathways

load("Robjects/mouse_H_v5.RData")
pathwaysH <- Mm.H

Conduct analysis

fgseaRes <- fgsea(pathwaysH, 
                  rankData, 
                  minSize = 15, 
                  maxSize = 500, 
                  nperm = 1000)

## Warning in fgsea(pathwaysH, rankData, minSize = 15, maxSize = 500, nperm
## = 1000): You are trying to run fgseaSimple. It is recommended to use
## fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument
## in the fgsea function call.

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.09% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

The warning produced indicates that there are few genes that have the same fold change and so are ranked equally. fgsea arbitrarily determines which comes first in the ranked list. As long as this number is small it shouldn’t significantly affect the results. If the number is large something is suspicious about the fold change results.

Lets look at the top 10 results.

fgseaRes %>% 
    arrange(desc(abs(NES))) %>% 
    top_n(10, -padj)

## # A tibble: 12 x 8
##    pathway              pval    padj     ES   NES nMoreExtreme  size leadingEdge
##    <chr>               <dbl>   <dbl>  <dbl> <dbl>        <dbl> <int> <list>     
##  1 HALLMARK_CHOLEST… 0.00196 0.00918  0.609  2.32            0    73 <chr [33]> 
##  2 HALLMARK_OXIDATI… 0.00212 0.00918  0.514  2.31            0   196 <chr [127]>
##  3 HALLMARK_TGF_BET… 0.00211 0.00918 -0.591 -2.10            0    54 <chr [17]> 
##  4 HALLMARK_ADIPOGE… 0.00215 0.00918  0.445  1.99            0   199 <chr [86]> 
##  5 HALLMARK_NOTCH_S… 0.00206 0.00918 -0.605 -1.93            0    31 <chr [15]> 
##  6 HALLMARK_FATTY_A… 0.00216 0.00918  0.438  1.88            0   155 <chr [56]> 
##  7 HALLMARK_MTORC1_… 0.00220 0.00918  0.410  1.84            0   203 <chr [79]> 
##  8 HALLMARK_KRAS_SI… 0.00190 0.00918 -0.415 -1.79            0   190 <chr [58]> 
##  9 HALLMARK_ESTROGE… 0.00189 0.00918 -0.411 -1.78            0   193 <chr [64]> 
## 10 HALLMARK_MYC_TAR… 0.00207 0.00918 -0.493 -1.77            0    58 <chr [27]> 
## 11 HALLMARK_INFLAMM… 0.00189 0.00918 -0.398 -1.72            0   188 <chr [56]> 
## 12 HALLMARK_TNFA_SI… 0.00186 0.00918 -0.388 -1.69            0   199 <chr [69]>

Enrichment score plot

plotEnrichment(pathwaysH[["HALLMARK_ESTROGEN_RESPONSE_EARLY"]], rankData)

Remember to check the GSEA article for the complete explanation.

Challenge 2

Another common way to rank the genes is to order by pvalue, but also, sorting so that upregulated genes are at start and downregulated at the other - you can do this combining the sign of the fold change and the pvalue.
1. Rank the genes by statisical significance - you will need to create a new ranking value using -log10({p value}) * sign({Fold Change})
2. Load the “C2” pathways from the the Robjects/mouse_c2_v5.RData file
3. Run fgsea using the new ranked genes and the C2 pathways
4. Run fgsea using the new ranked genes and the H pathways. How do these results differ from the ones we got when ranking by the fold change alone?

References

Luo, Weijun, Brouwer, and Cory. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14): 1830–1. https://doi.org/10.1093/bioinformatics/btt285.

Sergushichev, Alexey. 2016. “An Algorithm for Fast Preranked Gene Set Enrichment Analysis Using Cumulative Statistic Calculation.” bioRxiv. Cold Spring Harbor Labs Journals. https://doi.org/10.1101/060012.

Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43). National Academy of Sciences: 15545–50. https://doi.org/10.1073/pnas.0506580102.

Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “ClusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.

RNA-seq analysis in R

Gene Set Testing for RNA-seq

Stephane Ballereau, Dominique-Laurent Couturier, Mark Dunning, Abbi Edwards, Ashley Sawle

Last modified: 03 Jul 2020