The list of differentially expressed genes is sometimes so long that its interpretation becomes cumbersome and time consuming. A common downstream procedure is gene set testing. It aims to understand which pathways or gene networks the differentially expressed genes are implicated in.
Various ways exist to test for enrichment of biological pathways.
GSEA analysis
Gene Set Enrichment Analysis GSEA was tests whether a set of genes of interest, e.g. genes (Subramanian et al. 2005). The software is distributed by the Broad Institute and is freely available for use by academic and non-profit organisations.
In addition to the GSEA software the Broad also provide a number of very well curated gene set
s for testing against your data - the Molecular Signatures Database (MSigDB). Unfortunately, these are collections of human genes, however, these lists have been translated to mouse equivalents by the Walter+Eliza Hall Institutes Bioinformatics service and made avaialble for download.
The analysis is performed by:
- ranking all genes in the
data set
- identifying the rank positions of all members of the
gene set
in the ranked data set
- calculating an enrichment score (ES) that represents the difference between the observed rankings and that which would be expected assuming a random rank distribution.
commentary on GSEA. The article describing the original software is available here.
fgsea
The fgsea
package (Sergushichev 2016) implements the same algorithm in R vignette “fast preranked gene set enrichment analysis (GSEA)”.
library(fgsea)
Loading required package: Rcpp
load("Robjects/Annotated_Results_LvV.RData")
Create ranks
Rank all genes based on their fold change. We need to exclude genes for which we do not have Entrez IDs. Also, we should use the shrunk LFC values.
gseaDat <- filter(shrinkLvV, !is.na(Entrez))
ranks <- gseaDat$logFC
names(ranks) <- gseaDat$Entrez
head(ranks)
497097 19888 20671 27395 18777 21399
0.5851914 -1.0498457 -0.5893634 0.4313540 0.6478714 -0.1110589
Plot the ranked fold changes.
barplot(sort(ranks, decreasing = T))
Load pathways
load("Robjects/mouse_H_v5.RData")
pathwaysH <- Mm.H
Conduct analysis
fgseaRes <- fgsea(pathwaysH, ranks, minSize=15, maxSize = 500, nperm=1000)
There are ties in the preranked stats (0.05% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
The warning produced indicates that there are few genes that have the same fold change and so are ranked equally. fgsea
with arbitrarily order determine which comes first in the ranked list. As long as this number is small it shouldn’t significantly effect the results. If the number is large something is suspicious about the fold change results.
Lets look at the top 10 results.
head(fgseaRes[order(padj, -abs(NES)), ], n=10)
Enrichment score plot
plotEnrichment(pathwaysH[["HALLMARK_ESTROGEN_RESPONSE_EARLY"]], ranks)
Remember to check the GSEA article for the complete explanation.
GSEA table plot
The function plotGseaTable
allows us to plot a summary figue showing the results for multiple pathways.
topUp <- fgseaRes %>%
filter(ES > 0) %>%
top_n(10, wt=-padj)
topDown <- fgseaRes %>%
filter(ES < 0) %>%
top_n(10, wt=-padj)
topPathways <- bind_rows(topUp, topDown) %>%
arrange(-ES)
plotGseaTable(pathwaysH[topPathways$pathway],
ranks,
fgseaRes,
gseaParam = 0.5)
Challenge 1
Another common way to rank the genes is to order by pvalue, but also, sorting so that upregulated genes are at start and downregulated at the other - you can do this combining the sign of the fold change and the pvalue.
1. Rank the genes by statisical significance - you will need to create a new ranking value using -log10({p value}) * sign({Fold Change})
2. Load the “C2” pathways from the the data/mouse_c2_v5.RData
file
3. Run fgsea
using the new ranked genes and the C2 pathways
4. Run fgsea
using the new ranked genes and the H pathways. How do these results differ from the ones we got when ranking by the fold change alone?
GO enrichment analysis
goseq
GOseq is a method to conduct Gene Ontology (GO) analysis suitable for RNA-seq data as it accounts for the gene length bias in detection of over-representation (Young et al. 2010)
From the GOseq vignette:
- GOseq first needs to quantify the length bias present in the dataset under consideration.
- This is done by calculating a Probability Weighting Function or PWF which can be thought of as a function which gives the probability that a gene will be differentially expressed (DE), based on its length alone.
- The PWF is calculated by fitting a monotonic spline to the binary data series of differential expression (1=DE, 0=Not DE) as a function of gene length.
- The PWF is used to weight the chance of selecting each gene when forming a null distribution for GO category membership.
- The fact that the PWF is calculated directly from the dataset under consideration makes this approach robust, only correcting for the length bias present in the data.
“GO analysis of RNA-seq data requires the use of random sampling in order to generate a suitable null distribution for GO category membership and calculate each category’s significance for over representation amongst DE genes. … In most cases, the Wallenius distribution can be used to approximate the true null distribution, without any significant loss in accuracy. The goseq package implements this approximation as its default option.”
library(goseq)
supportedOrganisms() %>% filter(str_detect(Genome, "mm"))
NAs introduced by coercionNAs introduced by coercionNAs introduced by coercionNAs introduced by coercionNAs introduced by coercion
Create a list of differentially expressed genes
isSigGene <- shrinkLvV$FDR < 0.01 & !is.na(shrinkLvV$FDR)
genes <- as.integer(isSigGene)
names(genes) <- shrinkLvV$GeneID
Fit the Probability Weighting Function (PWF)
pwf <- nullp(genes, "mm10", "ensGene", bias.data = shrinkLvV$medianTxLength)
initial point very close to some inequality constraints
Conduct GO enrichment analysis
goResults <- goseq(pwf, "mm10","ensGene", test.cats=c("GO:BP"))
Plot the top 10
goResults %>%
top_n(10, wt=-over_represented_pvalue) %>%
mutate(hitsPerc=numDEInCat*100/numInCat) %>%
ggplot(aes(x=hitsPerc,
y=term,
colour=over_represented_pvalue,
size=numDEInCat)) +
geom_point() +
expand_limits(x=0) +
labs(x="Hits (%)", y="GO term", colour="p value", size="Count")
KEGG pathway enrichment analysis
clusterProfiler
We can analyse for enrichment of KEGG pathways in much the same way as for GO terms. We could also use goseq
for this, but this time we’re going to use clusterProfiler
(Yu et al. 2012). clusterprofiler
supports direct online access of the current KEGG database, rather than relying on R annotation packages, it also provides some nice visualisation options.
library(clusterProfiler)
search_kegg_organism('mmu', by='kegg_code')
KEGG enrichment analysis
sigGenes <- shrinkLvV$Entrez[ shrinkLvV$FDR < 0.01 &
!is.na(shrinkLvV$FDR) &
abs(shrinkLvV$logFC) > 1 ]
sigGenes <- na.exclude(sigGenes)
kk <- enrichKEGG(gene = sigGenes, organism = 'mmu')
head(kk, n=10)
Visualise a pathway
In a browser
clusterProfile
has a function browseKegg
that allows you to view the Kegg pathway with the genes colours in in your browser.
browseKEGG(kk, 'mmu03320')
As a file
The package pathview
(Luo et al. 2013) can be used to generate figures of KEGG pathways. One advantage over clusterProfiles
browser methos is that the genes are coloured according to their expression levels in our data. The package plots the KEGG pathway to a png
file in the working directory.
library(pathview)
Loading required package: org.Hs.eg.db
##############################################################################
Pathview is an open source software package distributed under GNU General
Public License version 3 (GPLv3). Details of GPLv3 is available at
http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
formally cite the original Pathview paper (not just mention it) in publications
or products. For details, do citation("pathview") within R.
The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
license agreement (details at http://www.kegg.jp/kegg/legal.html).
##############################################################################
logFC <- annotLvV$logFC
names(logFC) <- annotLvV$Entrez
pathview(gene.data = logFC,
pathway.id = "mmu03320",
species = "mmu",
limit = list(gene=5, cpd=1))
Info: Downloading xml files for mmu03320, 1/1 pathways..
Info: Downloading png files for mmu03320, 1/1 pathways..
'select()' returned 1:1 mapping between keys and columns
Info: Working in directory /Users/sawle01/Documents/RNAseq_course/CRUK_CI_RNAseq_course/Course_Materials
Info: Writing image file mmu03320.pathview.png
mmu03320.pathview.png:
Challenge 3
- Use
pathview
to export a figure for “mmu04060”, but this time only use genes that are statistically significant at FDR < 0.01
---
title: "RNA-seq analysis in R"
author: "Stephane Ballereau, Mark Dunning, Abbi Edwards, Oscar Rueda, Ashley Sawle"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_notebook:
    toc: yes
  html_document:
    toc: yes
minutes: 300
layout: page
subtitle: Gene Set Testing for RNA-seq
bibliography: ref.bib
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE}
library(tidyverse)
```

The list of differentially expressed genes is sometimes so long that its interpretation becomes cumbersome and time consuming. A common downstream procedure is gene set testing. It aims to understand which pathways or gene networks the differentially expressed genes are implicated in.

Various ways exist to test for enrichment of biological pathways.

# GSEA analysis

Gene Set Enrichment Analysis GSEA was tests whether a set of genes of interest, 
e.g. genes
[@Subramanian15545]. The software is distributed by the [Broad 
Institute](http://software.broadinstitute.org/gsea/index.jsp) and is freely 
available for use by academic and non-profit organisations. 

In addition to the GSEA software the Broad also provide a number of very well 
curated `gene set`s for testing against your data - the [Molecular Signatures 
Database (MSigDB)](http://software.broadinstitute.org/gsea/msigdb/index.jsp). 
Unfortunately, these are collections of human genes, however, these lists
have been translated to mouse equivalents by the Walter+Eliza Hall Institutes
Bioinformatics service and made avaialble for [download](http://bioinf.wehi.edu.au/software/MSigDB/).


The analysis is performed by:

- (i) ranking all genes in the `data set`
- (ii) identifying the rank positions of all members of the `gene set` in the 
ranked `data set`
- (iii) calculating an enrichment score (ES) that represents the difference 
between the observed rankings and that which would be expected assuming a random 
rank distribution.

[commentary on GSEA](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1266131/). 
The article describing the original software is available 
[here](http://www.pnas.org/content/102/43/15545.long).

## `fgsea`

The `fgsea` package [@Sergushichev2016] implements the same algorithm in R [vignette](http://www.bioconductor.org/packages/release/bioc/vignettes/fgsea/inst/doc/fgsea-tutorial.html) "fast preranked gene set enrichment analysis (GSEA)".

```{r fgsea}
library(fgsea)
```

```{r loadData}
load("Robjects/Annotated_Results_LvV.RData")
```

## Create ranks

Rank all genes based on their fold change. We need to exclude genes for
which we do not have Entrez IDs. Also, we should use the shrunk LFC values.

```{r preparedata}
gseaDat <- filter(shrinkLvV, !is.na(Entrez))

ranks <- gseaDat$logFC
names(ranks) <- gseaDat$Entrez
head(ranks)
```

Plot the ranked fold changes.

```{r}
barplot(sort(ranks, decreasing = T))
```

## Load pathways

```{r}
load("Robjects/mouse_H_v5.RData")
pathwaysH <- Mm.H
```

## Conduct analysis

```{r}
fgseaRes <- fgsea(pathwaysH, ranks, minSize=15, maxSize = 500, nperm=1000)
```

The warning produced indicates that there are few genes that have the same fold
change and so are ranked equally. `fgsea` with arbitrarily order determine which
comes first in the ranked list. As long as this number is small it shouldn't
significantly effect the results. If the number is large something is 
suspicious about the fold change results.

Lets look at the top 10 results.

```{r}
head(fgseaRes[order(padj, -abs(NES)), ], n=10)
```

## Enrichment score plot

```{r}
plotEnrichment(pathwaysH[["HALLMARK_ESTROGEN_RESPONSE_EARLY"]], ranks)
```

Remember to check the [GSEA article](http://www.pnas.org/content/102/43/15545.full) for the complete explanation.

## GSEA table plot

The function `plotGseaTable` allows us to plot a summary figue showing the 
results for multiple pathways.

```{r}
topUp <- fgseaRes %>% 
    filter(ES > 0) %>% 
    top_n(10, wt=-padj)
topDown <- fgseaRes %>% 
    filter(ES < 0) %>% 
    top_n(10, wt=-padj)
topPathways <- bind_rows(topUp, topDown) %>% 
    arrange(-ES)

plotGseaTable(pathwaysH[topPathways$pathway], 
              ranks, 
              fgseaRes, 
              gseaParam = 0.5)
```

> ## Challenge 1 {.challenge}
>
> Another common way to rank the genes is to order by pvalue, but also, sorting
> so that upregulated genes are at start and downregulated at the other - 
> you can do this combining the sign of the fold change and the pvalue.  
> 1. Rank the genes by statisical significance - you will need to create
> a new ranking value using `-log10({p value}) * sign({Fold Change})`  
> 2. Load the "C2" pathways from the the `data/mouse_c2_v5.RData` file   
> 3. Run `fgsea` using the new ranked genes and the C2 pathways  
> 4. Run `fgsea` using the new ranked genes and the H pathways. How do these 
> results differ from the ones we got when ranking by the fold change alone?  

```{r solution1}

```

# GO enrichment analysis

## `goseq`

GOseq is a method to conduct Gene Ontology (GO) analysis suitable for RNA-seq data as it accounts for the gene length bias in detection of over-representation [[@Young2010]](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-2-r14)

From the [GOseq vignette](https://www.bioconductor.org/packages/release/bioc/vignettes/goseq/inst/doc/goseq.pdf):

- GOseq first needs to quantify the length bias present in the dataset under consideration.
- This is done by calculating a Probability Weighting Function or PWF which can be thought of as a function which gives the probability that a gene will be differentially expressed (DE), based on its length alone.
- The PWF is calculated by fitting a monotonic spline to the binary data series of differential expression (1=DE, 0=Not DE) as a function of gene length.
- The PWF is used to weight the chance of selecting each gene when forming a null distribution for GO category membership.
- The fact that the PWF is calculated directly from the dataset under consideration makes this approach robust, only correcting for the length bias present in the data.

"GO analysis of RNA-seq data requires the use of random sampling in order to generate a suitable null distribution for GO category membership and calculate each category's significance for over representation amongst DE genes. ... In  most  cases,  the  Wallenius distribution can be used to approximate the true null distribution, without any significant loss in accuracy. The goseq package implements this approximation as its default option."

```{r goSeqPackage, message=FALSE}
library(goseq)
supportedOrganisms() %>% filter(str_detect(Genome, "mm"))
```

## Create a list of differentially expressed genes

```{r getDEGs}
isSigGene <- shrinkLvV$FDR < 0.01 & !is.na(shrinkLvV$FDR)
genes <- as.integer(isSigGene)
names(genes) <- shrinkLvV$GeneID
```

## Fit the Probability Weighting Function (PWF)

```{r pwFunction}
pwf <- nullp(genes, "mm10", "ensGene", bias.data = shrinkLvV$medianTxLength)
```

## Conduct GO enrichment analysis

```{r runGoseq, message=FALSE}
goResults <- goseq(pwf, "mm10","ensGene", test.cats=c("GO:BP"))
```

## Plot the top 10

```{r plotGO}
goResults %>% 
    top_n(10, wt=-over_represented_pvalue) %>% 
    mutate(hitsPerc=numDEInCat*100/numInCat) %>% 
    ggplot(aes(x=hitsPerc, 
               y=term, 
               colour=over_represented_pvalue, 
               size=numDEInCat)) +
        geom_point() +
        expand_limits(x=0) +
        labs(x="Hits (%)", y="GO term", colour="p value", size="Count")
```

## Get the GO information for the GO accessions

```{r getGOinfo}
library(GO.db)
GOTERM[[goResults$category[1]]]
```

> ## Challenge 2 {.challenge}
>
> 1. Create a list of genes that are statistically significant at FDR < 0.01 and
> are up-regulated by at least 4x (logFC>2)
> in lactating mice
> 2. Run a `goseq` analysis on this gene list
> 3. Plot the results
> 4. How is this result different to the previous GO analysis?

```{r solution2, eval=F}

```

# KEGG pathway enrichment analysis

## `clusterProfiler`

We can analyse for enrichment of KEGG pathways in much the same way as for GO 
terms. We could also use `goseq` for this, but this time we're going to use 
`clusterProfiler` [@Yu2012]. `clusterprofiler` supports direct online access of
the current KEGG database, rather than relying on R annotation packages, it
also provides some nice visualisation options.

```{r loadClusterProfiler, message=FALSE}
library(clusterProfiler)
search_kegg_organism('mmu', by='kegg_code')
```

## KEGG enrichment analysis

```{r enrichKEGG}
sigGenes <- shrinkLvV$Entrez[ shrinkLvV$FDR < 0.01 & 
                              !is.na(shrinkLvV$FDR) &
                              abs(shrinkLvV$logFC) > 1 ]
sigGenes <- na.exclude(sigGenes)
kk <- enrichKEGG(gene = sigGenes, organism = 'mmu')
head(kk, n=10)
```

## Visualise a pathway

### In a browser

`clusterProfile` has a function `browseKegg` that allows you to view the Kegg
pathway with the genes colours in in your browser.

```{r browseKegg}
browseKEGG(kk, 'mmu03320')
```

### As a file

The package `pathview` [@Luo2013] can be used to generate figures of KEGG 
pathways. One advantage over `clusterProfiles` browser methos is that the genes
are coloured according to their expression levels in our data. The package plots
the KEGG pathway to a `png` file in the working directory.

```{r pathview}
library(pathview)

logFC <- annotLvV$logFC
names(logFC) <- annotLvV$Entrez
pathview(gene.data = logFC, 
         pathway.id = "mmu03320", 
         species = "mmu", 
         limit = list(gene=5, cpd=1))
```

mmu03320.pathview.png:

![mmu03320 - PPAR signaling pathway](../images/mmu03320.pathview.png)

> ## Challenge 3 {.challenge}
>
> 1. Use `pathview` to export a figure for "mmu04060", but this time only
> use genes that are statistically significant at FDR < 0.01

```{r solution3, eval=F}

```

---------------------------------------------------------------

# References