GO enrichment analysis

`goseq`

GOseq is a method to conduct Gene Ontology (GO) analysis suitable for RNA-seq data as it accounts for the gene length bias in detection of over-representation (Young et al. 2010).

From the GOseq vignette:

GOseq first needs to quantify the length bias present in the dataset under consideration.
This is done by calculating a Probability Weighting Function or PWF which can be thought of as a function which gives the probability that a gene will be differentially expressed (DE), based on its length alone.
The PWF is calculated by fitting a monotonic spline to the binary data series of differential expression (1=DE, 0=Not DE) as a function of gene length.
The PWF is used to weight the chance of selecting each gene when forming a null distribution for GO category membership.
The fact that the PWF is calculated directly from the dataset under consideration makes this approach robust, only correcting for the length bias present in the data.

library(goseq)
supportedOrganisms() %>% filter(str_detect(Genome, "mm"))

## # A tibble: 10 x 5
##    Genome Id        `Id Description` `Lengths in geneLeneD… `GO Annotation Avai…
##    <chr>  <chr>     <chr>            <lgl>                  <lgl>               
##  1 mm10   ""        ""               FALSE                  TRUE                
##  2 mm7    "knownGe… "Entrez Gene ID" TRUE                   TRUE                
##  3 mm7    "ensGene" "Ensembl gene I… TRUE                   TRUE                
##  4 mm7    "geneSym… "Gene Symbol"    TRUE                   TRUE                
##  5 mm8    "knownGe… "Entrez Gene ID" TRUE                   TRUE                
##  6 mm8    "ensGene" "Ensembl gene I… TRUE                   TRUE                
##  7 mm8    "geneSym… "Gene Symbol"    TRUE                   TRUE                
##  8 mm9    "knownGe… "Entrez Gene ID" TRUE                   TRUE                
##  9 mm9    "ensGene" "Ensembl gene I… TRUE                   TRUE                
## 10 mm9    "geneSym… "Gene Symbol"    TRUE                   TRUE

Create a list of differentially expressed genes

The input for goseq is a vector that indicates, for each gene, whether or not it is significantly differentially expressed. This should be a named vector, where the names are the gene ids and the values are 1 if the gene is significant and 0 if it is not.

In this case we can use the Ensembl gene IDs.

sigData <- as.integer(!is.na(shrinkLvV$FDR) & shrinkLvV$FDR < 0.01)
names(sigData) <- shrinkLvV$GeneID

Fit the Probability Weighting Function (PWF)

pwf <- nullp(sigData, "mm10", "ensGene", bias.data = shrinkLvV$medianTxLength)

## Warning in pcls(G): initial point very close to some inequality constraints

Conduct GO enrichment analysis

goResults <- goseq(pwf, "mm10","ensGene", test.cats=c("GO:BP"))

Plot the top 10

goResults %>% 
    top_n(10, wt=-over_represented_pvalue) %>% 
    mutate(hitsPerc=numDEInCat*100/numInCat) %>% 
    ggplot(aes(x=hitsPerc, 
               y=term, 
               colour=over_represented_pvalue, 
               size=numDEInCat)) +
        geom_point() +
        expand_limits(x=0) +
        labs(x="Hits (%)", y="GO term", colour="p value", size="Count")

Get the GO information for the GO accessions

library(GO.db)
GOTERM[[goResults$category[1]]]

## GOID: GO:0044281
## Term: small molecule metabolic process
## Ontology: BP
## Definition: The chemical reactions and pathways involving small
##     molecules, any low molecular weight, monomeric, non-encoded
##     molecule.
## Synonym: small molecule metabolism

Challenge 2

Create a vector showing genes that are statistically significant at FDR < 0.01 and that are up-regulated by at least 4x (logFC>2) in lactating mice

Run a goseq analysis on this gene list

Plot the results

How is this result different to the previous GO analysis?

Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia Oshlack. 2010. “Gene Ontology Analysis for Rna-Seq: Accounting for Selection Bias.” Genome Biology 11: R14.

RNA-seq analysis in R

Gene Set Testing for RNA-seq

Stephane Ballereau, Dominique-Laurent Couturier, Mark Dunning, Abbi Edwards, Ashley Sawle

Last modified: 24 Jul 2020