library(tidyverse)
# load the data
load("Robjects/Annotated_Results_LvV.RData")
goseq
GOseq is a method to conduct Gene Ontology (GO) analysis suitable for RNA-seq data as it accounts for the gene length bias in detection of over-representation (Young et al. 2010).
From the GOseq vignette:
library(goseq)
supportedOrganisms() %>% filter(str_detect(Genome, "mm"))
## # A tibble: 10 x 5
## Genome Id `Id Description` `Lengths in geneLeneD… `GO Annotation Avai…
## <chr> <chr> <chr> <lgl> <lgl>
## 1 mm10 "" "" FALSE TRUE
## 2 mm7 "knownGe… "Entrez Gene ID" TRUE TRUE
## 3 mm7 "ensGene" "Ensembl gene I… TRUE TRUE
## 4 mm7 "geneSym… "Gene Symbol" TRUE TRUE
## 5 mm8 "knownGe… "Entrez Gene ID" TRUE TRUE
## 6 mm8 "ensGene" "Ensembl gene I… TRUE TRUE
## 7 mm8 "geneSym… "Gene Symbol" TRUE TRUE
## 8 mm9 "knownGe… "Entrez Gene ID" TRUE TRUE
## 9 mm9 "ensGene" "Ensembl gene I… TRUE TRUE
## 10 mm9 "geneSym… "Gene Symbol" TRUE TRUE
The input for goseq
is a vector that indicates, for each gene, whether or not it is significantly differentially expressed. This should be a named vector, where the names are the gene ids and the values are 1
if the gene is significant and 0
if it is not.
In this case we can use the Ensembl gene IDs.
sigData <- as.integer(!is.na(shrinkLvV$FDR) & shrinkLvV$FDR < 0.01)
names(sigData) <- shrinkLvV$GeneID
pwf <- nullp(sigData, "mm10", "ensGene", bias.data = shrinkLvV$medianTxLength)
## Warning in pcls(G): initial point very close to some inequality constraints
goResults <- goseq(pwf, "mm10","ensGene", test.cats=c("GO:BP"))
goResults %>%
top_n(10, wt=-over_represented_pvalue) %>%
mutate(hitsPerc=numDEInCat*100/numInCat) %>%
ggplot(aes(x=hitsPerc,
y=term,
colour=over_represented_pvalue,
size=numDEInCat)) +
geom_point() +
expand_limits(x=0) +
labs(x="Hits (%)", y="GO term", colour="p value", size="Count")
library(GO.db)
GOTERM[[goResults$category[1]]]
## GOID: GO:0044281
## Term: small molecule metabolic process
## Ontology: BP
## Definition: The chemical reactions and pathways involving small
## molecules, any low molecular weight, monomeric, non-encoded
## molecule.
## Synonym: small molecule metabolism
Challenge 2
- Create a vector showing genes that are statistically significant at FDR < 0.01 and that are up-regulated by at least 4x (logFC>2) in lactating mice
- Run a
goseq
analysis on this gene list- Plot the results
- How is this result different to the previous GO analysis?
Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia Oshlack. 2010. “Gene Ontology Analysis for Rna-Seq: Accounting for Selection Bias.” Genome Biology 11: R14.