library(AnnotationHub)
library(AnnotationDbi)
library(ensembldb)
library(DESeq2)
library(tidyverse)
Before starting this section, we will make sure we have all the relevant objects from the Differential Expression analysis.
ddsObj.interaction <- readRDS("RObjects/DESeqDataSet.interaction.rds")
results.interaction.11 <- readRDS("RObjects/DESeqResults.interaction_d11.rds")
We have a list of significantly deferentially expressed genes, but the only annotation we can see is the Ensembl Gene ID, which is not very informative.
There are a number of ways to add annotation. One method is to do this using a Bioconductor annotation package. These packages which are re-built every periodically with the latest annotations. These packages are listed on the annotation section of the Bioconductor, and are installed in the same way as regular Bioconductor packages.
An another approach is to use biomaRt
, an interface to the BioMart resource. Using BioMart ensures that you are able to get the latest annotations for the GeneIDs, and can match the version of the gene annotation that was used for read counting.
A third method is to use AnnotationHub
, this is like the bioconductor packages but in an online database like bioMaRt
. They keep them slightly more up to date than the standard bioconductor packages and each time you use them the results are cached on your machine.
Today we will use the AnnotationHub
method. A workflow for annotation with biomaRt is included in the extended materials section accessible on the course website.
First we need to get the correct database from AnnotationHub
. We make the instance (the first time we do this it will create a local cache on your machine so that repeat queries are very quick).
As you can see ah
contains huge amounts of information and it is constantly changing. This is why it gives us the snapshot date so we know when our cached version is from. The ah
object actually online contains pointers to where all the information is online and we don’t want to download all of them as it would take a very long time and we don’t need all of it.
This object is a vector and you can get information about a single resource by indexing with a single bracket [
or download a resource with a double bracket [[
.
# create an annotationhub instance
ah <- AnnotationHub()
ah
## AnnotationHub with 71634 records
## # snapshotDate(): 2024-04-30
## # $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norv...
## # $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, SQLiteFil...
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH5012"]]'
##
## title
## AH5012 | Chromosome Band
## AH5013 | STS Markers
## AH5014 | FISH Clones
## AH5015 | Recomb Rate
## AH5016 | ENCODE Pilot
## ... ...
## AH117051 | Ensembl 112 EnsDb for Xiphophorus maculatus
## AH117052 | Ensembl 112 EnsDb for Xenopus tropicalis
## AH117053 | Ensembl 112 EnsDb for Zonotrichia albicollis
## AH117054 | Ensembl 112 EnsDb for Zalophus californianus
## AH117055 | Ensembl 112 EnsDb for Zosterops lateralis melanops
# Download the database we want to use
ahQueryResult <- query(ah, c("EnsDb", "Mus musculus", "102"))
ahQueryResult
## AnnotationHub with 1 record
## # snapshotDate(): 2024-04-30
## # names(): AH89211
## # $dataprovider: Ensembl
## # $species: Mus musculus
## # $rdataclass: EnsDb
## # $rdatadateadded: 2020-10-27
## # $title: Ensembl 102 EnsDb for Mus musculus
## # $description: Gene and protein annotations for Mus musculus based on Ensem...
## # $taxonomyid: 10090
## # $genome: GRCm38
## # $sourcetype: ensembl
## # $sourceurl: http://www.ensembl.org
## # $sourcesize: NA
## # $tags: c("102", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
## # "Protein", "Transcript")
## # retrieve record with 'object[["AH89211"]]'
MouseEnsDb <- ahQueryResult[[1]]
This database contains the entire gene annotation from Ensembl release 102 for mouse. It includes descriptions of genes, transcripts, exons, UTRs etc.
We can turn the whole gene-level annotation table into a data frame so we can work with it using the tidyverse suite of tools.
annotations <- genes(MouseEnsDb, return.type = "data.frame")
# lets see what information we have
colnames(annotations)
## [1] "gene_id" "gene_name" "gene_biotype"
## [4] "gene_seq_start" "gene_seq_end" "seq_name"
## [7] "seq_strand" "seq_coord_system" "description"
## [10] "gene_id_version" "canonical_transcript" "symbol"
## [13] "entrezid"
Let’s inspect the annotation.
head(annotations)
## gene_id gene_name gene_biotype gene_seq_start
## 1 ENSMUSG00000102693 4933401J01Rik TEC 3073253
## 2 ENSMUSG00000064842 Gm26206 snRNA 3102016
## 3 ENSMUSG00000051951 Xkr4 protein_coding 3205901
## 4 ENSMUSG00000102851 Gm18956 processed_pseudogene 3252757
## 5 ENSMUSG00000103377 Gm37180 TEC 3365731
## 6 ENSMUSG00000104017 Gm37363 TEC 3375556
## gene_seq_end seq_name seq_strand seq_coord_system
## 1 3074322 1 1 chromosome
## 2 3102125 1 1 chromosome
## 3 3671498 1 -1 chromosome
## 4 3253236 1 1 chromosome
## 5 3368549 1 -1 chromosome
## 6 3377788 1 -1 chromosome
## description
## 1 RIKEN cDNA 4933401J01 gene [Source:MGI Symbol;Acc:MGI:1918292]
## 2 predicted gene, 26206 [Source:MGI Symbol;Acc:MGI:5455983]
## 3 X-linked Kx blood group related 4 [Source:MGI Symbol;Acc:MGI:3528744]
## 4 predicted gene, 18956 [Source:MGI Symbol;Acc:MGI:5011141]
## 5 predicted gene, 37180 [Source:MGI Symbol;Acc:MGI:5610408]
## 6 predicted gene, 37363 [Source:MGI Symbol;Acc:MGI:5610591]
## gene_id_version canonical_transcript symbol entrezid
## 1 ENSMUSG00000102693.1 ENSMUST00000193812 4933401J01Rik NA
## 2 ENSMUSG00000064842.1 ENSMUST00000082908 Gm26206 NA
## 3 ENSMUSG00000051951.5 ENSMUST00000070533 Xkr4 497097
## 4 ENSMUSG00000102851.1 ENSMUST00000192857 Gm18956 NA
## 5 ENSMUSG00000103377.1 ENSMUST00000195335 Gm37180 NA
## 6 ENSMUSG00000104017.1 ENSMUST00000192336 Gm37363 NA
dim(annotations)
## [1] 56305 13
length(unique(annotations$entrezid))
## [1] 21282
sum(is.na(annotations$entrezid)) # Why are there NAs in the ENTREZID column?
## [1] 34686
Gene/transcript/protein IDs mapping between different databases not always perfect. Although majority of IDs map between databases, small subset may not have matching ID or may have more than one match. This is because feature identification algorithms, naming methodologies and versions may differ among databases. For instance NCBI and HGNC give same ID for different gene versions, whereas Ensembl assigned separate IDs for gene versions. Read interesting discussion on biostars.
There are some Ensembl IDs with no EntrezID. These gene ids has no corresponding Entrez ID in the EnsDb
database package. The Ensembl and Entrez databases don’t match on a 1:1 level although they have started taking steps towards consolidating in recent years.
Select the columns we want and rename them.
annot <- annotations %>%
select(GeneID = gene_id, Symbol = symbol, Description = description) %>%
distinct()
NOTE: You may get an error with this command that looks like:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'select' for signature '"data.frame"'
This is due to the select
function from dplyr (part of tidyverse) being masked by the select
function from one of the annotation packages. This will have happened because the annotation package was loaded after the tidyverse. You can either restart your R session and reload the required packages, this time being sure to load tidyverse last, or just use dplyr::select
to explicitly use the select
function from dplyr:
annot <- annotations %>%
dplyr::select(GeneID = gene_id, Symbol = symbol, Description = description)
annot.interaction.11 <- results.interaction.11 %>%
as.data.frame() %>%
rownames_to_column("GeneID") %>%
as_tibble() %>%
left_join(annot, by="GeneID")
annot.interaction.11
## # A tibble: 20,091 × 9
## GeneID baseMean log2FoldChange lfcSE stat pvalue padj Symbol
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 ENSMUSG00000000… 1103. -0.164 0.141 -1.16 2.45e-1 0.688 Gnai3
## 2 ENSMUSG00000000… 58.6 0.427 0.393 1.09 2.77e-1 0.719 Cdc45
## 3 ENSMUSG00000000… 49.2 -0.238 0.627 -0.380 7.04e-1 0.929 Scml2
## 4 ENSMUSG00000000… 7.99 0.819 0.974 0.841 4.00e-1 0.803 Apoh
## 5 ENSMUSG00000000… 1981. 0.128 0.0998 1.29 1.99e-1 0.640 Narf
## 6 ENSMUSG00000000… 607. -0.304 0.148 -2.05 4.01e-2 0.299 Cav2
## 7 ENSMUSG00000000… 1297. 0.520 0.146 3.57 3.59e-4 0.00842 Klf6
## 8 ENSMUSG00000000… 1063. 0.106 0.189 0.561 5.75e-1 0.885 Scmh1
## 9 ENSMUSG00000000… 2286. -0.255 0.132 -1.94 5.29e-2 0.352 Cox5a
## 10 ENSMUSG00000000… 92.7 -0.320 0.288 -1.11 2.67e-1 0.709 Tbx2
## # ℹ 20,081 more rows
## # ℹ 1 more variable: Description <chr>
A quick and easy “sanity check” for our DE results is to generate a p-value histogram. What we should see is a high bar at 0 - 0.05
and then a roughly uniform tail to the right of this. There is a nice explanation of other possible patterns in the histogram and what to do when you see them in this post.
hist(annot.interaction.11$pvalue)