library(AnnotationHub)
library(AnnotationDbi)
library(ensembldb)
library(DESeq2)
library(tidyverse)

Before starting this section, we will make sure we have all the relevant objects from the Differential Expression analysis.

ddsObj.interaction <- readRDS("RObjects/DESeqDataSet.interaction.rds")
results.interaction.11 <- readRDS("RObjects/DESeqResults.interaction_d11.rds")

Overview

Adding annotation to the DESeq2 results

We have a list of significantly deferentially expressed genes, but the only annotation we can see is the Ensembl Gene ID, which is not very informative.

There are a number of ways to add annotation. One method is to do this using a Bioconductor annotation package. These packages which are re-built every periodically with the latest annotations. These packages are listed on the annotation section of the Bioconductor, and are installed in the same way as regular Bioconductor packages.

An another approach is to use biomaRt, an interface to the BioMart resource. Using BioMart ensures that you are able to get the latest annotations for the GeneIDs, and can match the version of the gene annotation that was used for read counting.

A third method is to use AnnotationHub, this is like the bioconductor packages but in an online database like bioMaRt. They keep them slightly more up to date than the standard bioconductor packages and each time you use them the results are cached on your machine.

Today we will use the AnnotationHub method. A workflow for annotation with biomaRt is included in the extended materials section accessible on the course website.

Query the database

First we need to get the correct database from AnnotationHub. We make the instance (the first time we do this it will create a local cache on your machine so that repeat queries are very quick).

As you can see ah contains huge amounts of information and it is constantly changing. This is why it gives us the snapshot date so we know when our cached version is from. The ah object actually online contains pointers to where all the information is online and we don’t want to download all of them as it would take a very long time and we don’t need all of it.

This object is a vector and you can get information about a single resource by indexing with a single bracket [ or download a resource with a double bracket [[.

# create an annotationhub instance
ah <- AnnotationHub()
ah
## AnnotationHub with 71634 records
## # snapshotDate(): 2024-04-30
## # $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norv...
## # $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, SQLiteFil...
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH5012"]]' 
## 
##              title                                             
##   AH5012   | Chromosome Band                                   
##   AH5013   | STS Markers                                       
##   AH5014   | FISH Clones                                       
##   AH5015   | Recomb Rate                                       
##   AH5016   | ENCODE Pilot                                      
##   ...        ...                                               
##   AH117051 | Ensembl 112 EnsDb for Xiphophorus maculatus       
##   AH117052 | Ensembl 112 EnsDb for Xenopus tropicalis          
##   AH117053 | Ensembl 112 EnsDb for Zonotrichia albicollis      
##   AH117054 | Ensembl 112 EnsDb for Zalophus californianus      
##   AH117055 | Ensembl 112 EnsDb for Zosterops lateralis melanops
# Download the database we want to use
ahQueryResult <- query(ah, c("EnsDb", "Mus musculus", "102"))
ahQueryResult
## AnnotationHub with 1 record
## # snapshotDate(): 2024-04-30
## # names(): AH89211
## # $dataprovider: Ensembl
## # $species: Mus musculus
## # $rdataclass: EnsDb
## # $rdatadateadded: 2020-10-27
## # $title: Ensembl 102 EnsDb for Mus musculus
## # $description: Gene and protein annotations for Mus musculus based on Ensem...
## # $taxonomyid: 10090
## # $genome: GRCm38
## # $sourcetype: ensembl
## # $sourceurl: http://www.ensembl.org
## # $sourcesize: NA
## # $tags: c("102", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
## #   "Protein", "Transcript") 
## # retrieve record with 'object[["AH89211"]]'
MouseEnsDb <- ahQueryResult[[1]]

This database contains the entire gene annotation from Ensembl release 102 for mouse. It includes descriptions of genes, transcripts, exons, UTRs etc.

We can turn the whole gene-level annotation table into a data frame so we can work with it using the tidyverse suite of tools.

annotations <- genes(MouseEnsDb, return.type = "data.frame")

# lets see what information we have
colnames(annotations)
##  [1] "gene_id"              "gene_name"            "gene_biotype"        
##  [4] "gene_seq_start"       "gene_seq_end"         "seq_name"            
##  [7] "seq_strand"           "seq_coord_system"     "description"         
## [10] "gene_id_version"      "canonical_transcript" "symbol"              
## [13] "entrezid"

Missing annotations

Let’s inspect the annotation.

head(annotations)
##              gene_id     gene_name         gene_biotype gene_seq_start
## 1 ENSMUSG00000102693 4933401J01Rik                  TEC        3073253
## 2 ENSMUSG00000064842       Gm26206                snRNA        3102016
## 3 ENSMUSG00000051951          Xkr4       protein_coding        3205901
## 4 ENSMUSG00000102851       Gm18956 processed_pseudogene        3252757
## 5 ENSMUSG00000103377       Gm37180                  TEC        3365731
## 6 ENSMUSG00000104017       Gm37363                  TEC        3375556
##   gene_seq_end seq_name seq_strand seq_coord_system
## 1      3074322        1          1       chromosome
## 2      3102125        1          1       chromosome
## 3      3671498        1         -1       chromosome
## 4      3253236        1          1       chromosome
## 5      3368549        1         -1       chromosome
## 6      3377788        1         -1       chromosome
##                                                             description
## 1        RIKEN cDNA 4933401J01 gene [Source:MGI Symbol;Acc:MGI:1918292]
## 2             predicted gene, 26206 [Source:MGI Symbol;Acc:MGI:5455983]
## 3 X-linked Kx blood group related 4 [Source:MGI Symbol;Acc:MGI:3528744]
## 4             predicted gene, 18956 [Source:MGI Symbol;Acc:MGI:5011141]
## 5             predicted gene, 37180 [Source:MGI Symbol;Acc:MGI:5610408]
## 6             predicted gene, 37363 [Source:MGI Symbol;Acc:MGI:5610591]
##        gene_id_version canonical_transcript        symbol entrezid
## 1 ENSMUSG00000102693.1   ENSMUST00000193812 4933401J01Rik       NA
## 2 ENSMUSG00000064842.1   ENSMUST00000082908       Gm26206       NA
## 3 ENSMUSG00000051951.5   ENSMUST00000070533          Xkr4   497097
## 4 ENSMUSG00000102851.1   ENSMUST00000192857       Gm18956       NA
## 5 ENSMUSG00000103377.1   ENSMUST00000195335       Gm37180       NA
## 6 ENSMUSG00000104017.1   ENSMUST00000192336       Gm37363       NA
dim(annotations)
## [1] 56305    13
length(unique(annotations$entrezid))
## [1] 21282
sum(is.na(annotations$entrezid)) # Why are there NAs in the ENTREZID column?
## [1] 34686

Gene/transcript/protein IDs mapping between different databases not always perfect. Although majority of IDs map between databases, small subset may not have matching ID or may have more than one match. This is because feature identification algorithms, naming methodologies and versions may differ among databases. For instance NCBI and HGNC give same ID for different gene versions, whereas Ensembl assigned separate IDs for gene versions. Read interesting discussion on biostars.

There are some Ensembl IDs with no EntrezID. These gene ids has no corresponding Entrez ID in the EnsDb database package. The Ensembl and Entrez databases don’t match on a 1:1 level although they have started taking steps towards consolidating in recent years.

Join the annotation to the differential expression results table

Select the columns we want and rename them.

annot <- annotations %>%
    select(GeneID = gene_id, Symbol = symbol, Description = description) %>%
    distinct()

NOTE: You may get an error with this command that looks like:

Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function 'select' for signature '"data.frame"'

This is due to the select function from dplyr (part of tidyverse) being masked by the select function from one of the annotation packages. This will have happened because the annotation package was loaded after the tidyverse. You can either restart your R session and reload the required packages, this time being sure to load tidyverse last, or just use dplyr::select to explicitly use the select function from dplyr:

annot <- annotations %>%
    dplyr::select(GeneID = gene_id, Symbol = symbol, Description = description)
annot.interaction.11 <- results.interaction.11 %>%
    as.data.frame() %>%
    rownames_to_column("GeneID") %>%
    as_tibble() %>%
    left_join(annot, by="GeneID")
annot.interaction.11
## # A tibble: 20,091 × 9
##    GeneID           baseMean log2FoldChange  lfcSE   stat  pvalue    padj Symbol
##    <chr>               <dbl>          <dbl>  <dbl>  <dbl>   <dbl>   <dbl> <chr> 
##  1 ENSMUSG00000000…  1103.           -0.164 0.141  -1.16  2.45e-1 0.688   Gnai3 
##  2 ENSMUSG00000000…    58.6           0.427 0.393   1.09  2.77e-1 0.719   Cdc45 
##  3 ENSMUSG00000000…    49.2          -0.238 0.627  -0.380 7.04e-1 0.929   Scml2 
##  4 ENSMUSG00000000…     7.99          0.819 0.974   0.841 4.00e-1 0.803   Apoh  
##  5 ENSMUSG00000000…  1981.            0.128 0.0998  1.29  1.99e-1 0.640   Narf  
##  6 ENSMUSG00000000…   607.           -0.304 0.148  -2.05  4.01e-2 0.299   Cav2  
##  7 ENSMUSG00000000…  1297.            0.520 0.146   3.57  3.59e-4 0.00842 Klf6  
##  8 ENSMUSG00000000…  1063.            0.106 0.189   0.561 5.75e-1 0.885   Scmh1 
##  9 ENSMUSG00000000…  2286.           -0.255 0.132  -1.94  5.29e-2 0.352   Cox5a 
## 10 ENSMUSG00000000…    92.7          -0.320 0.288  -1.11  2.67e-1 0.709   Tbx2  
## # ℹ 20,081 more rows
## # ℹ 1 more variable: Description <chr>

Visualisation

P-value histogram

A quick and easy “sanity check” for our DE results is to generate a p-value histogram. What we should see is a high bar at 0 - 0.05 and then a roughly uniform tail to the right of this. There is a nice explanation of other possible patterns in the histogram and what to do when you see them in this post.

hist(annot.interaction.11$pvalue)