library(biomaRt)
library(tidyverse)

Before starting this section, we will make sure we have all the relevant objects from the Differential Expression analysis.

load("Robjects/DE.RData")

Overview

Adding annotation to the DESeq2 results

We have a list of significantly differentially expressed genes, but the only annotation we can see is the Ensembl Gene ID, which is not very informative.

There are a number of ways to add annotation. One method is to do this using the org.Mm.eg.db package. This package is one of several organism-level packages which are re-built every 6 months. These packages are listed on the annotation section of the Bioconductor, and are installed in the same way as regular Bioconductor packages.

An alternative approach is to use biomaRt, an interface to the BioMart resource. This is the method we will use today.

Select BioMart database and dataset

The first step is to select the Biomart database we are going to access and which data set we are going to use.

There are multiple mirror sites that we could use for access. The default is to use the European servers, however if the server is busy or inaccessible for some reason it is possible to access one of the three mirror sites. See the instructions at here for detailed instruction on using different mirrors, but in brief simply add the mirror argument to the listEnsembl and useEnsembl functions below.

e.g to use the US West mirror:
ensembl <- useEnsembl("genes", mirror = "uswest")

Pro Tip: The Ensembl servers tend to get very busy in the afternoons, to the extent that biomaRt may have trouble getting and maitaining a connection. Try to do this in the morning.

list the available databases

# view the available databases
listEnsembl()
##         biomart                version
## 1         genes      Ensembl Genes 101
## 2 mouse_strains      Mouse strains 101
## 3          snps  Ensembl Variation 101
## 4    regulation Ensembl Regulation 101

list the available datasets (species)

## set up connection to ensembl database
ensembl <- useEnsembl("genes")

# serach the available datasets (species)
searchDatasets(mart = ensembl, pattern = "Mouse")
##                    dataset                  description   version
## 105  mmurinus_gene_ensembl Mouse Lemur genes (Mmur_3.0)  Mmur_3.0
## 106 mmusculus_gene_ensembl      Mouse genes (GRCm38.p6) GRCm38.p6

specify a data set to use

ensembl <- useDataset("mmusculus_gene_ensembl", mart=ensembl)

Query the database

Now we need to set up a query. For this we need to specify three things:

  1. What type of information we are going to search the dataset on - called filters. In our case this is Ensembl Gene IDs
  2. A vector of the values for our filter - the Ensembl Gene IDs from our DE results table
  3. What columns (attributes) of the dataset we want returned.
# check the available "filters" - things you can filter for
ensembl_filters  <- listFilters(ensembl)

# To find the correct name for the Ensembl ID we can filter the name column
ensembl_filters %>% 
    filter(str_detect(name, "ensembl"))
##                                   name
## 1        with_clone_based_ensembl_gene
## 2  with_clone_based_ensembl_transcript
## 3                      ensembl_gene_id
## 4              ensembl_gene_id_version
## 5                ensembl_transcript_id
## 6        ensembl_transcript_id_version
## 7                   ensembl_peptide_id
## 8           ensembl_peptide_id_version
## 9                      ensembl_exon_id
## 10            clone_based_ensembl_gene
## 11      clone_based_ensembl_transcript
##                                                         description
## 1                             With Clone-based (Ensembl) gene ID(s)
## 2                       With Clone-based (Ensembl) transcript ID(s)
## 3                       Gene stable ID(s) [e.g. ENSMUSG00000000001]
## 4        Gene stable ID(s) with version [e.g. ENSMUSG00000000001.4]
## 5                 Transcript stable ID(s) [e.g. ENSMUST00000000001]
## 6  Transcript stable ID(s) with version [e.g. ENSMUST00000000001.4]
## 7                    Protein stable ID(s) [e.g. ENSMUSP00000000001]
## 8     Protein stable ID(s) with version [e.g. ENSMUSP00000000001.4]
## 9                              Exon ID(s) [e.g. ENSMUSE00000097910]
## 10               Clone-based (Ensembl) gene ID(s) [e.g. AC015535.1]
## 11     Clone-based (Ensembl) transcript ID(s) [e.g. AC015535.1-201]

So, we will use ensembl_gene_id to query the data bases

# check the available "attributes" - things you can retreive
ensembl_attributes <- listAttributes(ensembl)
head(ensembl_attributes, 20)
##                             name                                description
## 1                ensembl_gene_id                             Gene stable ID
## 2        ensembl_gene_id_version                     Gene stable ID version
## 3          ensembl_transcript_id                       Transcript stable ID
## 4  ensembl_transcript_id_version               Transcript stable ID version
## 5             ensembl_peptide_id                          Protein stable ID
## 6     ensembl_peptide_id_version                  Protein stable ID version
## 7                ensembl_exon_id                             Exon stable ID
## 8                    description                           Gene description
## 9                chromosome_name                   Chromosome/scaffold name
## 10                start_position                            Gene start (bp)
## 11                  end_position                              Gene end (bp)
## 12                        strand                                     Strand
## 13                          band                             Karyotype band
## 14              transcript_start                      Transcript start (bp)
## 15                transcript_end                        Transcript end (bp)
## 16      transcription_start_site             Transcription start site (TSS)
## 17             transcript_length Transcript length (including UTRs and CDS)
## 18                transcript_tsl             Transcript support level (TSL)
## 19      transcript_gencode_basic                   GENCODE basic annotation
## 20             transcript_appris                          APPRIS annotation
##            page
## 1  feature_page
## 2  feature_page
## 3  feature_page
## 4  feature_page
## 5  feature_page
## 6  feature_page
## 7  feature_page
## 8  feature_page
## 9  feature_page
## 10 feature_page
## 11 feature_page
## 12 feature_page
## 13 feature_page
## 14 feature_page
## 15 feature_page
## 16 feature_page
## 17 feature_page
## 18 feature_page
## 19 feature_page
## 20 feature_page

We’ll retrieve the external_gene_name, which is the Gene Symbol, the entrez_id, we’ll may need this for tools that use the NCBI databases, and the entrez_accession, which is the Gene Symbol associated with that entrez_id.

We also need to specify that we want the query to return the ensembl_gene_id that we used to query the database.

Returning data from Biomart can take time, so it’s always a good idea to test your query on a small list of values first to make sure it is doing what you want. We’ll just use the first 1000 genes for now.

# Set the filter type and values
ourFilterType <- "ensembl_gene_id"

# get the Ensembl IDs from our results table
filterValues <- rownames(resLvV)[1:1000]

# Set the list of attributes
attributeNames <- c("ensembl_gene_id",
                    "external_gene_name", 
                    "entrezgene_id", 
                    "entrezgene_accession")

# run the query
annot <- getBM(attributes=attributeNames, 
               filters = ourFilterType, 
               values = filterValues, 
               mart = ensembl)
head(annot)
##      ensembl_gene_id external_gene_name entrezgene_id entrezgene_accession
## 1 ENSMUSG00000001138              Cnnm3         94218                Cnnm3
## 2 ENSMUSG00000001143             Lman2l        214895               Lman2l
## 3 ENSMUSG00000001674              Ddx18         66942                Ddx18
## 4 ENSMUSG00000002459              Rgs20         58175                Rgs20
## 5 ENSMUSG00000002881               Nab1         17936                 Nab1
## 6 ENSMUSG00000003134             Tbc1d8         54610               Tbc1d8

One-to-many relationships

Let’s inspect the annotation.

dim(annot)
## [1] 996   4

Why are there less than 1000 rows?

Some of our Ensembl ID’s have no annotation. This is because we are accessing the latest Ensembl release, but the GTF we used to analyse the data was from an older release and some of the genes annotations have been deprecated. You could either ensure you are using the latest release from the beginning of your analysis, or access the archived release that matches the GTF you used. See the biomaRt manual for instruction on how to do this.

length(unique(annot$ensembl_gene_id))
## [1] 994

Why are there less than 996 annotations?

Some genes that have multiple entries in the retrieved annotation. This is because there are multiple Entrez IDs for a single Ensembl gene. These one-to-many relationships come up frequently in genomic databases, it is important to be aware of them and check when necessary.

# find all rows containing duplicated ensembl ids
annot %>%  
    add_count(ensembl_gene_id) %>%  
    filter(n>1)
##      ensembl_gene_id external_gene_name entrezgene_id entrezgene_accession n
## 1 ENSMUSG00000044783              Hjurp        212427        A730008H23Rik 2
## 2 ENSMUSG00000044783              Hjurp        381280                Hjurp 2
## 3 ENSMUSG00000070645               Ren1         19701                 Ren1 2
## 4 ENSMUSG00000070645               Ren1         19702                 Ren2 2

We will need to do a little work before adding the annotation to our results table. We could decide to discard one or both of the Entrez ID mappings, or we could concatenate the Entrez IDs so that we don’t lose information. Matching the entrezgene_accession and the external_gene_name can help with resolving some of these problems.

fixedDuplicates <- annot %>%  
    add_count(ensembl_gene_id) %>%  
    filter(n>1) %>% 
    select(-n) %>% 
    filter(entrezgene_accession==external_gene_name)

annot <- annot %>%  
    add_count(ensembl_gene_id) %>%  
    filter(n==1) %>% 
    select(-n) %>% 
    bind_rows(fixedDuplicates)

nrow(annot)
## [1] 994
length(unique(annot$ensembl_gene_id))
## [1] 994

Retrieve full annotation

Challenge 1

That was just 1000 genes. We need annotations for the entire results table. Also, there may be some other interesting columns in BioMart that we wish to retrieve.

  1. Search the attributes and add the following to our list of attributes:

    1. The gene description
    2. The gene biotype
  2. Query BioMart using all of the genes in our results table (resLvV)

  3. How many Ensembl genes have multipe Entrez IDs associated with them?

  4. How many Ensembl genes in resLvV don’t have any annotation?