library(biomaRt)
library(DESeq2)
library(tidyverse)
Before starting this section, we will make sure we have all the relevant objects from the Differential Expression analysis.
load("Robjects/DE.RData")
We have a list of significantly differentially expressed genes, but the only annotation we can see is the Ensembl Gene ID, which is not very informative.
There are a number of ways to add annotation. One method is to do this using the org.Mm.eg.db package. This package is one of several organism-level packages which are re-built every 6 months. These packages are listed on the annotation section of the Bioconductor, and are installed in the same way as regular Bioconductor packages.
An alternative approach is to use biomaRt
, an interface to the BioMart resource. This is the method we will use today.
The first step is to select the Biomart database we are going to access and which data set we are going to use.
There are multiple mirror sites that we could use for access. The default is to use the European servers, however if the server is busy or inaccessible for some reason it is possible to access one of the three mirror sites. See the instructions at here for detailed instruction on using different mirrors, but in brief simply add the host
argument to the listMarts
and useMart
functions below.
e.g to use the US West mirror:
ensembl=useMart("ENSEMBL_MART_ENSEMBL", host="uswest.ensembl.org")
# view the available databases
listMarts()
## biomart version
## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 100
## 2 ENSEMBL_MART_MOUSE Mouse strains 100
## 3 ENSEMBL_MART_SNP Ensembl Variation 100
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 100
## set up connection to ensembl database
ensembl=useMart("ENSEMBL_MART_ENSEMBL")
# list the available datasets (species)
listDatasets(ensembl) %>%
filter(str_detect(description, "Mouse"))
## dataset description version
## 1 mmurinus_gene_ensembl Mouse Lemur genes (Mmur_3.0) Mmur_3.0
## 2 mmusculus_gene_ensembl Mouse genes (GRCm38.p6) GRCm38.p6
# specify a data set to use
ensembl = useDataset("mmusculus_gene_ensembl", mart=ensembl)
Now we need to set up a query. For this we need to specify three things:
Returning data from Biomart can take time, so it's always a good idea to test your query on a small list of values first to make sure it is doing what you want. We'll just use the first 1000 genes for now.
# check the available "filters" - things you can filter for
listFilters(ensembl) %>%
filter(str_detect(name, "ensembl"))
## name
## 1 with_clone_based_ensembl_gene
## 2 with_clone_based_ensembl_transcript
## 3 ensembl_gene_id
## 4 ensembl_gene_id_version
## 5 ensembl_transcript_id
## 6 ensembl_transcript_id_version
## 7 ensembl_peptide_id
## 8 ensembl_peptide_id_version
## 9 ensembl_exon_id
## 10 clone_based_ensembl_gene
## 11 clone_based_ensembl_transcript
## description
## 1 With Clone-based (Ensembl) gene ID(s)
## 2 With Clone-based (Ensembl) transcript ID(s)
## 3 Gene stable ID(s) [e.g. ENSMUSG00000000001]
## 4 Gene stable ID(s) with version [e.g. ENSMUSG00000000001.4]
## 5 Transcript stable ID(s) [e.g. ENSMUST00000000001]
## 6 Transcript stable ID(s) with version [e.g. ENSMUST00000000001.4]
## 7 Protein stable ID(s) [e.g. ENSMUSP00000000001]
## 8 Protein stable ID(s) with version [e.g. ENSMUSP00000000001.4]
## 9 Exon ID(s) [e.g. ENSMUSE00000097910]
## 10 Clone-based (Ensembl) gene ID(s) [e.g. AC015535.1]
## 11 Clone-based (Ensembl) transcript ID(s) [e.g. AC015535.1-201]
# Set the filter type and values
ourFilterType <- "ensembl_gene_id"
filterValues <- rownames(resLvV)[1:1000]
# check the available "attributes" - things you can retreive
listAttributes(ensembl) %>%
head(20)
## name description
## 1 ensembl_gene_id Gene stable ID
## 2 ensembl_gene_id_version Gene stable ID version
## 3 ensembl_transcript_id Transcript stable ID
## 4 ensembl_transcript_id_version Transcript stable ID version
## 5 ensembl_peptide_id Protein stable ID
## 6 ensembl_peptide_id_version Protein stable ID version
## 7 ensembl_exon_id Exon stable ID
## 8 description Gene description
## 9 chromosome_name Chromosome/scaffold name
## 10 start_position Gene start (bp)
## 11 end_position Gene end (bp)
## 12 strand Strand
## 13 band Karyotype band
## 14 transcript_start Transcript start (bp)
## 15 transcript_end Transcript end (bp)
## 16 transcription_start_site Transcription start site (TSS)
## 17 transcript_length Transcript length (including UTRs and CDS)
## 18 transcript_tsl Transcript support level (TSL)
## 19 transcript_gencode_basic GENCODE basic annotation
## 20 transcript_appris APPRIS annotation
## page
## 1 feature_page
## 2 feature_page
## 3 feature_page
## 4 feature_page
## 5 feature_page
## 6 feature_page
## 7 feature_page
## 8 feature_page
## 9 feature_page
## 10 feature_page
## 11 feature_page
## 12 feature_page
## 13 feature_page
## 14 feature_page
## 15 feature_page
## 16 feature_page
## 17 feature_page
## 18 feature_page
## 19 feature_page
## 20 feature_page
# Set the list of attributes
attributeNames <- c('ensembl_gene_id', 'entrezgene_id', 'external_gene_name')
# run the query
annot <- getBM(attributes=attributeNames,
filters = ourFilterType,
values = filterValues,
mart = ensembl)
Let's inspect the annotation.
head(annot)
## ensembl_gene_id entrezgene_id external_gene_name
## 1 ENSMUSG00000001138 94218 Cnnm3
## 2 ENSMUSG00000001143 214895 Lman2l
## 3 ENSMUSG00000001674 66942 Ddx18
## 4 ENSMUSG00000002459 58175 Rgs20
## 5 ENSMUSG00000002881 17936 Nab1
## 6 ENSMUSG00000003134 54610 Tbc1d8
dim(annot) # why are there more than 1000 rows?
## [1] 996 3
length(unique(annot$ensembl_gene_id))
## [1] 994
# find all rows containing duplicated ensembl ids
annot %>%
add_count(ensembl_gene_id) %>%
filter(n>1)
## ensembl_gene_id entrezgene_id external_gene_name n
## 1 ENSMUSG00000044783 212427 Hjurp 2
## 2 ENSMUSG00000044783 381280 Hjurp 2
## 3 ENSMUSG00000070645 19701 Ren1 2
## 4 ENSMUSG00000070645 19702 Ren1 2
There are a couple of genes that have multiple entries in the retrieved annotation. This is becaues there are multiple Entrez IDs for a single Ensembl gene. These one-to-many relationships come up frequently in genomic databases, it is important to be aware of them and check when necessary.
We will need to do a little work before adding the annotation to out results table. We could decide to discard one or both of the Entrez ID mappings, or we could concatenate the Entrez IDs so that we don't lose information.
Challenge 1
That was just 1000 genes. We need annotations for the entire results table. Also, there may be some other interesting columns in BioMart that we wish to retrieve.
- Search the attributes and add the following to our list of attributes:
- The gene description
- The gene biotype
Query BioMart using all of the genes in our results table (
resLvV
)- How many Ensembl genes have multipe Entrez IDs associated with them?
How many Ensembl genes in
resLvV
don't have any annotation? Why is this?