library(biomaRt)
library(tidyverse)
We have a list of significantly differentially expressed genes, but the only annotation we can see is the Ensembl Gene ID, which is not very informative.
results.interaction.11 <- readRDS("RObjects/DESeqResults.interaction_d11.rds")
results.interaction.11
## log2 fold change (MLE): Status Infected vs Uninfected
## Wald test p-value: Status Infected vs Uninfected
## DataFrame with 20091 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001 1102.56094 -0.163980 0.1411446 -1.161790 0.245321
## ENSMUSG00000000028 58.60055 0.427344 0.3933454 1.086435 0.277287
## ENSMUSG00000000037 49.23586 -0.238397 0.6272741 -0.380053 0.703906
## ENSMUSG00000000049 7.98789 0.819093 0.9737780 0.841150 0.400264
## ENSMUSG00000000056 1981.00402 0.128189 0.0997537 1.285053 0.198774
## ... ... ... ... ... ...
## ENSMUSG00000118619 5.99873 0.328550 0.836781 0.392635 0.6945888
## ENSMUSG00000118623 6.01600 -0.679592 1.645886 -0.412904 0.6796773
## ENSMUSG00000118633 2.49506 3.130835 1.797103 1.742157 0.0814810
## ENSMUSG00000118642 47.62473 3.411558 1.422908 2.397595 0.0165031
## ENSMUSG00000118653 3.65289 4.881280 2.714221 1.798409 0.0721122
## padj
## <numeric>
## ENSMUSG00000000001 0.687681
## ENSMUSG00000000028 0.719157
## ENSMUSG00000000037 0.928560
## ENSMUSG00000000049 0.802875
## ENSMUSG00000000056 0.640060
## ... ...
## ENSMUSG00000118619 0.925145
## ENSMUSG00000118623 0.921939
## ENSMUSG00000118633 0.437111
## ENSMUSG00000118642 0.173757
## ENSMUSG00000118653 0.411369
There are a number of ways to add annotation. One method is to do this using the org.Mm.eg.db package. This package is one of several organism-level packages which are re-built every 6 months. These packages are listed on the annotation section of the Bioconductor, and are installed in the same way as regular Bioconductor packages.
An alternative approach is to use biomaRt
, an interface to the BioMart resource. This is the method we will use today.
The first step is to select the Biomart database we are going to access and which data set we are going to use.
There are multiple mirror sites that we could use for access. The default is to use the European servers, however if the server is busy or inaccessible for some reason it is possible to access one of the three mirror sites. See the instructions at here for detailed instruction on using different mirrors, but in brief simply add the mirror
argument to the listEnsembl
and useEnsembl
functions below.
e.g to use the US West mirror:
ensembl <- useEnsembl("genes", mirror = "uswest")
Pro Tip: The Ensembl servers tend to get very busy in the afternoons, to the extent that biomaRt may have trouble getting and maitaining a connection. Try to do this in the morning.
# view the available databases
listEnsembl()
## biomart version
## 1 genes Ensembl Genes 103
## 2 mouse_strains Mouse strains 103
## 3 snps Ensembl Variation 103
## 4 regulation Ensembl Regulation 103
We want the “genes” database.
The current Ensembl release is version 103, however, when we downloaded the reference files the current version was 102. We will need to ensure that biomaRt
accesses the correct version of the gene annotations.
# view the available databases
listEnsembl(version = 102)
## biomart version
## 1 genes Ensembl Genes 102
## 2 mouse_strains Mouse strains 102
## 3 snps Ensembl Variation 102
## 4 regulation Ensembl Regulation 102
Now we need to identify the dataset for mouse.
## set up connection to ensembl database
ensembl <- useEnsembl(biomart = "genes", version=102)
# serach the available datasets (species)
searchDatasets(mart = ensembl, pattern = "Mouse")
## dataset description version
## 105 mmurinus_gene_ensembl Mouse Lemur genes (Mmur_3.0) Mmur_3.0
## 106 mmusculus_gene_ensembl Mouse genes (GRCm38.p6) GRCm38.p6
We want the mmusculus_gene_ensembl
data set.
We can now specify the complete connection to the database and dataset that we want to access.
ensembl <- useEnsembl(biomart = 'genes',
dataset = 'mmusculus_gene_ensembl',
version = 102)
Now we need to set up a query. For this we need to specify three things:
# check the available "filters" - things you can filter for
ensembl_filters <- listFilters(ensembl)
# To find the correct name for the Ensembl ID we can filter the name column
ensembl_filters %>%
filter(str_detect(name, "ensembl"))
## name
## 1 with_clone_based_ensembl_gene
## 2 with_clone_based_ensembl_transcript
## 3 ensembl_gene_id
## 4 ensembl_gene_id_version
## 5 ensembl_transcript_id
## 6 ensembl_transcript_id_version
## 7 ensembl_peptide_id
## 8 ensembl_peptide_id_version
## 9 ensembl_exon_id
## 10 clone_based_ensembl_gene
## 11 clone_based_ensembl_transcript
## description
## 1 With Clone-based (Ensembl) gene ID(s)
## 2 With Clone-based (Ensembl) transcript ID(s)
## 3 Gene stable ID(s) [e.g. ENSMUSG00000000001]
## 4 Gene stable ID(s) with version [e.g. ENSMUSG00000000001.4]
## 5 Transcript stable ID(s) [e.g. ENSMUST00000000001]
## 6 Transcript stable ID(s) with version [e.g. ENSMUST00000000001.4]
## 7 Protein stable ID(s) [e.g. ENSMUSP00000000001]
## 8 Protein stable ID(s) with version [e.g. ENSMUSP00000000001.4]
## 9 Exon ID(s) [e.g. ENSMUSE00000097910]
## 10 Clone-based (Ensembl) gene ID(s) [e.g. AC015535.1]
## 11 Clone-based (Ensembl) transcript ID(s) [e.g. AC015535.1-201]
So, we will use ensembl_gene_id
to query the data bases
# check the available "attributes" - things you can retreive
ensembl_attributes <- listAttributes(ensembl)
head(ensembl_attributes, 20)
## name description
## 1 ensembl_gene_id Gene stable ID
## 2 ensembl_gene_id_version Gene stable ID version
## 3 ensembl_transcript_id Transcript stable ID
## 4 ensembl_transcript_id_version Transcript stable ID version
## 5 ensembl_peptide_id Protein stable ID
## 6 ensembl_peptide_id_version Protein stable ID version
## 7 ensembl_exon_id Exon stable ID
## 8 description Gene description
## 9 chromosome_name Chromosome/scaffold name
## 10 start_position Gene start (bp)
## 11 end_position Gene end (bp)
## 12 strand Strand
## 13 band Karyotype band
## 14 transcript_start Transcript start (bp)
## 15 transcript_end Transcript end (bp)
## 16 transcription_start_site Transcription start site (TSS)
## 17 transcript_length Transcript length (including UTRs and CDS)
## 18 transcript_tsl Transcript support level (TSL)
## 19 transcript_gencode_basic GENCODE basic annotation
## 20 transcript_appris APPRIS annotation
## page
## 1 feature_page
## 2 feature_page
## 3 feature_page
## 4 feature_page
## 5 feature_page
## 6 feature_page
## 7 feature_page
## 8 feature_page
## 9 feature_page
## 10 feature_page
## 11 feature_page
## 12 feature_page
## 13 feature_page
## 14 feature_page
## 15 feature_page
## 16 feature_page
## 17 feature_page
## 18 feature_page
## 19 feature_page
## 20 feature_page
We’ll retrieve the external_gene_name
, which is the Gene Symbol, the entrez_id
- we may need this for tools that use the NCBI databases - and the entrez_accession
, which is the Gene Symbol associated with that entrez_id
.
We also need to specify that we want the query to return the ensembl_gene_id
that we used to query the database. This ensures that our final table will include all of query gene ids, even if no record was found in the data base.
Returning data from Biomart can take time, so it’s always a good idea to test your query on a small list of values first to make sure it is doing what you want. We’ll just use the first 1000 genes for now.
# Set the filter type and values
ourFilterType <- "ensembl_gene_id"
# get the Ensembl IDs from our results table
filterValues <- rownames(results.interaction.11)[1:1000]
# Set the list of attributes
attributeNames <- c("ensembl_gene_id",
"external_gene_name",
"entrezgene_id",
"entrezgene_accession")
# run the query
annot <- getBM(attributes=attributeNames,
filters = ourFilterType,
values = filterValues,
mart = ensembl)
head(annot)
## ensembl_gene_id external_gene_name entrezgene_id entrezgene_accession
## 1 ENSMUSG00000000001 Gnai3 14679 Gnai3
## 2 ENSMUSG00000000028 Cdc45 12544 Cdc45
## 3 ENSMUSG00000000037 Scml2 107815 Scml2
## 4 ENSMUSG00000000049 Apoh 11818 Apoh
## 5 ENSMUSG00000000056 Narf 67608 Narf
## 6 ENSMUSG00000000058 Cav2 12390 Cav2
Let’s inspect the annotation.
dim(annot)
## [1] 1002 4
Some genes that have multiple entries in the retrieved annotation. This is because there are multiple Entrez IDs for a single Ensembl gene. These one-to-many relationships come up frequently in genomic databases, it is important to be aware of them and check when necessary.
# find all rows containing duplicated ensembl ids
annot %>%
add_count(ensembl_gene_id) %>%
filter(n>1)
## ensembl_gene_id external_gene_name entrezgene_id entrezgene_accession n
## 1 ENSMUSG00000000562 Adora3 11542 Adora3 2
## 2 ENSMUSG00000000562 Adora3 69296 Tmigd3 2
## 3 ENSMUSG00000004455 Ppp1cc 19047 Ppp1cc 2
## 4 ENSMUSG00000004455 Ppp1cc 434233 Ppp1ccb 2
We will need to do a little work before adding the annotation to our results table. We could decide to discard one or both of the Entrez ID mappings, or we could concatenate the Entrez IDs so that we don’t lose information. Matching the entrezgene_accession
and the external_gene_name
can help with resolving some of these problems.
fixedDuplicates <- annot %>%
add_count(ensembl_gene_id) %>%
filter(n>1) %>%
select(-n) %>%
filter(entrezgene_accession==external_gene_name)
annot <- annot %>%
add_count(ensembl_gene_id) %>%
filter(n==1) %>%
select(-n) %>%
bind_rows(fixedDuplicates)
nrow(annot)
## [1] 1000
length(unique(annot$ensembl_gene_id))
## [1] 1000
Challenge 1
That was just 1000 genes. We need annotations for the entire results table. Also, there may be some other interesting columns in BioMart that we wish to retrieve.
Search the attributes and add the following to our list of attributes:
- The gene description
- The gene biotype
Query BioMart using all of the genes in our results table (
results.interaction.11
)How many Ensembl genes have multipe Entrez IDs associated with them?
How many Ensembl genes in
results.interaction.11
don’t have any annotation?