library(biomaRt)
library(DESeq2)
library(tidyverse)
package ‘dplyr’ was built under R version 3.5.1

Before starting this section, we will make sure we have all the relevant objects from the Differential Expression analysis.

load("Robjects/DE.RData")

Overview

Adding annotation to the DESeq2 results

We have a list of significantly differentially expressed genes, but the only annotation we can see is the Ensembl Gene ID, which is not very informative.

There are a number of ways to add annotation. One method is to do this using the org.Mm.eg.db package. This package is one of several organism-level packages which are re-built every 6 months. These packages are listed on the annotation section of the Bioconductor, and are installed in the same way as regular Bioconductor packages.

An alternative approach is to use biomaRt, an interface to the BioMart resource. This is the method we will use today.

Select BioMart database and dataset

The first step is to select the Biomart database we are going to access and which data set we are going to use.

# view the available databases
listMarts()
## set up connection to ensembl database
ensembl=useMart("ENSEMBL_MART_ENSEMBL")
# list the available datasets (species)
listDatasets(ensembl) %>% 
    filter(str_detect(description, "Mouse"))
# specify a data set to use
ensembl = useDataset("mmusculus_gene_ensembl", mart=ensembl)

Query the database

Now we need to set up a query. For this we need to specify three things:

  1. What type of information we are going to search the dataset on - called filters. In our case this is Ensembl Gene IDs
  2. A vector of the values for our filter - the Ensembl Gene IDs from our DE results table
  3. What columns (attributes) of the dataset we want returned.

Returning data from Biomart can take time, so it’s always a good idea to test your query on a small list of values first to make sure it is doing what you want. We’ll just use the first 1000 genes for now.

# check the available "filters" - things you can filter for
listFilters(ensembl) %>% 
    filter(str_detect(name, "ensembl"))
# Set the filter type and values
filterType <- "ensembl_gene_id"
filterValues <- rownames(resLvV)[1:1000]
# check the available "attributes" - things you can retreive
listAttributes(ensembl) %>% 
    head(20)
# Set the list of attributes
attributeNames <- c('ensembl_gene_id', 'entrezgene', 'external_gene_name')
# run the query
annot <- getBM(attributes=attributeNames, 
               filters = filterType, 
               values = filterValues, 
               mart = ensembl)

Batch submitting query [============================] 100% eta:  0s
                                                                   

One-to-many relationships

Let’s inspect the annotation.

head(annot)
dim(annot) # why are there more than 1000 rows?
[1] 1001    3
length(unique(annot$ensembl_gene_id)) # why are there less than 1000 Gene ids?
[1] 999
isDup <- duplicated(annot$ensembl_gene_id)
dup <- annot$ensembl_gene_id[isDup]
annot[annot$ensembl_gene_id%in%dup,]

There are a couple of genes that have multiple entries in the retrieved annotation. This is becaues there are multiple Entrez IDs for a single Ensembl gene. These one-to-many relationships come up frequently in genomic databases, it is important to be aware of them and check when necessary.

We will need to do a little work before adding the annotation to out results table. We could decide to discard one or both of the Entrez ID mappings, or we could concatenate the Entrez IDs so that we don’t lose information.

Retrieve full annotation

Challenge 1

That was just 1000 genes. We need annotations for the entire results table. Also, there may be some other interesting columns in BioMart that we wish to retrieve.

  1. Search the attributes and add the following to our list of attributes:
    1. The gene description
    2. The genomic position - chromosome, start, end, and strand (4 columns)
    3. The gene biotype
  2. Query BioMart using all of the genes in our results table (resLvV)
  3. How many Ensembl genes have multipe Entrez IDs associated with them?
  4. How many Ensembl genes in resLvV don’t have any annotation? Why is this?

Add annotation to the results table

We can now add the annotation to the results table and then save the results using the write_tsv function, which writes the results out to a tab separated file. To save time we have created an annotation table in which we have modified the cumbersome Biomart column names, added median transcript length (we’ll need this in a later session), and dealt with the one-to-many issues for Entrez IDs.

load("Robjects/Ensembl_annotations.RData")
colnames(ensemblAnnot)
 [1] "GeneID"         "Entrez"         "Symbol"         "Description"   
 [5] "Biotype"        "Chr"            "Start"          "End"           
 [9] "Strand"         "medianTxLength"
annotLvV <- as.data.frame(resLvV) %>% 
    rownames_to_column("GeneID") %>% 
    left_join(ensemblAnnot, "GeneID") %>% 
    rename(logFC=log2FoldChange, FDR=padj)

Finally we can output the annotation DE results using write_csv.

write_tsv(annotLvV, "data/VirginVsLactating_Results_Annotated.txt")

Challenge 2

Have a look at gene symbols for most significant genes by adjusted p-value. Do they make biological sense in the context of comparing gene expression in mammary gland tissue between lactating and virgin mice? You may want to do a quick web search of your favourite gene/protein database

Visualisation

DESeq2 provides a functon called lfcShrink that shrinks log-Fold Change (LFC) estimates towards zero using and empirical Bayes procedure. The reason for doing this is that there is high variance in the LFC estimates when counts are low and this results in lowly expressed genes appearing to be show greater differences between groups that highly expressed genes. The lfcShrink method compensates for this and allows better visualisation and ranking of genes. We will use it for our visualisations of the data.

ddsShrink <- lfcShrink(ddsObj, coef="Status_lactate_vs_virgin")
shrinkLvV <- as.data.frame(ddsShrink) %>%
    rownames_to_column("GeneID") %>% 
    left_join(ensemblAnnot, "GeneID") %>% 
    rename(logFC=log2FoldChange, FDR=padj)

P-value histogram

A quick and easy “sanity check” for our DE results is to generate a p-value histogram. What we should see is a high bar in the 0 - 0.05 and then a roughly uniform tail to the right of this. There is a nice explanation of other possible patterns in the histogram and what to when you see them in this post.

hist(shrinkLvV$pvalue)

MA plots

MA plots are a common way to visualize the results of a differential analysis. We met them briefly towards the end of Session 2. This plot shows the log-Fold Change for each gene against its average expression across all samples in the two conditions being contrasted.

DESeq2 has a handy function for plotting this…