library(biomaRt)
library(DESeq2)
library(tidyverse)

Before starting this section, we will make sure we have all the relevant objects from the Differential Expression analysis.

load("Robjects/DE.RData")

Overview

Adding annotation to the DESeq2 results

We have a list of significantly differentially expressed genes, but the only annotation we can see is the Ensembl Gene ID, which is not very informative.

There are a number of ways to add annotation. One method is to use the org.Mm.eg.db package. This package is one of several organism-level packages which are re-built every 6 months. These packages are listed in the annotation section of the Bioconductor website, and are installed in the same way as regular Bioconductor packages.
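As a rough sketch of the organism-package route (not the method used in the rest of this session), the mapIds function from AnnotationDbi can translate Ensembl gene IDs into gene symbols and Entrez IDs. The three IDs are taken from the example output later in this section; the requireNamespace guard is only there so that the snippet still runs when org.Mm.eg.db is not installed.

```r
# Three Ensembl gene IDs used purely for illustration
ids <- c("ENSMUSG00000001138", "ENSMUSG00000001143", "ENSMUSG00000002459")

if (requireNamespace("org.Mm.eg.db", quietly = TRUE)) {
    library(org.Mm.eg.db)
    # multiVals = "first" keeps one value per key when a mapping is one-to-many
    symbols <- mapIds(org.Mm.eg.db, keys = ids, keytype = "ENSEMBL",
                      column = "SYMBOL", multiVals = "first")
    entrez <- mapIds(org.Mm.eg.db, keys = ids, keytype = "ENSEMBL",
                     column = "ENTREZID", multiVals = "first")
    print(data.frame(GeneID = ids, Symbol = symbols, Entrez = entrez))
}
```

Note that multiVals = "first" silently picks one value when an Ensembl gene maps to several Entrez IDs, so it hides exactly the kind of one-to-many relationship discussed later.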

An alternative approach is to use biomaRt, an interface to the BioMart resource. This is the method we will use today.

Select BioMart database and dataset

The first step is to select the BioMart database we are going to access and which data set we are going to use.

There are multiple mirror sites that we could use for access. The default is to use the European servers; however, if the server is busy or inaccessible for some reason, it is possible to access one of the three mirror sites. See the instructions here for details on using different mirrors; in brief, simply add the host argument to the listMarts and useMart functions below.

e.g. to use the US West mirror:
ensembl=useMart("ENSEMBL_MART_ENSEMBL", host="uswest.ensembl.org")

list the available datasets (species)

# view the available databases
listMarts()
##                biomart               version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 97
## 2   ENSEMBL_MART_MOUSE      Mouse strains 97
## 3     ENSEMBL_MART_SNP  Ensembl Variation 97
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 97
# set up connection to ensembl database
ensembl=useMart("ENSEMBL_MART_ENSEMBL")

# list the available datasets (species)
listDatasets(ensembl) %>% 
    filter(str_detect(description, "Mouse"))
##                  dataset                  description   version
## 1  mmurinus_gene_ensembl Mouse Lemur genes (Mmur_3.0)  Mmur_3.0
## 2 mmusculus_gene_ensembl      Mouse genes (GRCm38.p6) GRCm38.p6
# specify a data set to use
ensembl = useDataset("mmusculus_gene_ensembl", mart=ensembl)

Query the database

Now we need to set up a query. For this we need to specify three things:

  1. What type of information we are going to search the dataset on - called filters. In our case this is Ensembl Gene IDs
  2. A vector of the values for our filter - the Ensembl Gene IDs from our DE results table
  3. What columns (attributes) of the dataset we want returned.

Returning data from BioMart can take time, so it’s always a good idea to test your query on a small list of values first to make sure it is doing what you want. We’ll just use the first 1000 genes for now.

# check the available "filters" - things you can filter for
listFilters(ensembl) %>% 
    filter(str_detect(name, "ensembl"))
##                                   name
## 1        with_clone_based_ensembl_gene
## 2  with_clone_based_ensembl_transcript
## 3                      ensembl_gene_id
## 4              ensembl_gene_id_version
## 5                ensembl_transcript_id
## 6        ensembl_transcript_id_version
## 7                   ensembl_peptide_id
## 8           ensembl_peptide_id_version
## 9                      ensembl_exon_id
## 10            clone_based_ensembl_gene
## 11      clone_based_ensembl_transcript
##                                                         description
## 1                             With Clone-based (Ensembl) gene ID(s)
## 2                       With Clone-based (Ensembl) transcript ID(s)
## 3                       Gene stable ID(s) [e.g. ENSMUSG00000000001]
## 4        Gene stable ID(s) with version [e.g. ENSMUSG00000000001.4]
## 5                 Transcript stable ID(s) [e.g. ENSMUST00000000001]
## 6  Transcript stable ID(s) with version [e.g. ENSMUST00000000001.4]
## 7                    Protein stable ID(s) [e.g. ENSMUSP00000000001]
## 8     Protein stable ID(s) with version [e.g. ENSMUSP00000000001.4]
## 9                              Exon ID(s) [e.g. ENSMUSE00000097910]
## 10               Clone-based (Ensembl) gene ID(s) [e.g. AC016791.1]
## 11     Clone-based (Ensembl) transcript ID(s) [e.g. AC016791.1-201]
# Set the filter type and values
ourFilterType <- "ensembl_gene_id"
filterValues <- rownames(resLvV)[1:1000]

# check the available "attributes" - things you can retrieve
listAttributes(ensembl) %>% 
    head(20)
##                             name
## 1                ensembl_gene_id
## 2        ensembl_gene_id_version
## 3          ensembl_transcript_id
## 4  ensembl_transcript_id_version
## 5             ensembl_peptide_id
## 6     ensembl_peptide_id_version
## 7                ensembl_exon_id
## 8                    description
## 9                chromosome_name
## 10                start_position
## 11                  end_position
## 12                        strand
## 13                          band
## 14              transcript_start
## 15                transcript_end
## 16      transcription_start_site
## 17             transcript_length
## 18                transcript_tsl
## 19      transcript_gencode_basic
## 20             transcript_appris
##                                   description         page
## 1                              Gene stable ID feature_page
## 2                      Gene stable ID version feature_page
## 3                        Transcript stable ID feature_page
## 4                Transcript stable ID version feature_page
## 5                           Protein stable ID feature_page
## 6                   Protein stable ID version feature_page
## 7                              Exon stable ID feature_page
## 8                            Gene description feature_page
## 9                    Chromosome/scaffold name feature_page
## 10                            Gene start (bp) feature_page
## 11                              Gene end (bp) feature_page
## 12                                     Strand feature_page
## 13                             Karyotype band feature_page
## 14                      Transcript start (bp) feature_page
## 15                        Transcript end (bp) feature_page
## 16             Transcription start site (TSS) feature_page
## 17 Transcript length (including UTRs and CDS) feature_page
## 18             Transcript support level (TSL) feature_page
## 19                   GENCODE basic annotation feature_page
## 20                          APPRIS annotation feature_page
# Set the list of attributes
attributeNames <- c('ensembl_gene_id', 'entrezgene_id', 'external_gene_name')

# run the query
annot <- getBM(attributes=attributeNames, 
               filters = ourFilterType, 
               values = filterValues, 
               mart = ensembl)

One-to-many relationships

Let’s inspect the annotation.

head(annot)
##      ensembl_gene_id entrezgene_id external_gene_name
## 1 ENSMUSG00000001138         94218              Cnnm3
## 2 ENSMUSG00000001143        214895             Lman2l
## 3 ENSMUSG00000002459         58175              Rgs20
## 4 ENSMUSG00000002881         17936               Nab1
## 5 ENSMUSG00000003134         54610             Tbc1d8
## 6 ENSMUSG00000003135         52846             Cnot11
dim(annot) # why are there more than 1000 rows?
## [1] 1002    3
length(unique(annot$ensembl_gene_id))
## [1] 1000
# find all rows containing duplicated ensembl ids
annot %>%  
    add_count(ensembl_gene_id) %>%  
    filter(n>1)
## # A tibble: 4 x 4
##   ensembl_gene_id    entrezgene_id external_gene_name     n
##   <chr>                      <int> <chr>              <int>
## 1 ENSMUSG00000044783        212427 Hjurp                  2
## 2 ENSMUSG00000044783        381280 Hjurp                  2
## 3 ENSMUSG00000070645         19701 Ren1                   2
## 4 ENSMUSG00000070645         19702 Ren1                   2

There are a couple of genes that have multiple entries in the retrieved annotation. This is because there are multiple Entrez IDs for a single Ensembl gene. These one-to-many relationships come up frequently in genomic databases, so it is important to be aware of them and to check for them when necessary.

We will need to do a little work before adding the annotation to our results table. We could decide to discard one or both of the Entrez ID mappings, or we could concatenate the Entrez IDs so that we don’t lose information.
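One possible way to concatenate the Entrez IDs is sketched below in base R. The toy table annotToy is an assumption made for illustration: its duplicated rows are copied from the output above, and the ";" separator is an arbitrary choice.

```r
# Toy annotation table; the duplicated rows are copied from the output above
annotToy <- data.frame(
    ensembl_gene_id = c("ENSMUSG00000001138", "ENSMUSG00000044783",
                        "ENSMUSG00000044783", "ENSMUSG00000070645",
                        "ENSMUSG00000070645"),
    entrezgene_id = c(94218L, 212427L, 381280L, 19701L, 19702L),
    external_gene_name = c("Cnnm3", "Hjurp", "Hjurp", "Ren1", "Ren1"),
    stringsAsFactors = FALSE
)

# Collapse to one row per Ensembl gene, joining the Entrez IDs with ";".
# Caution: the formula interface drops rows with a missing Entrez ID by
# default, so genes without an Entrez ID would need separate handling.
annotCollapsed <- aggregate(
    entrezgene_id ~ ensembl_gene_id + external_gene_name,
    data = annotToy,
    FUN = function(x) paste(x, collapse = ";")
)
annotCollapsed
```

The same result could be obtained with group_by and summarise from dplyr; either way, the collapsed table has exactly one row per Ensembl gene and is safe to join onto the results.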

Retrieve full annotation

Challenge 1

That was just 1000 genes. We need annotations for the entire results table. Also, there may be some other interesting columns in BioMart that we wish to retrieve.

  1. Search the attributes and add the following to our list of attributes:
    1. The gene description
    2. The gene biotype
  2. Query BioMart using all of the genes in our results table (resLvV)

  3. How many Ensembl genes have multiple Entrez IDs associated with them?
  4. How many Ensembl genes in resLvV don’t have any annotation? Why is this?

Add annotation to the results table

We can now add the annotation to the results table and then save the results using the write_tsv function, which writes the results out to a tab-separated file. To save time we have created an annotation table in which we have modified the cumbersome BioMart column names, added median transcript length (we’ll need this in a later session), and dealt with the one-to-many issues for Entrez IDs.

load("Robjects/Ensembl_annotations.RData")
colnames(ensemblAnnot)
##  [1] "GeneID"         "Entrez"         "Symbol"         "Description"   
##  [5] "Biotype"        "Chr"            "Start"          "End"           
##  [9] "Strand"         "medianTxLength"
annotLvV <- as.data.frame(resLvV) %>% 
    rownames_to_column("GeneID") %>% 
    left_join(ensemblAnnot, "GeneID") %>% 
    rename(logFC=log2FoldChange, FDR=padj)

Finally we can output the annotated DE results using write_tsv.

write_tsv(annotLvV, "results/VirginVsLactating_Results_Annotated.txt")
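The small sketch below illustrates why the one-to-many mappings had to be resolved before joining. All data are made up, and base R merge with all.x = TRUE stands in for left_join: a gene with two annotation rows ends up duplicated in the joined table, while a gene with no annotation is kept with NA.

```r
# Made-up results: one row per gene
resToy <- data.frame(GeneID = c("geneA", "geneB", "geneC"),
                     logFC = c(1.2, -0.8, 0.1))

# Made-up annotation: geneB maps to two Entrez IDs, geneC has no annotation
annotDup <- data.frame(GeneID = c("geneA", "geneB", "geneB"),
                       Entrez = c("101", "202", "203"))

# Equivalent of a left join: keep every row of resToy
joined <- merge(resToy, annotDup, by = "GeneID", all.x = TRUE)
joined
nrow(joined)                 # one more row than resToy: geneB appears twice
sum(is.na(joined$Entrez))    # genes without annotation get NA
```

Comparing the number of rows before and after a join is a cheap way to spot this kind of duplication in real data.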

Visualisation

DESeq2 provides a function called lfcShrink that shrinks log-Fold Change (LFC) estimates towards zero using an empirical Bayes procedure. The reason for doing this is that there is high variance in the LFC estimates when counts are low, and this results in lowly expressed genes appearing to show greater differences between groups than highly expressed genes. The lfcShrink method compensates for this and allows better visualisation and ranking of genes. We will use it for our visualisation of the data.

ddsShrink <- lfcShrink(ddsObj, coef="Status_lactate_vs_virgin")
## using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).
## 
## Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
## See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
## Reference: https://doi.org/10.1093/bioinformatics/bty895
shrinkLvV <- as.data.frame(ddsShrink) %>%
    rownames_to_column("GeneID") %>% 
    left_join(ensemblAnnot, "GeneID") %>% 
    rename(logFC=log2FoldChange, FDR=padj)
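To see why fold-change estimates are noisier at low counts, here is a small simulation. It is only a sketch under simplifying assumptions: counts are drawn from a Poisson distribution (DESeq2 itself models counts with a negative binomial), there is no true difference between the two groups, and a pseudocount of 1 is used to avoid taking the log of zero.

```r
set.seed(42)
nSim <- 2000

# Simulate counts for two groups with NO true difference between them
# and return the observed log2 fold changes
simLFC <- function(meanCount) {
    a <- rpois(nSim, meanCount)
    b <- rpois(nSim, meanCount)
    # a pseudocount of 1 avoids taking the log of zero
    log2((a + 1) / (b + 1))
}

lfcLow <- simLFC(5)      # lowly expressed gene
lfcHigh <- simLFC(500)   # highly expressed gene

# The spread of the estimated fold changes is much larger at low counts
sd(lfcLow)
sd(lfcHigh)
```

Even though the true fold change is zero in both cases, the lowly expressed gene frequently shows large apparent changes. Shrinkage pulls such poorly supported estimates towards zero.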

P-value histogram

A quick and easy “sanity check” for our DE results is to generate a p-value histogram. What we should see is a high bar at 0 - 0.05 and then a roughly uniform tail to the right of this. There is a nice explanation of other possible patterns in the histogram and what to do when you see them in this post.

hist(shrinkLvV$pvalue)
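The expected shape can be reproduced with simulated p-values. This is an illustrative sketch only: the split into 9000 unchanged and 1000 changed genes and the Beta(0.5, 20) distribution used for the changed genes are arbitrary assumptions.

```r
set.seed(1)

# 9000 "null" genes: p-values are uniform between 0 and 1
pNull <- runif(9000)
# 1000 "truly changed" genes: p-values concentrated near 0
# (Beta(0.5, 20) is just a convenient stand-in)
pAlt <- rbeta(1000, shape1 = 0.5, shape2 = 20)

pSim <- c(pNull, pAlt)
h <- hist(pSim, breaks = seq(0, 1, length.out = 21),
          main = "Simulated p-values", xlab = "p-value")
h$counts[1]         # the 0 - 0.05 bin
max(h$counts[-1])   # tallest of the remaining, roughly flat, bins
```

The uniform tail comes from the unchanged genes, and the spike in the first bin comes from the genes with a real difference.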

MA plots

MA plots are a common way to visualize the results of a differential analysis. We met them briefly towards the end of Session 2. This plot shows the log-Fold Change for each gene against its average expression across all samples in the two conditions being contrasted.
DESeq2 has a handy function for plotting this…

plotMA(ddsShrink, alpha=0.05)

…this is fine for a quick look, but it is not easy to make changes to the way it looks or add things such as gene labels. Perhaps we would like to add labels for the top 10 most significantly differentially expressed genes. Let’s use the package ggplot2 instead.

A Brief Introduction to ggplot2

The ggplot2 package has emerged as an attractive alternative to the traditional plots provided by base R. A full overview of all capabilities of the package is available from the cheatsheet.

In brief:-

  • shrinkLvV is our data frame containing the variables we wish to plot
  • aes creates a mapping between the variables in our data frame to the aesthetic properties of the plot:
    • the x-axis will be mapped to log2(baseMean)
    • the y-axis will be mapped to the logFC
  • geom_point specifies the particular type of plot we want (in this case a scatter plot)
  • geom_text allows us to add labels to some or all of the points

The real advantage of ggplot2 is the ability to change the appearance of our plot by mapping other variables to aspects of the plot. For example, below we colour the points according to whether each gene passes an FDR threshold of 0.05, and label the most significant genes. The colours are automatically chosen by ggplot2, but we can specify particular values.

# add a column with the names of only the top 10 genes
cutoff <- sort(shrinkLvV$pvalue)[10]
shrinkLvV <- shrinkLvV %>% 
    mutate(TopGeneLabel=ifelse(pvalue<=cutoff, Symbol, ""))

ggplot(shrinkLvV, aes(x = log2(baseMean), y=logFC)) + 
    geom_point(aes(colour=FDR < 0.05), shape=20, size=0.5) +
    geom_text(aes(label=TopGeneLabel)) +
    labs(x="log2 mean of normalised counts", y="log fold change")

Volcano plot

Another common visualisation is the volcano plot which displays a measure of significance on the y-axis and fold-change on the x-axis.

Challenge 2

Use the log2 fold change (logFC) on the x-axis, and use -log10(pvalue) on the y-axis. (This -log10 transformation is commonly used for p-values as it means that more significant genes have a higher value.)
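A quick check of the transformation on a few made-up p-values shows how smaller p-values map to larger values:

```r
# Made-up p-values, from least to most significant
pvals <- c(0.5, 0.05, 0.001, 1e-10)
negLogP <- -log10(pvals)
negLogP

# Note: a p-value of exactly 0 would give Inf
-log10(0)
```

Each additional unit on the -log10 scale corresponds to a p-value that is ten times smaller.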

  1. Create a column of -log10(pvalue) values

  2. Create a plot with points coloured according to whether pvalue < 0.05

An example of what your plot should look like:

Strip chart for gene expression

Before following up on the DE genes with further lab work, a recommended sanity check is to have a look at the expression levels of the individual samples for the genes of interest. We can quickly look at grouped expression by using the plotCounts function of DESeq2 to retrieve the normalised expression values from the ddsObj object and then plotting with ggplot2.

# Let's look at the most significantly differentially expressed gene
topgene <- filter(shrinkLvV, Symbol=="Wap")
geneID <- topgene$GeneID
plotCounts(ddsObj, gene = geneID, intgroup = c("CellType", "Status"),
           returnData = TRUE) %>% 
    ggplot(aes(x=Status, y=log2(count))) +
      geom_point(aes(fill=Status), shape=21, size=2) +
      facet_wrap(~CellType) +
      expand_limits(y=0)

Interactive StripChart with Glimma

An interactive version of the volcano plot above that includes the raw per sample values in a separate panel is possible via the glXYPlot function in the Glimma package.

library(Glimma)

group <- str_remove_all(sampleinfo$Group, "[aeiou]")

de <- as.integer(shrinkLvV$FDR <= 0.05 & !is.na(shrinkLvV$FDR))

# use normalised counts; the pseudocount of 1 avoids -Inf for zero counts
normCounts <- log2(counts(ddsObj, normalized = TRUE) + 1)

glXYPlot(
  x = shrinkLvV$logFC,
  y = -log10(shrinkLvV$pvalue),
  xlab = "logFC",
  ylab = "-log10(pvalue)",
  main = "Lactating v Virgin",
  counts = normCounts,
  groups = group,
  status = de,
  anno = shrinkLvV[, c("GeneID", "Symbol", "Description")],
  folder = "volcano"
)

This function creates an html page (./volcano/XY-Plot.html) with a volcano plot on the left and a plot showing the log2 counts per sample for a selected gene on the right. A search bar is available to search for genes of interest.

Heatmap

We’re going to use the package ComplexHeatmap (Z. Gu, Eils, and Schlesner 2016). We’ll also use circlize to generate a colour scale (Z. Gu et al. 2014).

library(ComplexHeatmap)
library(circlize)

We can’t plot the entire data set, so let’s just select the top 150 genes by FDR. We’ll also z-transform the counts.

# get the top genes
sigGenes <- as.data.frame(shrinkLvV) %>% 
    top_n(150, wt=-FDR) %>% 
    pull("GeneID")

# extract the variance-stabilised expression values for the selected genes
plotDat <- vst(ddsObj)[sigGenes,] %>% 
    assay()
z.mat <- t(scale(t(plotDat), center=TRUE, scale=TRUE))
# colour palette
myPalette <- c("red3", "ivory", "blue3")
myRamp = colorRamp2(c(-2, 0, 2), myPalette)
Heatmap(z.mat, name = "z-score",
        col = myRamp,            
        show_row_names = FALSE,
        cluster_columns = FALSE)
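The row-wise z-transformation used above can be checked on a small made-up matrix. Because scale works on columns, the matrix is transposed before and after scaling so that each gene (row) ends up with mean 0 and standard deviation 1.

```r
# Made-up expression matrix: 3 genes (rows) x 4 samples (columns)
toyMat <- matrix(c(  1,   2,   3,   4,
                   100, 200, 300, 400,
                     5,   5,   6,   8),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), paste0("s", 1:4)))

# scale() works on columns, so transpose, scale, then transpose back
toyZ <- t(scale(t(toyMat), center = TRUE, scale = TRUE))

round(rowMeans(toyZ), 10)    # every gene now has mean 0
apply(toyZ, 1, sd)           # ... and standard deviation 1
toyZ
```

gene1 and gene2 differ 100-fold in absolute expression but follow the same pattern across samples, so their z-scores are identical. This is what lets the heatmap compare patterns between genes with very different expression levels.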

We can also split the heat map into clusters and add some annotation.

# cluster the data and split the tree
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)

ha1 = HeatmapAnnotation(df = colData(ddsObj)[,c("CellType", "Status")])

Heatmap(z.mat, name = "z-score",
        col = myRamp,            
        show_row_names = FALSE,
        cluster_columns = FALSE,
        split=cutGroups,
        rect_gp = gpar(col = "darkgrey", lwd=0.5),
        top_annotation = ha1)
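How hclust and cutree interact can be seen on a tiny made-up matrix. The values and the cut height of 1 below are chosen purely for illustration and are unrelated to the h=4 used on the real data.

```r
# Made-up z-score matrix with two obvious groups of rows
toyMat2 <- rbind(g1 = c(-1.0, -1.0,  1.0,  1.0),
                 g2 = c(-0.9, -1.1,  1.1,  0.9),
                 g3 = c( 1.0,  1.0, -1.0, -1.0),
                 g4 = c( 1.1,  0.9, -0.9, -1.1))

# Complete linkage (the hclust default) on Euclidean distances
hcToy <- hclust(dist(toyMat2))
hcToy$height    # heights at which clusters are merged

# Cutting between the small within-group merge heights and the large
# final merge yields two clusters
toyGroups <- cutree(hcToy, h = 1)
toyGroups
```

Inspecting the merge heights (or plotting the dendrogram) is a reasonable way to choose a cut height; alternatively, cutree accepts k to request a fixed number of clusters.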

save(annotLvV, shrinkLvV, file="results/Annotated_Results_LvV.RData")

Additional Material

There is additional material for you to work through in the Supplementary Materials directory. Details include using genomic ranges, retrieving gene models, exporting browser tracks and some extra useful plots like the one below.


References

Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics.

Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in R.” Bioinformatics 30 (19): 2811–2.