library(EnsDb.Mmusculus.v79)
library(DESeq2)
library(tidyverse)
Before starting this section, we will make sure we have all the relevant objects from the Differential Expression analysis.
load("Robjects/DE.RData")
We have a list of significantly differentially expressed genes, but the only annotation we can see is the Ensembl Gene ID, which is not very informative.
There are a number of ways to add annotation. One method is to do this using a Bioconductor annotation package. These packages which are re-built every periodically with the latest annotations. These packages are listed on the annotation section of the Bioconductor, and are installed in the same way as regular Bioconductor packages.
An alternative approach is to use biomaRt
, an interface to the BioMart resource. Using BioMart ensures that you are able to get the latest annotations for the GeneIDs, and can match the version of the gene annotation that was used for read counting.
Using an annotation packages means that you may not have the exact same version of the gene annotations as was used to do the counting.
Today we will use the annotation package method.
We use the select function to query the database. Now we need to set up a query. This requires us to tell it what we want and what we have. For this we need to specify three things:
# what can we search for? 'columns'
columns(EnsDb.Mmusculus.v79)
## [1] "ENTREZID" "EXONID" "EXONIDX"
## [4] "EXONSEQEND" "EXONSEQSTART" "GENEBIOTYPE"
## [7] "GENEID" "GENENAME" "GENESEQEND"
## [10] "GENESEQSTART" "INTERPROACCESSION" "ISCIRCULAR"
## [13] "PROTDOMEND" "PROTDOMSTART" "PROTEINDOMAINID"
## [16] "PROTEINDOMAINSOURCE" "PROTEINID" "PROTEINSEQUENCE"
## [19] "SEQCOORDSYSTEM" "SEQLENGTH" "SEQNAME"
## [22] "SEQSTRAND" "SYMBOL" "TXBIOTYPE"
## [25] "TXCDSSEQEND" "TXCDSSEQSTART" "TXID"
## [28] "TXNAME" "TXSEQEND" "TXSEQSTART"
## [31] "UNIPROTDB" "UNIPROTID" "UNIPROTMAPPINGTYPE"
# what can we search with? 'keytypes'
keytypes(EnsDb.Mmusculus.v79)
## [1] "ENTREZID" "EXONID" "GENEBIOTYPE"
## [4] "GENEID" "GENENAME" "PROTDOMID"
## [7] "PROTEINDOMAINID" "PROTEINDOMAINSOURCE" "PROTEINID"
## [10] "SEQNAME" "SEQSTRAND" "SYMBOL"
## [13] "TXBIOTYPE" "TXID" "TXNAME"
## [16] "UNIPROTID"
# lets set it up
ourCols <- c("SYMBOL", "GENEID", "ENTREZID")
ourKeys <- rownames(resLvV)[1:1000]
# run the query
annot <- AnnotationDbi::select(EnsDb.Mmusculus.v79,
keys=ourKeys,
columns=ourCols,
keytype="GENEID")
Let’s inspect the annotation.
head(annot)
## SYMBOL GENEID ENTREZID
## 1 Xkr4 ENSMUSG00000051951 497097
## 2 RP23-122M2.3 ENSMUSG00000102331 NA
## 3 Rp1 ENSMUSG00000025900 19888
## 4 Sox17 ENSMUSG00000025902 20671
## 5 Gm6085 ENSMUSG00000098104 NA
## 6 RP23-34E15.4 ENSMUSG00000103922 NA
length(unique(annot$ENTREZID)) # Why are there NAs in the ENTREZID column?
## [1] 533
dim(annot) # why are there more than 1000 rows?
## [1] 1010 3
# find all rows containing duplicated ensembl ids
annot %>%
add_count(GENEID) %>%
dplyr::filter(n>1)
## SYMBOL GENEID ENTREZID n
## 1 Tceb1 ENSMUSG00000079658 67923 2
## 2 Tceb1 ENSMUSG00000079658 102642819 2
## 3 Ptp4a1 ENSMUSG00000026064 19243 2
## 4 Ptp4a1 ENSMUSG00000026064 102643131 2
## 5 Cox5b ENSMUSG00000061518 12859 3
## 6 Cox5b ENSMUSG00000061518 102638382 3
## 7 Cox5b ENSMUSG00000061518 102641600 3
## 8 Rpl31 ENSMUSG00000073702 665562 2
## 9 Rpl31 ENSMUSG00000073702 102641215 2
## 10 Hspd1 ENSMUSG00000025980 15510 2
## 11 Hspd1 ENSMUSG00000025980 102641967 2
## 12 Mff ENSMUSG00000026150 75734 2
## 13 Mff ENSMUSG00000026150 102643056 2
## 14 C130026I21Rik ENSMUSG00000052477 381287 4
## 15 C130026I21Rik ENSMUSG00000052477 620078 4
## 16 C130026I21Rik ENSMUSG00000052477 100041057 4
## 17 C130026I21Rik ENSMUSG00000052477 102639105 4
## 18 Ptma ENSMUSG00000026238 19231 2
## 19 Ptma ENSMUSG00000026238 100504173 2
## 20 Ugt1a7c ENSMUSG00000090124 394432 2
## 21 Ugt1a7c ENSMUSG00000090124 102641635 2
## 22 Ugt1a6a ENSMUSG00000054545 94284 2
## 23 Ugt1a6a ENSMUSG00000054545 102641635 2
## 24 Hjurp ENSMUSG00000044783 212427 2
## 25 Hjurp ENSMUSG00000044783 381280 2
## 26 Ren1 ENSMUSG00000070645 19701 3
## 27 Ren1 ENSMUSG00000070645 19702 3
## 28 Ren1 ENSMUSG00000070645 102643057 3
## 29 Snrpe ENSMUSG00000090553 20643 2
## 30 Snrpe ENSMUSG00000090553 102632439 2
There are quite a few Ensembl IDs with no EntrezID. This either due to differences in versions between our gtf we used for counting and the version in the EnsDb.Mmusculus.v79 or they haven’t designated a match for that ID. The two databases don’t match on a 1:1 level although they are taking step towards consolidating in recent years.
There are some genes that have multiple entries in the retrieved annotation. This is becaues there are multiple Entrez IDs for a single Ensembl gene. These one-to-many relationships come up frequently in genomic databases, it is important to be aware of them and check when necessary.
We will need to do a little work before adding the annotation to out results table. We could decide to discard one or both of the Entrez ID mappings, or we could concatenate the Entrez IDs so that we don’t lose information.
So far we have retrieved the annotation for just 1000 genes, but we need annotations for the entire results table.
A reminder of the code we have used so far:
# lets set it up
ourCols <- c("GENEID", "SYMBOL", "ENTREZID")
ourKeys <- rownames(resLvV)[1:1000]
# run the query
annot <- AnnotationDbi::select(EnsDb.Mmusculus.v79,
keys=ourKeys,
columns=ourCols,
keytype="GENEID")
Run the same query using all of the genes in our results table (
resLvV
), and this time include the biotype of the genes too. Hint: You can find the name of the column for this by runningcolumns(EnsDb.Mmusculus.v79)
How many Ensembl genes have multipe Entrez IDs associated with them?
Are all of the Ensembl gene IDs annotated? If not, why do you think this is?
Dealing with all the one-to-many annotation mappings requires some manual curation of your annotation table.
To save time we have created an annotation table in which we have modified the column names and dealt with the one-to-many/missing issues for Entrez IDs.
load("Robjects/Ensembl_annotations.RData")
colnames(ensemblAnnot)
## [1] "GeneID" "Entrez" "Symbol" "Description"
## [5] "Biotype" "Chr" "Start" "End"
## [9] "Strand" "medianTxLength"
annotLvV <- as.data.frame(resLvV) %>%
rownames_to_column("GeneID") %>%
left_join(ensemblAnnot, "GeneID") %>%
rename(logFC=log2FoldChange, FDR=padj)
Finally we can output the annotation DE results using write_tsv
.
write_tsv(annotLvV, "results/VirginVsLactating_Results_Annotated.txt")
A quick and easy “sanity check” for our DE results is to generate a p-value histogram. What we should see is a high bar at 0 - 0.05
and then a roughly uniform tail to the right of this. There is a nice explanation of other possible patterns in the histogram and what to do when you see them in this post.
hist(resLvV$pvalue)
DESeq2
provides a functon called lfcShrink
that shrinks log-Fold Change (LFC) estimates towards zero using and empirical Bayes procedure. The reason for doing this is that there is high variance in the LFC estimates when counts are low and this results in lowly expressed genes appearing to show greater differences between groups than highly expressed genes. The lfcShrink
method compensates for this and allows better visualisation and ranking of genes. We will use it for our visualisation of the data.
ddsShrink <- lfcShrink(ddsObj, coef="Status_lactate_vs_virgin")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
shrinkLvV <- as.data.frame(ddsShrink) %>%
rownames_to_column("GeneID") %>%
left_join(ensemblAnnot, "GeneID") %>%
rename(logFC=log2FoldChange, FDR=padj)
MA plots are a common way to visualize the results of a differential analysis. We met them briefly towards the end of Session 2. This plot shows the log-Fold Change for each gene against its average expression across all samples in the two conditions being contrasted.
DESeq2
has a handy function for plotting this. Let’s use it too compare the shrunk and un-shrunk fold changes.
par(mfrow=c(1,2))
plotMA(resLvV, alpha=0.05)
plotMA(ddsShrink, alpha=0.05)
The DESeq2 in plotMA
function is fine for a quick look, but it is not easy to make changes to the way it looks or add things such as gene labels. Perhaps we would like to add labels for the top 20 most significantly differentially expressed genes. Let’s use the package ggplot2
instead.
ggplot2
The ggplot2
package has emerged as an attractive alternative to the traditional plots provided by base R. A full overview of all capabilities of the package is available from the cheatsheet.
In brief:-
shrinkLvV
is our data frame containing the variables we wish to plotaes
creates a mapping between the variables in our data frame to the aesthetic properties of the plot:
baseMean
)logFC
geom_point
specifies the particular type of plot we want (in this case a bar plot)geom_text
allows us to add labels to some or all of the points
The real advantage of ggplot2
is the ability to change the appearance of our plot by mapping other variables to aspects of the plot. For example, we could colour the points based on the sample group. To do this we can add metadata from the sampleinfo
table to the data. The colours are automatically chosen by ggplot2
, but we can specifiy particular values.
ggplot(shrinkLvV, aes(x = log2(baseMean), y=logFC)) +
geom_point(aes(colour=FDR < 0.05), shape=20, size=0.5) +
geom_text(data=~top_n(.x, 10, wt=-FDR), aes(label=Symbol)) +
labs(x="mean of normalised counts", y="log fold change")
Challenge 2 - Volcano plot
Another common visualisation is the volcano plot which displays a measure of significance on the y-axis and fold-change on the x-axis.
If you haven’t already make sure you load in our data and annotation. Then shrink the values. You can copy and paste the code below.
# First load data and annotations
load("Robjects/DE.RData")
load("Robjects/Ensembl_annotations.RData")
#Shrink our values
ddsShrink <- lfcShrink(ddsObj, coef="Status_lactate_vs_virgin")
shrinkLvV <- as.data.frame(ddsShrink) %>%
rownames_to_column("GeneID") %>%
left_join(ensemblAnnot, "GeneID") %>%
rename(logFC=log2FoldChange, FDR=padj)
Use the log2 fold change (
logFC
) on the x-axis, and use-log10(pvalue)
on the y-axis. (This-log10
transformation is commonly used for p-values as it means that more significant genes have a higher scale)
Create a new column of -log10(pvalue) values in shrinkLvV
Create a plot with points coloured by FDR < 0.05 similar to how we did in the MA plot
An example of what your plot should look like:
We’re going to use the package ComplexHeatmap
(Gu, Eils, and Schlesner 2016). We’ll also use circlize
to generate a colour scale (Gu et al. 2014).
library(ComplexHeatmap)
library(circlize)
We can’t plot the entire data set, let’s just select the top 150 by FDR. We’ll want to use normalised expression values, so we’ll use the vst
function.
# get the top genes
sigGenes <- shrinkLvV %>%
top_n(150, wt=-FDR) %>%
pull("GeneID")
# filter the data for the top 200 by padj in the LRT test
plotDat <- vst(ddsObj)[sigGenes,] %>%
assay()
The range expression values for different genes can vary widely. Some genes will have very high expression. Our heatmap is going to be coloured according to gene expression. If we used a colour scale from 0 (no expression) to the maximum expression, the scale will be dominated by our most extreme genes and it will be difficult to discern any difference between most of the genes.
To overcome this we will z-scale the counts. This scaling method results in values for each that show the number of standard deviations the gene expression is from the mean for that gene across all the sample - the mean will be ‘0’, ‘1’ means 1 standard deviation higher than the mean, ‘-1’ means 1 standard deviation lower than the mean.
z.mat <- t(scale(t(plotDat), center=TRUE, scale=TRUE))
# colour palette
myPalette <- c("royalblue3", "ivory", "orangered3")
myRamp <- colorRamp2(c(-2, 0, 2), myPalette)
Heatmap(z.mat, name = "z-score",
col = myRamp,
show_row_names = FALSE,
cluster_columns = FALSE)
we can also split the heat map into clusters and add some annotation.
ha1 = HeatmapAnnotation(df = colData(ddsObj)[,c("CellType", "Status")])
Heatmap(z.mat, name = "z-score",
col = myRamp,
show_row_name = FALSE,
cluster_columns = FALSE,
split=7,
rect_gp = gpar(col = "darkgrey", lwd=0.5),
top_annotation = ha1)
save(annotLvV, shrinkLvV, file="results/Annotated_Results_LvV.RData")
Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics.
Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in R.” Bioinformatics 30 (19): 2811–2.