Exercise 1 - Retrieve the full annotation

So far we have retrieved the annotation for just 1000 genes, but we need annotations for the entire results table.

A reminder of the code we have used so far:

# First load data if you haven't already
load("Robjects/DE.RData")

# lets set it up
ourCols <- c("GENEID", "SYMBOL", "ENTREZID")
ourKeys <- rownames(resLvV)[1:1000]

# run the query
annot <- AnnotationDbi::select(EnsDb.Mmusculus.v79, 
                keys=ourKeys, 
                columns=ourCols, 
                keytype="GENEID")
  1. Run the same query using all of the genes in our results table (resLvV), and this time include the biotype of the genes too. Hint: You can find the name of the column for this by running columns(EnsDb.Mmusculus.v79)

  2. How many Ensembl genes have multipe Entrez IDs associated with them?

  3. Are all of the Ensembl gene IDs annotated? If not, why do you think this is?

# (a)

# use all of the genes
ourKeys <- rownames(resLvV)

# Include the biotype
columns(EnsDb.Mmusculus.v79)
##  [1] "ENTREZID"            "EXONID"              "EXONIDX"            
##  [4] "EXONSEQEND"          "EXONSEQSTART"        "GENEBIOTYPE"        
##  [7] "GENEID"              "GENENAME"            "GENESEQEND"         
## [10] "GENESEQSTART"        "INTERPROACCESSION"   "ISCIRCULAR"         
## [13] "PROTDOMEND"          "PROTDOMSTART"        "PROTEINDOMAINID"    
## [16] "PROTEINDOMAINSOURCE" "PROTEINID"           "PROTEINSEQUENCE"    
## [19] "SEQCOORDSYSTEM"      "SEQLENGTH"           "SEQNAME"            
## [22] "SEQSTRAND"           "SYMBOL"              "TXBIOTYPE"          
## [25] "TXCDSSEQEND"         "TXCDSSEQSTART"       "TXID"               
## [28] "TXNAME"              "TXSEQEND"            "TXSEQSTART"         
## [31] "UNIPROTDB"           "UNIPROTID"           "UNIPROTMAPPINGTYPE"
## column 6 is "GENEBIOTYPE"
ourCols <- c("GENEID", "SYMBOL", "ENTREZID", "GENEBIOTYPE")

# run the query
annot <- AnnotationDbi::select(EnsDb.Mmusculus.v79, 
                keys=ourKeys, 
                columns=ourCols,
                keytype="GENEID")

# (b)
annot %>%  
    add_count(GENEID) %>%  
    dplyr::filter(n>1) %>% 
    distinct(GENEID) %>%
    count()
##     n
## 1 352
# (c)
dim(annot)
## [1] 22514     4
length(ourKeys)
## [1] 25369

Challenge 2

If you haven’t already make sure you load in our data and annotation. Then shrink the values. You can copy and paste the code below.

## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
# First load data and annotations if you haven't already
load("Robjects/DE.RData")
load("Robjects/Ensembl_annotations.RData")

#Shrink our values
ddsShrink <- lfcShrink(ddsObj, coef="Status_lactate_vs_virgin")
shrinkLvV <- as.data.frame(ddsShrink) %>%
    rownames_to_column("GeneID") %>% 
    left_join(ensemblAnnot, "GeneID") %>% 
    rename(logFC=log2FoldChange, FDR=padj)

Use the log2 fold change (logFC) on the x-axis, and use -log10(pvalue) on the y-axis. (This -log10 transformation is commonly used for p-values as it means that more significant genes have a higher scale)

  1. Create a new column of -log10(pvalue) values in shrinkLvV

  2. Create a plot with points coloured by FDR < 0.05 similar to how we did in the MA plot

shrinkLvV %>% 
    mutate(`-log10(pvalue)` = -log10(pvalue)) %>% 
    ggplot(aes(x = logFC, y=`-log10(pvalue)`)) + 
      geom_point(aes(colour=FDR < 0.05), size=1)