Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law Based on the course RNAseq analysis in R delivered on May 11/12th 2016

Before starting this section, we will make sure we have all the relevant objects from the Differential Expression analysis present.

load("Robjects/DE.Rdata")

Overview

Main Materials

Create a table of differential expression results

We have a list of significantly differentially expressed genes, but the only annotation we can see is the Entrez Gene ID, which is not very informative.

results <- as.data.frame(topTags(lrt_BvsL, n = Inf))
head(results)

Visualise results with a ‘Smear plot’

edgeR provides a function plotSmear that allows us to visualise the results of a DE analysis. This plot shows the log-fold change against log-counts per million, with DE genes highlighted:

# identify differentially expressed genes
de <- decideTestsDGE(lrt_BvsL)
summary(de)
       CellTypeluminal
Down              2957
NotSig           11232
Up                1615
# create a vector of differentially expressed genes
detags <- rownames(dgeObj)[as.logical(de)]
# plot smear highlighting DE genes
plotSmear(lrt_BvsL, de.tags = detags)

However, on such a plot it would be nice to add labels to highlight the genes with most evidence for being DE, or our favourite genes. To perform such a task we need to map between the identifiers we have in the edgeR output and more familiar names.

Finally, we will look at sophisticated visualisations that allow us to incorporate information about the structure of a gene, level of sequencing coverage.

Adding annotation to the edgeR results

There are a number of ways to add annotation, but we will demonstrate how to do this using the org.Mm.eg.db package. This package is one of several organism-level packages which are re-built every 6 months. These packages are listed on the annotation section of the Bioconductor, and are installed in the same way as regular Bioconductor packages. An alternative approach is to use biomaRt, an interface to the BioMart resource. BioMart is much more comprehensive, but the organism packages fit better into the Bioconductor workflow.

# source("http://www.bioconductor.org/biocLite.R")
# # for mouse
# biocLite("org.Mm.eg.db")
# #For Human
# biocLite("org.Hs.eg.db")

The packages are larger in size that Bioconductor software pacakges, but essentially they are databases that can be used to make offline queries.

library(org.Mm.eg.db)

First we need to decide what information we want. In order to see what we can extract we can run the columns function on the annotation database.

columns(org.Mm.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
 [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "IPI"          "MGI"          "ONTOLOGY"     "ONTOLOGYALL" 
[17] "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"     

We are going to filter the database by a key or set of keys in order to extract the information we want. Valid names for the key can be retrieved with the keytypes function.

keytypes(org.Mm.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
 [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "IPI"          "MGI"          "ONTOLOGY"     "ONTOLOGYALL" 
[17] "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"     

We should see ENTREZID, which is the type of key we are going to use in this case. If we are unsure what values are acceptable for the key, we can check what keys are valid with keys

keys(org.Mm.eg.db, keytype="ENTREZID")[1:10]
 [1] "11287" "11298" "11302" "11303" "11304" "11305" "11306" "11307" "11308" "11350"

It is a useful sanity check to make sure that the keys you want to use are all valid. We could use %in% in this case.

## Build up the query step-by-step
my.keys <- rownames(results)[1:10]
my.keys %in% keys(org.Mm.eg.db, keytype="ENTREZID")
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
all(my.keys %in% keys(org.Mm.eg.db, keytype="ENTREZID"))
[1] TRUE

To annotate our results, we definitely want gene symbols and perhaps the full gene name. Let’s build up our annotation information in a separate data frame using the select function.

annot <- select(org.Mm.eg.db,
              keys=rownames(results), 
              columns=c("ENTREZID","SYMBOL","GENENAME"))
'select()' returned 1:1 mapping between keys and columns
# Have a look at the annotation
head(annot)

Let’s double check that the ENTREZID column matches exactly to our results rownames.

table(annot$ENTREZID==rownames(results))

 TRUE 
15804 

We can bind in the annotation information to the results data frame. (Please note that if the select function returns a 1:many mapping then you can’t just append the annotation to the fit object.)

results.annotated <- cbind(results, annot)
head(results.annotated)

We can save the results table using the write.csv function, which writes the results out to a csv file that you can open in excel.

write.csv(results.annotated,
          file="B.PregVsLacResults.csv",
          row.names=FALSE)

A note about deciding how many genes are significant: In order to decide which genes are differentially expressed, we usually take a cut-off of 0.05 on the adjusted p-value, NOT the raw p-value. This is because we are testing more than 15000 genes, and the chances of finding differentially expressed genes is very high when you do that many tests. Hence we need to control the false discovery rate, which is the adjusted p-value column in the results table. What this means is that if 100 genes are significant at a 5% false discovery rate, we are willing to accept that 5 will be false positives. Note that the decideTests function displays significant genes at 5% FDR.

Challenge

Re-visit the plotSmear plot from above and use the text function to add labels for the names of the top 20 most DE genes

Further visualisation

Volcano plot

Another common visualisation is the volcano plot which displays a measure of significance on the y-axis and fold-change on the x-axis.

signif <- -log10(results.annotated$FDR)
plot(results.annotated$logFC, 
     signif,
     bg="black",
     pch=21,
     xlab="log2(Fold Change)")
points(results.annotated[detags,"logFC"],
       -log10(results.annotated[detags,"FDR"]),
       pch=21,
       bg="red")

Strip chart for gene expression

Before following up on the DE genes with further lab work, a recommended sanity check is to have a look at the expression levels of the individual samples for the genes of interest. We can quickly look at grouped expression using stripchart. We can use the normalised log expression values in the dgeCounts object (dgeCounts$counts).

normCounts <- cpm(dgeObj, log = T)
# Let's look at the first gene in the topTable, Krt5, which has a rowname 110308
par(mar=c(8,4,2,2)) #adjust the plot margins the x-labels are visible - see ?par
stripchart(normCounts["110308",]~sampleinfo$Group,
           xlab="",
           ylab="log2(Counts)",
           vertical=TRUE,
           las=2,
           pch=21,
           col=1:6,
           cex=2)

Interactive StripChart with Glimma

An interactive version of the volcano plot above that includes the raw per sample values in a separate panel is possible via the glXYPlot function in the Glimma package.

library(Glimma)
group <- sampleinfo$Group
levels(group) <- c("basal.lactate","basal.preg","basal.virgin","lum.lactate", "lum.preg", "lum.virgin")
annot.mod <- annot
colnames(annot.mod)[1] <- "GeneID"
de <- as.numeric(results$FDR<=0.01)
glXYPlot(x=results$logFC, y=-log10(results$FDR),
         xlab="logFC", ylab="B", main="B.PregVsLac",
         counts=normCounts, groups=group, status=de,
         anno=annot.mod, id.column="ENTREZID", folder="volcano")

This function creates an html page (./volcano/XY-Plot.html) with a volcano plot on the left and a plot showing the log-CPM per sample for a selected gene on the right. A search bar is available to search for genes of interest.

Retrieving Genomic Locations

It might seem natural to add genomic locations to our annotation table, and possibly a bit odd that the org.Mm.eg.db package does not supply such mappings. In fact, there is a whole suite of package for performing this, and more-advanced queries that relate to the location of genes. These are listed on the Bioconductor annotation page and have the prefix TxDb.

The package we will be using is TxDb.Mmusculus.UCSC.mm10.knownGene. Packages are available for other organisms and genome builds. It is even possible to build your own database if one does not exist. See vignette("GenomicFeatures") for details

# source("http://www.bioconductor.org/biocLite.R")
## For mouse
# biocLite("TxDb.Mmusculus.UCSC.mm10.knownGene")
## For Humans
# biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")

We load the library in the usual fashion and create a new object to save some typing. As with the org. packages, we can query what columns are available with columns,

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
tx <- TxDb.Mmusculus.UCSC.mm10.knownGene
columns(tx)
 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"   "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"    
[10] "EXONNAME"   "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"    "TXEND"      "TXID"       "TXNAME"    
[19] "TXSTART"    "TXSTRAND"   "TXTYPE"    

The select function is used in the same manner as the org.Mm.eg.db packages.

Challenge 2

Use the TxDb.Mmusculus.UCSC.mm10.knownGene package to retrieve the exon coordinates for the genes 50916, 110308, 12293

Overview of GenomicRanges

One of the real strengths of the txdb.. packages is the ability of interface with GenomicRanges, which is the object type used throughout Bioconductor to manipulate Genomic Intervals.

These object types permit us to perform common operations on intervals such as overlapping and counting. We can define the chromosome, start and end position of each region (also strand too, but not shown here).

library(GenomicRanges)
simple_range <- GRanges(seqnames = "1", ranges = IRanges(start=1000, end=2000))
simple_range
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1 1000-2000      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

We don’t have to have all our ranges located on the same chromosome

chrs <- c("chr13", "chr15", "chr5")
start <- c(73000000, 101000000, 15000000)
end <- c(74000000, 102000000, 16000000)
my_ranges <- GRanges(seqnames = rep(chrs, 3),
                     ranges = IRanges(start = rep(start, each = 3),
                                      end = rep(end, each = 3))
                     )
my_ranges
GRanges object with 9 ranges and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]    chr13   73000000-74000000      *
  [2]    chr15   73000000-74000000      *
  [3]     chr5   73000000-74000000      *
  [4]    chr13 101000000-102000000      *
  [5]    chr15 101000000-102000000      *
  [6]     chr5 101000000-102000000      *
  [7]    chr13   15000000-16000000      *
  [8]    chr15   15000000-16000000      *
  [9]     chr5   15000000-16000000      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

There are a number of useful functions for calculating properties of the data (such as coverage or sorting). Not so much for RNA-seq analysis, but GenomicRanges are used throughout Bioconductor for the analysis of NGS data.

For instance, we can quickly identify overlapping regions between two GenomicRanges.

keys <- c("50916", "110308", "12293")
genePos <- select(tx,
                  keys = keys,
                  keytype = "GENEID",
                  columns = c("EXONCHROM", "EXONSTART", "EXONEND")
                  )
'select()' returned 1:many mapping between keys and columns
geneRanges <- GRanges(genePos$EXONCHROM, 
                      ranges = IRanges(genePos$EXONSTART, genePos$EXONEND), 
                      GENEID = genePos$GENEID)
geneRanges
GRanges object with 58 ranges and 1 metadata column:
       seqnames            ranges strand |      GENEID
          <Rle>         <IRanges>  <Rle> | <character>
   [1]    chr13 73260497-73260876      * |       50916
   [2]    chr13 73264848-73264979      * |       50916
   [3]    chr13 73265458-73265709      * |       50916
   [4]    chr13 73266596-73266708      * |       50916
   [5]    chr13 73267504-73267832      * |       50916
   ...      ...               ...    ... .         ...
  [54]     chr5 16370558-16374511      * |       12293
  [55]     chr5 16341990-16342010      * |       12293
  [56]     chr5 16326327-16326383      * |       12293
  [57]     chr5 16322539-16322595      * |       12293
  [58]     chr5 16267376-16268604      * |       12293
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
findOverlaps(my_ranges, geneRanges)
Hits object with 16 hits and 0 metadata columns:
       queryHits subjectHits
       <integer>   <integer>
   [1]         1           1
   [2]         1           2
   [3]         1           3
   [4]         1           4
   [5]         1           5
   ...       ...         ...
  [12]         5          12
  [13]         5          13
  [14]         5          14
  [15]         5          15
  [16]         9          16
  -------
  queryLength: 9 / subjectLength: 58

However, we have to pay attention to the naming convention used for each object. seqlevelsStyle can help.

seqlevelsStyle(geneRanges)
[1] "UCSC"
seqlevelsStyle(my_ranges)
[1] "UCSC"
seqlevelsStyle(simple_range)
[1] "NCBI"    "Ensembl" "MSU6"    "AGPvF"  

Exporting tracks

It is also possible to save the results of a Bioconductor analysis in a browser to enable interactive analysis and integration with other data types, or sharing with collaborators. For instance, we might want a browser track to indicate where our differentially-expressed genes are located. We shall use the bed format to display these locations. We will annotate the ranges with information from our analysis such as the fold-change and significance.

First we create a data frame for just the DE genes.

sigGenes <- results.annotated[detags,]
message("Number of significantly DE genes: ", nrow(sigGenes))
Number of significantly DE genes: 4572
head(sigGenes)

Create a genomic ranges object

Several convenience functions exist to retrieve the structure of every gene from a given TxDb object in one list. The output of exonsBy is a list, where each item in the list is the exon co-ordinates of a particular gene, however, we do not need this level of granularity for the bed output, so we will collapse to a single region for each gene.

First we use the range function to obtain a single range for every gene and tranform to a more convenient object with unlist.

exo <- exonsBy(tx, "gene")
exoRanges <- unlist(range(exo))
sigRegions <- exoRanges[na.omit(match(sigGenes$ENTREZID, names(exoRanges)))]
sigRegions
GRanges object with 4393 ranges and 0 metadata columns:
         seqnames              ranges strand
            <Rle>           <IRanges>  <Rle>
  497097     chr1     3214482-3671498      -
   20671     chr1     4490928-4497354      -
   58175     chr1     4909576-5070285      -
   76187     chr1     9548046-9577968      +
   72481     chr1     9560833-9631092      -
     ...      ...                 ...    ...
  195727     chrX 161836430-162159441      -
  108012     chrX 163909017-163933666      +
   56078     chrX 163976822-164028010      -
   54156     chrX 166523007-166585716      -
  333605     chrX 167471306-168577233      -
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

For visualisation purposes, we are going to restrict the data to genes that are located on chromosomes 1 to 19 and the sex chromosomes. This can be done with the keepSeqLevels function.

seqlevels(sigRegions)
 [1] "chr1"                 "chr2"                 "chr3"                 "chr4"                 "chr5"                
 [6] "chr6"                 "chr7"                 "chr8"                 "chr9"                 "chr10"               
[11] "chr11"                "chr12"                "chr13"                "chr14"                "chr15"               
[16] "chr16"                "chr17"                "chr18"                "chr19"                "chrX"                
[21] "chrY"                 "chrM"                 "chr1_GL456210_random" "chr1_GL456211_random" "chr1_GL456212_random"
[26] "chr1_GL456213_random" "chr1_GL456221_random" "chr4_GL456216_random" "chr4_GL456350_random" "chr4_JH584292_random"
[31] "chr4_JH584293_random" "chr4_JH584294_random" "chr4_JH584295_random" "chr5_GL456354_random" "chr5_JH584296_random"
[36] "chr5_JH584297_random" "chr5_JH584298_random" "chr5_JH584299_random" "chr7_GL456219_random" "chrX_GL456233_random"
[41] "chrY_JH584300_random" "chrY_JH584301_random" "chrY_JH584302_random" "chrY_JH584303_random" "chrUn_GL456239"      
[46] "chrUn_GL456359"       "chrUn_GL456360"       "chrUn_GL456366"       "chrUn_GL456367"       "chrUn_GL456368"      
[51] "chrUn_GL456370"       "chrUn_GL456372"       "chrUn_GL456378"       "chrUn_GL456379"       "chrUn_GL456381"      
[56] "chrUn_GL456382"       "chrUn_GL456383"       "chrUn_GL456385"       "chrUn_GL456387"       "chrUn_GL456389"      
[61] "chrUn_GL456390"       "chrUn_GL456392"       "chrUn_GL456393"       "chrUn_GL456394"       "chrUn_GL456396"      
[66] "chrUn_JH584304"      
sigRegions <- keepSeqlevels(sigRegions, 
                            value = paste0("chr", c(1:19,"X","Y")),
                            pruning.mode="tidy")
seqlevels(sigRegions)
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15"
[16] "chr16" "chr17" "chr18" "chr19" "chrX"  "chrY" 

Add metadata to GRanges object

A useful propery of GenomicRanges is that we can attach metadata to each range using the mcols function. The metadata can be supplied in the form of a data frame.

mcols(sigRegions) <- sigGenes[match(names(sigRegions), rownames(sigGenes)), ]
sigRegions
GRanges object with 4392 ranges and 8 metadata columns:
         seqnames              ranges strand |             logFC             logCPM               LR               PValue
            <Rle>           <IRanges>  <Rle> |         <numeric>          <numeric>        <numeric>            <numeric>
  497097     chr1     3214482-3671498      - | -10.9472399580736    2.5236514829963  23.590694348389 1.19162399345158e-06
   20671     chr1     4490928-4497354      - | -2.67313117621058   1.24186399534385 10.5872405573132  0.00113870810229744
   58175     chr1     4909576-5070285      - |  4.47143440821072   1.12401146729236 14.3734409590833 0.000149901775017694
   76187     chr1     9548046-9577968      + |   3.0330034607647   2.40710126232655 8.47563041277367  0.00359935626571388
   72481     chr1     9560833-9631092      - |  2.13661774722519 -0.247275168961162 6.13758113260831   0.0132338222037952
     ...      ...                 ...    ... .               ...                ...              ...                  ...
  195727     chrX 161836430-162159441      - | -4.69261763279776   3.10777653736015 13.9121195008167 0.000191559297917662
  108012     chrX 163909017-163933666      + | -2.17699584124451   2.32098302864258 10.2017044061552  0.00140310912515326
   56078     chrX 163976822-164028010      - |   2.5883614796138    5.1331378732911 6.92309295315483  0.00850896813252038
   54156     chrX 166523007-166585716      - | -6.96434996075043   4.39157591753061 19.4442035074456 1.03581719496295e-05
  333605     chrX 167471306-168577233      - | -3.89607089373906  0.656193636646942 6.91789090661196  0.00853375663282725
                          FDR    ENTREZID        SYMBOL                                           GENENAME
                    <numeric> <character>   <character>                                        <character>
  497097 0.000437796062030272      497097          Xkr4                  X-linked Kx blood group related 4
   20671  0.00785535505018016       20671         Sox17              SRY (sex determining region Y)-box 17
   58175  0.00201964846750182       58175         Rgs20                regulator of G-protein signaling 20
   76187   0.0182496716148034       76187        Adhfe1          alcohol dehydrogenase, iron containing, 1
   72481   0.0468231044850456       72481 2610203C22Rik                         RIKEN cDNA 2610203C22 gene
     ...                  ...         ...           ...                                                ...
  195727  0.00234386623585872      195727           Nhs                     NHS actin remodeling regulator
  108012  0.00917069338871881      108012         Ap1s2 adaptor-related protein complex 1, sigma 2 subunit
   56078    0.033890053519746       56078         Car5b               carbonic anhydrase 5b, mitochondrial
   54156 0.000520513371686814       54156         Egfl6                        EGF-like-domain, multiple 6
  333605   0.0339270997902079      333605        Frmpd4                   FERM and PDZ domain containing 4
  -------
  seqinfo: 21 sequences from mm10 genome

Scores and colour on exported tracks

The .bed file format is commonly used to store genomic locations for display in genome browsers (e.g. the UCSC browser or IGV) as tracks. Rather than just representing the genomic locations, the .bed format is also able to colour each range according to some property of the analysis (e.g. direction and magnitude of change) to help highlight particular regions of interest. A score can also be displayed when a particular region is clicked-on.

For the score we can use the \(-log_{10}\) of the adjusted p-value and colour scheme for the regions based on the fold-change

colorRampPalette is a useful function in base R for constructing a palette between two extremes. When choosing colour palettes, make sure they are colour blind friendly. The red / green colour scheme traditionally-applied to microarrays is a bad choice.

We will also truncate the fold-changes to between -5 and 5 to and divide this range into 10 equal bins

rbPal <- colorRampPalette(c("red", "blue"))
logfc <- pmax(sigRegions$logFC, -5)
logfc <- pmin(logfc , 5)
Cols <- rbPal(10)[as.numeric(cut(logfc, breaks = 10))]

The colours and score have to be saved in the GRanges object as score and itemRgb columns respectively, and will be used to construct the browser track. The rtracklayer package can be used to import and export browsers tracks.

Now we can export the signifcant results from the DE analysis as a .bed track using rtracklayer. You can load the resulting file in IGV, if you wish.

mcols(sigRegions)$score <- -log10(sigRegions$FDR)
mcols(sigRegions)$itemRgb <- Cols
sigRegions
GRanges object with 4392 ranges and 10 metadata columns:
         seqnames              ranges strand |             logFC             logCPM               LR               PValue
            <Rle>           <IRanges>  <Rle> |         <numeric>          <numeric>        <numeric>            <numeric>
  497097     chr1     3214482-3671498      - | -10.9472399580736    2.5236514829963  23.590694348389 1.19162399345158e-06
   20671     chr1     4490928-4497354      - | -2.67313117621058   1.24186399534385 10.5872405573132  0.00113870810229744
   58175     chr1     4909576-5070285      - |  4.47143440821072   1.12401146729236 14.3734409590833 0.000149901775017694
   76187     chr1     9548046-9577968      + |   3.0330034607647   2.40710126232655 8.47563041277367  0.00359935626571388
   72481     chr1     9560833-9631092      - |  2.13661774722519 -0.247275168961162 6.13758113260831   0.0132338222037952
     ...      ...                 ...    ... .               ...                ...              ...                  ...
  195727     chrX 161836430-162159441      - | -4.69261763279776   3.10777653736015 13.9121195008167 0.000191559297917662
  108012     chrX 163909017-163933666      + | -2.17699584124451   2.32098302864258 10.2017044061552  0.00140310912515326
   56078     chrX 163976822-164028010      - |   2.5883614796138    5.1331378732911 6.92309295315483  0.00850896813252038
   54156     chrX 166523007-166585716      - | -6.96434996075043   4.39157591753061 19.4442035074456 1.03581719496295e-05
  333605     chrX 167471306-168577233      - | -3.89607089373906  0.656193636646942 6.91789090661196  0.00853375663282725
                          FDR    ENTREZID        SYMBOL                                           GENENAME            score
                    <numeric> <character>   <character>                                        <character>        <numeric>
  497097 0.000437796062030272      497097          Xkr4                  X-linked Kx blood group related 4 3.35872814922358
   20671  0.00785535505018016       20671         Sox17              SRY (sex determining region Y)-box 17 2.10483418072595
   58175  0.00201964846750182       58175         Rgs20                regulator of G-protein signaling 20 2.69472421565727
   76187   0.0182496716148034       76187        Adhfe1          alcohol dehydrogenase, iron containing, 1 1.73874494584652
   72481   0.0468231044850456       72481 2610203C22Rik                         RIKEN cDNA 2610203C22 gene  1.3295397949105
     ...                  ...         ...           ...                                                ...              ...
  195727  0.00234386623585872      195727           Nhs                     NHS actin remodeling regulator 2.63006717707634
  108012  0.00917069338871881      108012         Ap1s2 adaptor-related protein complex 1, sigma 2 subunit  2.0375978264336
   56078    0.033890053519746       56078         Car5b               carbonic anhydrase 5b, mitochondrial 1.46992774531689
   54156 0.000520513371686814       54156         Egfl6                        EGF-like-domain, multiple 6 3.28356810923663
  333605   0.0339270997902079      333605        Frmpd4                   FERM and PDZ domain containing 4 1.46945326381326
             itemRgb
         <character>
  497097     #FF0000
   20671     #C60038
   58175     #0000FF
   76187     #1C00E2
   72481     #3800C6
     ...         ...
  195727     #FF0000
  108012     #C60038
   56078     #3800C6
   54156     #FF0000
  333605     #E2001C
  -------
  seqinfo: 21 sequences from mm10 genome
library(rtracklayer)
export(sigRegions , con = "topHits.bed")

Additional Materials

Extracting Reads

As we have been using counts as our starting point, we haven’t investigated the aligned reads from our experiment, and how they are represented. As you may be aware, aligned reads are usually stored in a bam file that can be manipulated with open-source command-line tools such as samtools and picard. Bioconductor provide a low-level interface to data/bam/sam files in the form of the Rsamtools package. The GenomicAlignments package can also be used to retrieve the reads mapping to a particular genomic region in an efficient manner.

library(GenomicAlignments)

In the directory bams_small there should be .bam files for some of the samples in the example study. The workflow to produce these files is described in a supplmentary page for the course. In brief, the raw reads (fastq) were downloaded from the Short Read Archive (SRA) and aligned with hisat2. Each bam file was named according to the file name in SRA, but we have renamed the files according to their name in the study. An index file (.bai) has been generated for each bam file. In order to reduce the size, the bam files used here only contain a subset of the reads that were aligned in the region chr15:101707000-101713000.

list.files("data/bams_small/")
[1] "MCL1.DG.small.bam"     "MCL1.DG.small.bam.bai" "MCL1.DH.small.bam"     "MCL1.DH.small.bam.bai" "MCL1.LA.small.bam"    
[6] "MCL1.LA.small.bam.bai" "MCL1.LB.small.bam"     "MCL1.LB.small.bam.bai"

The readGAlignments function provides a simple interface to interrogate the aligned reads for a particular sample. It can also utilise the index file in order to retrieve only the reads that correspond to a specific region in an efficient manner. The output includes the genomic location of each aligned read and the CIGAR (Compact Idiosyncratic Gapped Alignment Report); where M denotes an match to the genome and I, D correspond to insertions and deletions.

generegion <- exo[["110308"]]
my.reads <- readGAlignments(file="data/bams_small/MCL1.DG.small.bam",
                       param=ScanBamParam(which=generegion))
my.reads
GAlignments object with 87191 alignments and 0 metadata columns:
          seqnames strand                   cigar    qwidth     start       end     width     njunc
             <Rle>  <Rle>             <character> <integer> <integer> <integer> <integer> <integer>
      [1]    chr15      -           20M181102N80M       100 101527773 101708974    181202         1
      [2]    chr15      - 3M814N34M1...35M724N13M       100 101556357 101708888    152532         4
      [3]    chr15      -  11S10M88694N51M394N28M       100 101618968 101708144     89177         2
      [4]    chr15      - 27M263561N27M1611N45M1S       100 101627088 101892358    265271         2
      [5]    chr15      - 23M19128N2...9M64743N1M       100 101627092 101711603     84512         3
      ...      ...    ...                     ...       ...       ...       ...       ...       ...
  [87187]    chr15      +                    100M       100 101712864 101712963       100         0
  [87188]    chr15      +                   3S97M       100 101712865 101712961        97         0
  [87189]    chr15      -                    100M       100 101712876 101712975       100         0
  [87190]    chr15      -                    100M       100 101712883 101712982       100         0
  [87191]    chr15      +                   1S99M       100 101712888 101712986        99         0
  -------
  seqinfo: 66 sequences from an unspecified genome

It is possible to tweak the function to retrieve other potentially-useful information from the bam file, such as the mapping quality and flag.

my.reads <- readGAlignments(file="data/bams_small//MCL1.DG.small.bam",
                       param=ScanBamParam(which=generegion,
                                          what=c("seq","mapq","flag")))
my.reads
GAlignments object with 87191 alignments and 3 metadata columns:
          seqnames strand                   cigar    qwidth     start       end     width     njunc |                     seq
             <Rle>  <Rle>             <character> <integer> <integer> <integer> <integer> <integer> |          <DNAStringSet>
      [1]    chr15      -           20M181102N80M       100 101527773 101708974    181202         1 | CATGAGCTCC...CAGCAGCCTG
      [2]    chr15      - 3M814N34M1...35M724N13M       100 101556357 101708888    152532         4 | ATACTTGGTC...CACTCCTCGC
      [3]    chr15      -  11S10M88694N51M394N28M       100 101618968 101708144     89177         2 | CACAAACCCC...CTCCTTCCCC
      [4]    chr15      - 27M263561N27M1611N45M1S       100 101627088 101892358    265271         2 | CAGCAGGGCC...CTTGATCTGC
      [5]    chr15      - 23M19128N2...9M64743N1M       100 101627092 101711603     84512         3 | AGGGCCCACT...ATCTGCTCCC
      ...      ...    ...                     ...       ...       ...       ...       ...       ... .                     ...
  [87187]    chr15      +                    100M       100 101712864 101712963       100         0 | CGAGGTCAGC...GCAGAGGAGC
  [87188]    chr15      +                   3S97M       100 101712865 101712961        97         0 | GGTGAGGTCA...GGGCAGAGGA
  [87189]    chr15      -                    100M       100 101712876 101712975       100         0 | CGTTCAACAG...TCGAGCTGTG
  [87190]    chr15      -                    100M       100 101712883 101712982       100         0 | CAGGACGCTG...GTGAATGCTT
  [87191]    chr15      +                   1S99M       100 101712888 101712986        99         0 | GCGCTGTGGG...ATGATTAGTG
               mapq      flag
          <integer> <integer>
      [1]        60      1040
      [2]        60      1040
      [3]        60      1040
      [4]         0        16
      [5]         0       272
      ...       ...       ...
  [87187]        60         0
  [87188]        60      1024
  [87189]        60        16
  [87190]        60        16
  [87191]        60         0
  -------
  seqinfo: 66 sequences from an unspecified genome

The flag can represent useful QC information. e.g.

  • Read is unmapped
  • Read is paired / unpaired
  • Read failed QC
  • Read is a PCR duplicate (see later)

The combination of any of these properties is used to derive a numeric value, as illustrated in this useful resource

Particular attributes of the reads can be extracted and visualised

hist(mcols(my.reads)$mapq, main="", xlab="MAPQ")

However, there are more-sophisticated visualisation options for aligned reads and range data. We will use the ggbio package, which first requires some discussion of the ggplot2 plotting package.

Brief Introduction to ggplot2

The ggplot2 package has emerged as an attractive alternative to the traditional plots provided by base R. A full overview of all capabilities of the package is available from the cheatsheet.

A simple scatter plot, equivalent to plotSmear from before, can be generated as follows:-

library(ggplot2)
ggplot(results, aes(x = logCPM, y=logFC)) + geom_point() 

In brief:-

  • results is our data frame containing the variables we wish to plot
  • aes creates a mpping between the variables in our data frame to the aesthetic proprties of the plot
    • the x-axis is mapped to logCPM, y-axis is mapped to logFC
  • geom_point specifies the particular type of plot we want (in this case a scatter plot)

The real advantage of ggplot2 is the ability to change the appearance of our plot by mapping other variables to aspects of the plot. For example, we could colour the points based on a p-value cut-off. The colours are automatically chosen by ggplot2, but we can specifiy particular values.

ggplot(results, aes(x = logCPM, y=logFC, fill=FDR < 0.05)) + 
    geom_point(pch=21)

ggplot(results, aes(x = logCPM, y=logFC, fill=FDR < 0.05)) + 
    geom_point(alpha=0.4, pch=21) + 
    scale_fill_manual(values=c("yellow","red"))

The volcano plot can be constructed in a similar manner

ggplot(results, aes(x = logFC, y=-log10(FDR))) + 
    geom_point(aes(fill=FDR < 0.05), pch=21, size=2)

Composing plots with ggbio

We will now take a brief look at one of the visualisation packages in Bioconductor that takes advantage of the GenomicRanges and GenomicFeatures object-types. In this section we will show a worked example of how to combine several types of genomic data on the same plot. The documentation for ggbio is very extensive and contains lots of examples.

http://www.tengfei.name/ggbio/docs/

The Gviz package is another Bioconductor package that specialising in genomic visualisations, but we will not explore this package in the course.

The Manhattan plot is a common way of visualising genome-wide results, especially when one is concerned with the results of a GWAS study and identifying strongly-associated hits.

The profile is supposed to resemble the Manhattan skyline with particular skyscrapers towering about the lower level buildings.

This type of plot is implemented as the plotGrandLinear function. We have to supply a value to display on the y-axis using the aes function, which is inherited from ggplot2. The positioning of points on the x-axis is handled automatically by ggbio, using the ranges information to get the genomic coordinates of the ranges of interest.

To stop the plots from being too cluttered we will consider the top 200 genes only.

library(ggbio)
Need specific help about ggbio? try mailing 
 the maintainer or visit http://tengfei.github.com/ggbio/

Attaching package: 'ggbio'

The following objects are masked from 'package:ggplot2':

    geom_bar, geom_rect, geom_segment, ggsave, stat_bin, stat_identity, xlim
top200 <- sigRegions[order(sigRegions$LR,decreasing = TRUE)[1:200]]
plotGrandLinear(top200 , aes(y = logFC))
using coord:genome to parse x scale

ggbio has alternated the colours of the chromosomes. However, an appealing feature of ggplot2 is the ability to map properties of your plot to variables present in your data. For example, we could create a variable to distinguish between up- and down-regulated genes. The variables used for aesthetic mapping must be present in the mcols section of your ranges object.

mcols(top200)$UpRegulated <- mcols(top200)$logFC > 0
plotGrandLinear(top200, aes(y = logFC, col = UpRegulated))
using coord:genome to parse x scale

plotGrandLinear is a special function in ggbio with preset options for the manhattan style of plot. More often, users will call the autoplot function and ggbio will choose the most appropriate layout. One such layout is the karyogram.

autoplot(top200,layout="karyogram",aes(color=UpRegulated,
                                       fill=UpRegulated))
Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

ggbio is also able to plot the structure of genes according to a particular model represented by a GenomicFeatures object, such as the object we created earlier with the exon coordinates for each gene in the mm10 genome.

autoplot(tx, which=exo[["110308"]])
Parsing transcripts...
Parsing exons...
Parsing cds...
Parsing utrs...
------exons...
------cdss...
------introns...
------utr...
aggregating...
Done
Constructing graphics...

We can even plot the location of sequencing reads if they have been imported using readGAlignments function (or similar).

myreg <- flank(reduce(exo[["110308"]]), 1000, both = T)
bam <- readGappedReads(file="data/bams_small/MCL1.DG.small.bam",
                       param=ScanBamParam(which=myreg),use.names = TRUE)
autoplot(bam, geom = "rect") + 
    xlim(GRanges("chr15", IRanges(101707000, 101713000)))
extracting information...

Like ggplot2, ggbio plots can be saved as objects that can later be modified, or combined together to form more complicated plots. If saved in this way, the plot will only be displayed on a plotting device when we query the object. This strategy is useful when we want to add a common element (such as an ideogram) to a plot composition and don’t want to repeat the code to generate the plot every time.

#idPlot <- plotIdeogram(genome = "mm10",subchr = "chr1")
#idPlot
geneMod <- autoplot(tx, which = myreg)
Parsing transcripts...
Parsing exons...
Parsing cds...
Parsing utrs...
------exons...
------cdss...
------introns...
------utr...
aggregating...
Done
Constructing graphics...
reads.MCL1.DG <- autoplot(bam, stat = "coverage")  + 
    xlim(GRanges("chr15", IRanges(101707000, 101713000))) +
    labs(title="MCL1.DG")
extracting information...
Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
tracks(mm10=geneMod, MCL1.DG=reads.MCL1.DG ) 

Challenge

Create tracks to compare the coverage of the gene Krt5 for the samples MCL1.DG, MCL1.DH, MCL1.LA and MCL1.LB

---
title: "RNA-seq Analysis in R"
subtitle: "Annotation and Visualisation of RNA-seq results"
author: "Stephane Ballereau, Mark Dunning, Oscar Rueda, Ashley Sawle"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_document:
    toc: yes
  html_notebook:
    toc: yes
minutes: 300
layout: page
bibliography: ref.bib
---

**Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law**
Based on the course [RNAseq analysis in R](http://combine-australia.github.io/2016-05-11-RNAseq/) delivered on May 11/12th 2016


```{r setup, include=FALSE}
library(edgeR)
```

Before starting this section, we will make sure we have all the relevant objects from the Differential Expression analysis present.

```{r loadData}
load("Robjects/DE.Rdata")
```

# Overview

- Visualising DE results
- Getting annotation
- Retrieving gene models
- Exporting browser traecks
- Visualising results with respect to genomic location

# Main Materials

## Create a table of differential expression results

We have a list of significantly differentially expressed genes, but the only annotation we can see is the Entrez Gene ID, which is not very informative. 

```{r createDeTable}
results <- as.data.frame(topTags(lrt_BvsL, n = Inf))
head(results)
```

### Visualise results with a 'Smear plot'

`edgeR` provides a function `plotSmear` that allows us to visualise the results of a DE analysis. 
This plot shows the log-fold change against log-counts per million, with DE genes highlighted:

```{r plotSmear}
# identify differentially expressed genes
de <- decideTestsDGE(lrt_BvsL)
summary(de)

# create a vector of differentially expressed genes
detags <- rownames(dgeObj)[as.logical(de)]

# plot smear highlighting DE genes
plotSmear(lrt_BvsL, de.tags = detags)
```
However, on such a plot it would be nice to add labels to highlight the genes with most evidence for being DE, or our favourite genes. To perform such a task we need to map between the identifiers we have in the `edgeR` output and more familiar names.

Finally, we will look at sophisticated visualisations that allow us to incorporate information about the structure of a gene, level of sequencing coverage.

## Adding annotation to the edgeR results

There are a number of ways to add annotation, but we will demonstrate how to do this using the *org.Mm.eg.db* package. This package is one of several *organism-level* packages which are re-built every 6 months. These packages are listed on the [annotation section](http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData) of the Bioconductor, and are installed in the same way as regular Bioconductor packages. An alternative approach is to use `biomaRt`, an interface to the [BioMart](http://www.biomart.org/) resource. BioMart is much more comprehensive, but the organism packages fit better into the Bioconductor workflow.


```{r installingDBs, eval=FALSE}
# source("http://www.bioconductor.org/biocLite.R")
# # for mouse
# biocLite("org.Mm.eg.db")
# #For Human
# biocLite("org.Hs.eg.db")
```

The packages are larger in size that Bioconductor software pacakges, but essentially they are databases that can be used to make *offline* queries. 

```{r loadDB, message=FALSE}
library(org.Mm.eg.db)
```


First we need to decide what information we want. In order to see what we can extract we can run the `columns` function on the annotation database.

```{r checkColumns}
columns(org.Mm.eg.db)
```

We are going to filter the database by a key or set of keys in order to extract the information we want. Valid names for the key can be retrieved with the `keytypes` function.

```{r checkKeytypes}
keytypes(org.Mm.eg.db)
```

We should see `ENTREZID`, which is the type of key we are going to use in this case. If we are unsure what values are acceptable for the key, we can check what keys are valid with `keys`

```{r viewENTREZkey}
keys(org.Mm.eg.db, keytype="ENTREZID")[1:10]
```

It is a useful sanity check to make sure that the keys you want to use are all 
valid. We could use `%in%` in this case.

```{r compareKeys}
## Build up the query step-by-step
my.keys <- rownames(results)[1:10]
my.keys %in% keys(org.Mm.eg.db, keytype="ENTREZID")
all(my.keys %in% keys(org.Mm.eg.db, keytype="ENTREZID"))
```

To annotate our results, we definitely want gene symbols and perhaps the full gene name. Let's build up our annotation information in a separate data frame using the `select` function.

```{r makeAnnotation}
annot <- select(org.Mm.eg.db,
              keys=rownames(results), 
              columns=c("ENTREZID","SYMBOL","GENENAME"))

# Have a look at the annotation
head(annot)
```

Let's double check that the `ENTREZID` column matches exactly to our `results` rownames.

```{r checkAnnotation}
table(annot$ENTREZID==rownames(results))
```

We can bind in the annotation information to the `results` data frame. (Please note that if the `select` function returns a 1:many mapping then you can't just append the annotation to the fit object.)

```{r annotateResults}
results.annotated <- cbind(results, annot)
head(results.annotated)
```


We can save the results table using the `write.csv` function, which writes the results out to a csv file that you can open in excel.

```{r outputDEtable}
write.csv(results.annotated,
          file="B.PregVsLacResults.csv",
          row.names=FALSE)
```

**A note about deciding how many genes are significant**: In order to decide which genes are differentially expressed, we usually take a cut-off of 0.05 on the adjusted p-value, NOT the raw p-value. This is because we are testing more than 15000 genes, and the chances of finding differentially expressed genes is very high when you do that many tests. Hence we need to control the false discovery rate, which is the adjusted p-value column in the results table. What this means is that if 100 genes are significant at a 5\% false discovery rate, we are willing to accept that 5 will be false positives. Note that the `decideTests` function displays significant genes at 5\% FDR.

> ### Challenge {.challenge}
>
> Re-visit the `plotSmear` plot from above and use the `text` function to add labels for the names of the top 20 most DE genes
>

```{r solutionChallenge1,echo=FALSE,fig.height=5,fig.width=10}

```

## Further visualisation

### Volcano plot

Another common visualisation is the 
[*volcano plot*](https://en.wikipedia.org/wiki/Volcano_plot_(statistics)) which 
displays a measure of significance on the y-axis and fold-change on the x-axis. 

```{r volcanoPlot,fig.height=5,fig.width=10}

signif <- -log10(results.annotated$FDR)

plot(results.annotated$logFC, 
     signif,
     bg="black",
     pch=21,
     xlab="log2(Fold Change)")
points(results.annotated[detags,"logFC"],
       -log10(results.annotated[detags,"FDR"]),
       pch=21,
       bg="red")

```

### Strip chart for gene expression

Before following up on the DE genes with further lab work, a recommended *sanity check* is to have a look at the expression levels of the individual samples for the genes of interest. We can quickly look at grouped expression using `stripchart`. We can use the normalised log expression values in the  `dgeCounts` object (`dgeCounts$counts`).

```{r geneCountStripchart, fig.width=5, fig.height=5, fig.align="center"}
normCounts <- cpm(dgeObj, log = T)

# Let's look at the first gene in the topTable, Krt5, which has a rowname 110308
par(mar=c(8,4,2,2)) #adjust the plot margins the x-labels are visible - see ?par
stripchart(normCounts["110308",]~sampleinfo$Group,
           xlab="",
           ylab="log2(Counts)",
           vertical=TRUE,
           las=2,
           pch=21,
           col=1:6,
           cex=2)
```

### Interactive StripChart with Glimma

An interactive version of the volcano plot above that includes the raw per sample values in a separate panel is possible via the `glXYPlot` function in the *Glimma* package.


```{r}
library(Glimma)
group <- sampleinfo$Group
levels(group) <- c("basal.lactate","basal.preg","basal.virgin","lum.lactate", "lum.preg", "lum.virgin")
annot.mod <- annot
colnames(annot.mod)[1] <- "GeneID"
de <- as.numeric(results$FDR<=0.01)
glXYPlot(x=results$logFC, y=-log10(results$FDR),
         xlab="logFC", ylab="B", main="B.PregVsLac",
         counts=normCounts, groups=group, status=de,
         anno=annot.mod, id.column="ENTREZID", folder="volcano")
```

This function creates an html page (./volcano/XY-Plot.html) with a volcano plot on the left and a plot showing the log-CPM per sample for a selected gene on the right. A search bar is available to search for genes of interest.

## Retrieving Genomic Locations

It might seem natural to add genomic locations to our annotation table, and possibly a bit odd that the `org.Mm.eg.db` package does not supply such mappings. In fact, there is a whole suite of package for performing this, and more-advanced queries that relate to the location of genes. These are listed on the Bioconductor [annotation page](http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData) and have the prefix `TxDb.`

The package we will be using is `TxDb.Mmusculus.UCSC.mm10.knownGene`. Packages are available for other organisms and genome builds. It is even possible to *build your own database* if one does not exist. See `vignette("GenomicFeatures")` for details

```{r installTxDb, eval=FALSE}
# source("http://www.bioconductor.org/biocLite.R")
## For mouse
# biocLite("TxDb.Mmusculus.UCSC.mm10.knownGene")
## For Humans
# biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")
```

We load the library in the usual fashion and create a new object to save some typing. As with the `org.` packages, we can query what columns are available with `columns`,

```{r loadTxDB, message=FALSE}
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
tx <- TxDb.Mmusculus.UCSC.mm10.knownGene
columns(tx)
```

The `select` function is used in the same manner as the `org.Mm.eg.db` packages. 


> ### Challenge 2 {.challenge}
>
> Use the TxDb.Mmusculus.UCSC.mm10.knownGene package to retrieve the exon coordinates for the genes `50916`, `110308`, `12293` 
>

```{r solutionChallenge2, echo=FALSE, warning=FALSE, message=FALSE}




```

## Overview of GenomicRanges

One of the real strengths of the `txdb..` packages is the ability of interface with `GenomicRanges`, which is the object type used throughout Bioconductor [to manipulate Genomic Intervals](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3738458/pdf/pcbi.1003118.pdf). 

These object types permit us to perform common operations on intervals such as overlapping and counting. We can define the chromosome, start and end position of each region (also strand too, but not shown here).

```{r simpleGR}
library(GenomicRanges)
simple_range <- GRanges(seqnames = "1", ranges = IRanges(start=1000, end=2000))
simple_range

```

We don't have to have all our ranges located on the same chromosome

```{r grForThreeGenes}
chrs <- c("chr13", "chr15", "chr5")
start <- c(73000000, 101000000, 15000000)
end <- c(74000000, 102000000, 16000000)

my_ranges <- GRanges(seqnames = rep(chrs, 3),
                     ranges = IRanges(start = rep(start, each = 3),
                                      end = rep(end, each = 3))
                     )
my_ranges
```

There are a number of useful functions for calculating properties of the data (such as *coverage* or sorting). Not so much for RNA-seq analysis, but `GenomicRanges` are used throughout Bioconductor for the analysis of NGS data. 

For instance, we can quickly identify overlapping regions between two `GenomicRanges`. 

```{r findOverlaps}
keys <- c("50916", "110308", "12293")
genePos <- select(tx,
                  keys = keys,
                  keytype = "GENEID",
                  columns = c("EXONCHROM", "EXONSTART", "EXONEND")
                  )

geneRanges <- GRanges(genePos$EXONCHROM, 
                      ranges = IRanges(genePos$EXONSTART, genePos$EXONEND), 
                      GENEID = genePos$GENEID)
geneRanges

findOverlaps(my_ranges, geneRanges)
```

However, we have to pay attention to the naming convention used for each object. 
`seqlevelsStyle` can help.

```{r seqNamingStyle}
seqlevelsStyle(geneRanges)
seqlevelsStyle(my_ranges)
seqlevelsStyle(simple_range)
```


### Exporting tracks

It is also possible to save the results of a Bioconductor analysis in a browser to enable interactive analysis and integration with other data types, or sharing with collaborators. For instance, we might want a browser track to indicate where our differentially-expressed genes are located. We shall use the `bed` format to display these locations. We will annotate the ranges with information from our analysis such as the fold-change and significance.

First we create a data frame for just the DE genes.
```{r tableOfDEGenes}
sigGenes <- results.annotated[detags,]
message("Number of significantly DE genes: ", nrow(sigGenes))
head(sigGenes)
```

### Create a genomic ranges object

Several convenience functions exist to retrieve the structure of every gene from
a given TxDb object in one list. The output of `exonsBy` is a list, where each 
item in the list is the exon co-ordinates of a particular gene, however, we do 
not need this level of granularity for the bed output, so we will collapse to a 
single region for each gene. 

First we use the `range` function to obtain a single range for every gene and 
tranform to a more convenient object with `unlist`.

```{r getGeneRanges}
exo <- exonsBy(tx, "gene")
exoRanges <- unlist(range(exo))
sigRegions <- exoRanges[na.omit(match(sigGenes$ENTREZID, names(exoRanges)))]
sigRegions
```

For visualisation purposes, we are going to restrict the data to genes that are located on chromosomes 1 to 19 and the sex chromosomes. This can be done with the `keepSeqLevels` function.

```{r trimSequences}
seqlevels(sigRegions)
sigRegions <- keepSeqlevels(sigRegions, 
                            value = paste0("chr", c(1:19,"X","Y")),
                            pruning.mode="tidy")
seqlevels(sigRegions)
```

### Add metadata to GRanges object

A useful propery of GenomicRanges is that we can attach *metadata* to each range using the `mcols`
function. The metadata can be supplied in the form of a data frame.

```{r addDEResults}
mcols(sigRegions) <- sigGenes[match(names(sigRegions), rownames(sigGenes)), ]
sigRegions
```

### Scores and colour on exported tracks

The `.bed` file format is commonly used to store genomic locations for display 
in genome browsers (e.g. the UCSC browser or IGV) as tracks. Rather than just 
representing the genomic locations, the `.bed` format is also able to colour 
each range according to some property of the analysis (e.g. direction and 
magnitude of change) to help highlight particular regions of interest. A score
can also be displayed when a particular region is clicked-on.

For the score we can use the $-log_{10}$ of the adjusted p-value and 
colour scheme for the regions based on the fold-change

`colorRampPalette` is a useful function in base R for constructing a palette between two extremes. **When choosing colour palettes, make sure they are colour blind friendly**. The red / green colour scheme traditionally-applied to microarrays is a ***bad*** choice.

We will also truncate the fold-changes to between -5 and 5 to and divide this range into 10 equal bins

```{r}
rbPal <- colorRampPalette(c("red", "blue"))
logfc <- pmax(sigRegions$logFC, -5)
logfc <- pmin(logfc , 5)

Cols <- rbPal(10)[as.numeric(cut(logfc, breaks = 10))]
```

The colours and score have to be saved in the GRanges object as `score` and `itemRgb` columns respectively, and will be used to construct the browser track. The rtracklayer package can be used to import and export browsers tracks.

Now we can export the signifcant results from the DE analysis as a `.bed` track using `rtracklayer`. You can load the resulting file in IGV, if you wish.

```{r}
mcols(sigRegions)$score <- -log10(sigRegions$FDR)
mcols(sigRegions)$itemRgb <- Cols
sigRegions

library(rtracklayer)
export(sigRegions , con = "topHits.bed")
```

# Additional Materials

## Extracting Reads

As we have been using counts as our starting point, we haven't investigated the aligned reads from our experiment, and how they are represented. As you may be aware, aligned reads are usually stored in a *bam* file that can be manipulated with open-source command-line tools such as [*samtools*](http://www.htslib.org/) and [*picard*](https://broadinstitute.github.io/picard/). Bioconductor provide a low-level interface to data/bam/sam files in the form of the `Rsamtools` package. The `GenomicAlignments` package can also be used to retrieve the reads mapping to a particular genomic region in an efficient manner.

```{r message=FALSE}
library(GenomicAlignments)
```

In the directory `bams_small` there should be `.bam` files for some of the samples in the example study. The workflow to produce these files is described in a [supplmentary page](../Supplementary_Materials/S1_Getting_raw_reads_from_SRA.nb.html) for the course. In brief, the raw reads (`fastq`) were downloaded from the Short Read Archive (SRA) and aligned with `hisat2`. Each bam file was named according to the file name in SRA, but we have renamed the files according to their name in the study. An index file (`.bai`) has been generated for each bam file. In order to reduce the size, the bam files used here only contain a subset of the reads that were aligned in the region chr15:101707000-101713000.


```{r}
list.files("data/bams_small/")
```

The `readGAlignments` function provides a simple interface to interrogate the aligned reads for a particular sample. It can also utilise the *index* file in order to retrieve only the reads that correspond to a specific region in an efficient manner. The output includes the genomic location of each aligned read and the CIGAR (**C**ompact **I**diosyncratic **G**apped **A**lignment **R**eport); where *M* denotes an match to the genome and *I*, *D* correspond to insertions and deletions.

```{r}
generegion <- exo[["110308"]]

my.reads <- readGAlignments(file="data/bams_small/MCL1.DG.small.bam",
                       param=ScanBamParam(which=generegion))
my.reads
```

It is possible to tweak the function to retrieve other potentially-useful information from the bam file, such as the mapping quality and flag.

```{r}
my.reads <- readGAlignments(file="data/bams_small//MCL1.DG.small.bam",
                       param=ScanBamParam(which=generegion,
                                          what=c("seq","mapq","flag")))
my.reads
```

The flag can represent useful QC information. e.g.

  + Read is unmapped
  + Read is paired / unpaired
  + Read failed QC
  + Read is a PCR duplicate (see later)

The combination of any of these properties is used to derive a numeric value, as illustrated in this useful [resource](https://broadinstitute.github.io/picard/explain-flags.html)

Particular attributes of the reads can be extracted and visualised

```{r}
hist(mcols(my.reads)$mapq, main="", xlab="MAPQ")
```

However, there are more-sophisticated visualisation options for aligned reads and range data. We will use the `ggbio` package, which first requires some discussion of the `ggplot2` plotting package.


## Brief Introduction to ggplot2

The [`ggplot2`](http://ggplot2.tidyverse.org/) package has emerged as an attractive alternative to the traditional plots provided by base R. A full overview of all capabilities of the package is available from the [cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf).

A simple scatter plot, equivalent to `plotSmear` from before, can be generated as follows:-

```{r,fig.width=12,fig.height=5}
library(ggplot2)
ggplot(results, aes(x = logCPM, y=logFC)) + geom_point() 

```

In brief:-

- `results` is our data frame containing the variables we wish to plot
- `aes` creates a mpping between the variables in our data frame to the *aes*thetic proprties of the plot
    + the x-axis is mapped to `logCPM`, y-axis is mapped to `logFC`
- `geom_point` specifies the particular type of plot we want (in this case a scatter plot)
    + see [the cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) for other plot types

The real advantage of `ggplot2` is the ability to change the appearance of our plot by mapping other variables to aspects of the plot. For example, we could colour the points based on a p-value cut-off. The colours are automatically chosen by `ggplot2`, but we can specifiy particular values.

```{r,fig.width=12,fig.height=5}
ggplot(results, aes(x = logCPM, y=logFC, fill=FDR < 0.05)) + 
    geom_point(pch=21)

ggplot(results, aes(x = logCPM, y=logFC, fill=FDR < 0.05)) + 
    geom_point(alpha=0.4, pch=21) + 
    scale_fill_manual(values=c("yellow","red"))
```

The volcano plot can be constructed in a similar manner

```{r,fig.width=12, fig.height=5}
ggplot(results, aes(x = logFC, y=-log10(FDR))) + 
    geom_point(aes(fill=FDR < 0.05), pch=21, size=2)
```


## Composing plots with ggbio

We will now take a brief look at one of the visualisation packages in Bioconductor that takes advantage
of the GenomicRanges and GenomicFeatures object-types. In this section we will show a worked
example of how to combine several types of genomic data on the same plot. The documentation for
ggbio is very extensive and contains lots of examples.

http://www.tengfei.name/ggbio/docs/

The `Gviz` package is another Bioconductor package that specialising in genomic visualisations, but we
will not explore this package in the course.

The Manhattan plot is a common way of visualising genome-wide results, especially when one is concerned with the results of a GWAS study and identifying strongly-associated hits. 

The profile is supposed to resemble the Manhattan skyline with particular skyscrapers towering about the lower level buildings.

![](https://upload.wikimedia.org/wikipedia/commons/1/12/Manhattan_Plot.png)

This type of plot is implemented as the `plotGrandLinear` function. We have to supply a value to display on the y-axis using the `aes` function,
which is inherited from ggplot2. The positioning of points on the x-axis is handled automatically by
ggbio, using the ranges information to get the genomic coordinates of the ranges of interest.

To stop the plots from being too cluttered we will consider the top 200 genes only.

```{r,fig.width=12,fig.height=5}
library(ggbio)
top200 <- sigRegions[order(sigRegions$LR,decreasing = TRUE)[1:200]]

plotGrandLinear(top200 , aes(y = logFC))

```

`ggbio` has alternated the colours of the chromosomes. However, an appealing feature of `ggplot2` is the ability to map properties of your plot to variables present in your data. For example, we could create a variable to distinguish between up- and down-regulated genes. The variables used for aesthetic mapping must be present in the `mcols` section of your ranges object.

```{r,fig.width=12,fig.height=5}
mcols(top200)$UpRegulated <- mcols(top200)$logFC > 0

plotGrandLinear(top200, aes(y = logFC, col = UpRegulated))
```

`plotGrandLinear` is a special function in `ggbio` with preset options for the manhattan style of plot. More often, users will call the `autoplot` function and `ggbio` will choose the most appropriate layout. One such layout is the *karyogram*. 

```{r,fig.width=12,fig.height=5}

autoplot(top200,layout="karyogram",aes(color=UpRegulated,
                                       fill=UpRegulated))

```



`ggbio` is also able to plot the structure of genes according to a particular model represented by a `GenomicFeatures` object, such as the object we created earlier with the exon coordinates for each gene in the mm10 genome.


```{r}
autoplot(tx, which=exo[["110308"]])
```

We can even plot the location of sequencing reads if they have been imported using readGAlignments function (or similar).

```{r}
myreg <- flank(reduce(exo[["110308"]]), 1000, both = T)

bam <- readGappedReads(file="data/bams_small/MCL1.DG.small.bam",
                       param=ScanBamParam(which=myreg),use.names = TRUE)

autoplot(bam, geom = "rect") + 
    xlim(GRanges("chr15", IRanges(101707000, 101713000)))
```

Like ggplot2, ggbio plots can be saved as objects that can later be modified, or combined together to
form more complicated plots. If saved in this way, the plot will only be displayed on a plotting device
when we query the object. This strategy is useful when we want to add a common element (such as
an ideogram) to a plot composition and don’t want to repeat the code to generate the plot every time.

```{r}
#idPlot <- plotIdeogram(genome = "mm10",subchr = "chr1")
#idPlot
geneMod <- autoplot(tx, which = myreg)
reads.MCL1.DG <- autoplot(bam, stat = "coverage")  + 
    xlim(GRanges("chr15", IRanges(101707000, 101713000))) +
    labs(title="MCL1.DG")
tracks(mm10=geneMod, MCL1.DG=reads.MCL1.DG ) 
```

> ## Challenge {.challenge}
>
> Create tracks to compare the coverage of the gene Krt5 for the samples MCL1.DG, MCL1.DH, MCL1.LA and MCL1.LB
>

```{r,echo=FALSE,fig.height=5,fig.width=10}


```

