library(biomaRt)
library(DESeq2)
library(tidyverse)
package ‘dplyr’ was built under R version 3.5.1

Load in the datasets from the Annotation and Visualisation main session

load("../Course_Materials/Robjects/Annotated_Results_LvV.RData")

Working with Genomic Locations - Transcript database packages

There is a whole suite of annotation packages that can be used to access and for perform advanced queries on information about the genomic location of genes, trancripts and exons. These are listed on the Bioconductor annotation page and have the prefix TxDb. (where “tx” is “transcript”). In addition there are a large number of packages that make use of these annotations for downstream analyses and visualisations.

Unfortunately, these packages do not cover all species and tend only to be available for UCSC genomes. Thankfully, there is a way to build your own database from either a GTF file or from various online resources such as Biomart using the package GenomicFeatures.

Creating a TxDb database from a GTF file

The created database is only loaded into the current R session. You will need to run this command each time - it can be a little slow.

library(GenomicFeatures)
txMm <- makeTxDbFromGFF(file = "../Course_Materials/counts/Mus_musculus.GRCm38.80.gtf", format = "gtf")

Creating a TxDb database from Biomart

The created database is only loaded into the current R session. You will need to run this command each time.

library(GenomicFeatures)
txMm <- makeTxDbFromBiomart(dataset="mmusculus_gene_ensembl")

Creating a TxDb package from Biomart

This creates an R package that can be installed just like a package that you might download from Bioconductor or CRAN. This can then loaded as normal whenever it is needed, saving you having to build the database each time.

A little extra work is needed at the command line to build the package from the files produced by this method. Feel free to skip this section if you want - the previous two methods are adequate if you can tolderate the short wait each time you create the database.

library(GenomicFeatures)
makeTxDbPackageFromBiomart(version="0.01",
                           destDir = "~",
                           maintainer="Some One <so@someplace.org>",
                           author="Some One <so@someplace.com>",
                           dataset="mmusculus_gene_ensembl")

This creates a new folder in your home directory (~ is a shortcut for home).

We will not go into detail about how to contruct an R package, or the contents of the package directory. This method generates all the files you need. More information on contructing R packages can be found in Hadely Wickham’s “R Packages” book.

The directory created will be something like TxDb.Mmusculus.BioMart.ENSEMBLMARTENSEMBL.GRCm38.p6. This going to be the packaged name and is referenced in various files in the package directory. We recommend changing it to something more manageable such as TxDb.Mmusculus.Ens.GRCm38. We need to change the directory name, the database file name and each reference in the DESCRIPTION and man/package.Rd files.

To do this run the following chunk of code. This code is actually run in ‘bash’, the language used in the terminal rather than R. This shows how versatile RStudio is, you can run code from languages other than R when it is more convenient for you to do so.

OldName=TxDb.Mmusculus.BioMart.ENSEMBLMARTENSEMBL.GRCm38.p6
NewName=TxDb.Mmusculus.Ens.GRCm38

cd ~
# rename the package directory
mv ${OldName} ${NewName}

cd ${NewName}
# rename the database file
mv inst/extdata/${OldName}.sqlite inst/extdata/${NewName}.sqlite

# replace the references in the old directory
sed -i s/${OldName}/${NewName}/ DESCRIPTION 
sed -i s/${OldName}/${NewName}/ man/package.Rd 

cd ~
# Build the package from the directory of files created by the above command
R CMD build TxDb.Mmusculus.Ens.GRCm38
# Install the package from the tarball created
R CMD INSTALL TxDb.Mmusculus.Ens.GRCm38.p6_0.01.tar.gz
library(TxDb.Mmusculus.Ens.GRCm38)
txMm <- TxDb.Mmusculus.Ens.GRCm38

Retrieving information from TxDb packages

Accessing the information in these TxDb databases is similar to the way in which we accessed information using biomaRt except that filters (the information we are filtering on) are now called keys and attributes (things we want to retrieve) are columns.

First we need to decide what information we want. In order to see what we can extract we can run the columns function on the annotation database.

txMm <- txMm()
Error in txMm() : could not find function "txMm"

We are going to filter the database by a key or set of keys in order to extract the information we want. Valid names for the key can be retrieved with the keytypes function.

keytypes(txMm)
[1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"    
[7] "TXNAME"  

To extract information we use the select function. Let’s get transcript information for our most highly differentially expressed gene.

keyList <- ensemblAnnot$GeneID[ensemblAnnot$Symbol=="Wap"]
select(txMm, 
       keys=keyList,
       keytype = "GENEID",
       columns=c("TXNAME", "TXCHROM", "TXSTART", "TXEND", "TXSTRAND", "TXTYPE")
      )
'select()' returned 1:many mapping between keys and
columns

Challenge 3

Use the txMm to retrieve the exon coordinates for the genes: + ENSMUSG00000021604 + ENSMUSG00000022146 + ENSMUSG00000040118

Overview of GenomicRanges

One of the real strengths of the txdb.. databases is the ability to interface with GenomicRanges, which is the object type used throughout Bioconductor to manipulate Genomic Intervals.

These object types permit us to perform common operations on intervals such as overlapping and counting. We can define the chromosome, start and end position of each region (also strand too, but not shown here).

library(GenomicRanges)
simple_range <- GRanges(seqnames = "1", ranges = IRanges(start=1000, end=2000))
simple_range
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1 1000-2000      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

We don’t have to have all our ranges located on the same chromosome

chrs <- c("13", "15", "5")
start <- c(73000000, 6800000, 15000000)
end <- c(74000000, 6900000, 16000000)
my_ranges <- GRanges(seqnames = rep(chrs, 3),
                     ranges = IRanges(start = rep(start, each = 3),
                                      end = rep(end, each = 3))
                     )
my_ranges
GRanges object with 9 ranges and 0 metadata columns:
      seqnames            ranges strand
         <Rle>         <IRanges>  <Rle>
  [1]       13 73000000-74000000      *
  [2]       15 73000000-74000000      *
  [3]        5 73000000-74000000      *
  [4]       13   6800000-6900000      *
  [5]       15   6800000-6900000      *
  [6]        5   6800000-6900000      *
  [7]       13 15000000-16000000      *
  [8]       15 15000000-16000000      *
  [9]        5 15000000-16000000      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

There are a number of useful functions for calculating properties of the data (such as coverage or sorting). Not so much for RNA-seq analysis, but GenomicRanges are used throughout Bioconductor for the analysis of NGS data.

For instance, we can quickly identify overlapping regions between two GenomicRanges.

keys <- c("ENSMUSG00000021604", "ENSMUSG00000022146", "ENSMUSG00000040118")
genePos <- select(txMm,
                  keys = keys,
                  keytype = "GENEID",
                  columns = c("EXONCHROM", "EXONSTART", "EXONEND")
                  )
'select()' returned 1:many mapping between keys and
columns
geneRanges <- GRanges(genePos$EXONCHROM, 
                      ranges = IRanges(genePos$EXONSTART, genePos$EXONEND), 
                      GENEID = genePos$GENEID)
geneRanges
GRanges object with 96 ranges and 1 metadata column:
       seqnames            ranges strand |             GENEID
          <Rle>         <IRanges>  <Rle> |        <character>
   [1]       13 73260479-73260653      * | ENSMUSG00000021604
   [2]       13 73264848-73264979      * | ENSMUSG00000021604
   [3]       13 73265458-73265709      * | ENSMUSG00000021604
   [4]       13 73266596-73266708      * | ENSMUSG00000021604
   [5]       13 73267504-73267832      * | ENSMUSG00000021604
   ...      ...               ...    ... .                ...
  [92]        5 16327973-16329883      * | ENSMUSG00000040118
  [93]        5 16326151-16326383      * | ENSMUSG00000040118
  [94]        5 16340707-16341059      * | ENSMUSG00000040118
  [95]        5 16361395-16361875      * | ENSMUSG00000040118
  [96]        5 16362265-16362326      * | ENSMUSG00000040118
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
findOverlaps(my_ranges, geneRanges)
Hits object with 40 hits and 0 metadata columns:
       queryHits subjectHits
       <integer>   <integer>
   [1]         1           1
   [2]         1           2
   [3]         1           3
   [4]         1           4
   [5]         1           5
   ...       ...         ...
  [36]         9          36
  [37]         9          75
  [38]         9          84
  [39]         9          85
  [40]         9          87
  -------
  queryLength: 9 / subjectLength: 96

However, we have to pay attention to the naming convention used for each object. seqlevelsStyle can help.

seqlevelsStyle(simple_range)
[1] "NCBI"    "Ensembl" "MSU6"    "AGPvF"  
seqlevelsStyle(my_ranges)
[1] "NCBI"    "Ensembl" "JGI2.F" 
seqlevelsStyle(geneRanges)
[1] "NCBI"    "Ensembl" "JGI2.F" 

Exporting tracks

It is also possible to save the results of a Bioconductor analysis in a browser to enable interactive analysis and integration with other data types, or sharing with collaborators. For instance, we might want a browser track to indicate where our differentially-expressed genes are located. We shall use the bed format to display these locations. We will annotate the ranges with information from our analysis such as the fold-change and significance.

First we create a data frame for just the DE genes.

sigGenes <- filter(shrinkLvV, FDR <= 0.01)
message("Number of significantly DE genes: ", nrow(sigGenes))
Number of significantly DE genes: 4279
head(sigGenes)

Create a genomic ranges object

Several convenience functions exist to retrieve the structure of every gene from a given TxDb object in one list. The output of exonsBy is a list, where each item in the list is the exon co-ordinates of a particular gene, however, we do not need this level of granularity for the bed output, so we will collapse to a single region for each gene.

First we use the range function to obtain a single range for every gene and tranform to a more convenient object with unlist.

exoRanges <- exonsBy(txMm, "gene") %>% 
    range() %>% 
    unlist()
sigRegions <- exoRanges[na.omit(match(sigGenes$GeneID, names(exoRanges)))]
sigRegions
GRanges object with 4269 ranges and 0 metadata columns:
                     seqnames          ranges strand
                        <Rle>       <IRanges>  <Rle>
  ENSMUSG00000025903        1 4807788-4848410      +
  ENSMUSG00000103280        1 4905751-4906861      -
  ENSMUSG00000033793        1 5070018-5162529      +
  ENSMUSG00000051285        1 7088920-7173628      +
  ENSMUSG00000103509        1 7148110-7152137      +
                 ...      ...             ...    ...
  ENSMUSG00000064354       MT       7013-7696      +
  ENSMUSG00000064357       MT       7927-8607      +
  ENSMUSG00000064363       MT     10167-11544      +
  ENSMUSG00000064367       MT     11742-13565      +
  ENSMUSG00000064368       MT     13552-14070      -
  -------
  seqinfo: 139 sequences (1 circular) from an unspecified genome

For visualisation purposes, we are going to restrict the data to genes that are located on chromosomes 1 to 19 and the sex chromosomes. This can be done with the keepSeqLevels function.

seqlevels(sigRegions)
  [1] "CHR_CAST_EI_MMCHR11_CTG4"  "CHR_CAST_EI_MMCHR11_CTG5" 
  [3] "CHR_MG104_PATCH"           "CHR_MG117_PATCH"          
  [5] "CHR_MG132_PATCH"           "CHR_MG153_PATCH"          
  [7] "CHR_MG171_PATCH"           "CHR_MG184_PATCH"          
  [9] "CHR_MG190_MG3751_PATCH"    "CHR_MG191_PATCH"          
 [11] "CHR_MG209_PATCH"           "CHR_MG3172_PATCH"         
 [13] "CHR_MG3231_PATCH"          "CHR_MG3251_PATCH"         
 [15] "CHR_MG3490_PATCH"          "CHR_MG3496_PATCH"         
 [17] "CHR_MG3530_PATCH"          "CHR_MG3561_PATCH"         
 [19] "CHR_MG3562_PATCH"          "CHR_MG3609_PATCH"         
 [21] "CHR_MG3618_PATCH"          "CHR_MG3627_PATCH"         
 [23] "CHR_MG3648_PATCH"          "CHR_MG3656_PATCH"         
 [25] "CHR_MG3683_PATCH"          "CHR_MG3686_PATCH"         
 [27] "CHR_MG3699_PATCH"          "CHR_MG3700_PATCH"         
 [29] "CHR_MG3712_PATCH"          "CHR_MG3714_PATCH"         
 [31] "CHR_MG3829_PATCH"          "CHR_MG3833_MG4220_PATCH"  
 [33] "CHR_MG3835_PATCH"          "CHR_MG3836_PATCH"         
 [35] "CHR_MG3999_PATCH"          "CHR_MG4136_PATCH"         
 [37] "CHR_MG4138_PATCH"          "CHR_MG4151_PATCH"         
 [39] "CHR_MG4162_PATCH"          "CHR_MG4180_PATCH"         
 [41] "CHR_MG4198_PATCH"          "CHR_MG4200_PATCH"         
 [43] "CHR_MG4209_PATCH"          "CHR_MG4211_PATCH"         
 [45] "CHR_MG4212_PATCH"          "CHR_MG4213_PATCH"         
 [47] "CHR_MG4214_PATCH"          "CHR_MG4222_MG3908_PATCH"  
 [49] "CHR_MG4243_PATCH"          "CHR_MG4248_PATCH"         
 [51] "CHR_MG4249_PATCH"          "CHR_MG4254_PATCH"         
 [53] "CHR_MG4255_PATCH"          "CHR_MG4259_PATCH"         
 [55] "CHR_MG4261_PATCH"          "CHR_MG4264_PATCH"         
 [57] "CHR_MG4265_PATCH"          "CHR_MG4266_PATCH"         
 [59] "CHR_MG4281_PATCH"          "CHR_MG4288_PATCH"         
 [61] "CHR_MG4308_PATCH"          "CHR_MG4310_MG4311_PATCH"  
 [63] "CHR_MG51_PATCH"            "CHR_MG65_PATCH"           
 [65] "CHR_MG74_PATCH"            "CHR_MG89_PATCH"           
 [67] "CHR_MMCHR1_CHORI29_IDD5_1" "CHR_PWK_PHJ_MMCHR11_CTG1" 
 [69] "CHR_PWK_PHJ_MMCHR11_CTG2"  "CHR_PWK_PHJ_MMCHR11_CTG3" 
 [71] "CHR_WSB_EIJ_MMCHR11_CTG1"  "CHR_WSB_EIJ_MMCHR11_CTG2" 
 [73] "CHR_WSB_EIJ_MMCHR11_CTG3"  "1"                        
 [75] "2"                         "3"                        
 [77] "4"                         "5"                        
 [79] "6"                         "7"                        
 [81] "8"                         "9"                        
 [83] "10"                        "11"                       
 [85] "12"                        "13"                       
 [87] "14"                        "15"                       
 [89] "16"                        "17"                       
 [91] "18"                        "19"                       
 [93] "X"                         "Y"                        
 [95] "MT"                        "GL456210.1"               
 [97] "GL456211.1"                "GL456212.1"               
 [99] "GL456213.1"                "GL456216.1"               
[101] "GL456219.1"                "GL456221.1"               
[103] "GL456233.1"                "GL456239.1"               
[105] "GL456350.1"                "GL456354.1"               
[107] "GL456359.1"                "GL456360.1"               
[109] "GL456366.1"                "GL456367.1"               
[111] "GL456368.1"                "GL456370.1"               
[113] "GL456372.1"                "GL456378.1"               
[115] "GL456379.1"                "GL456381.1"               
[117] "GL456382.1"                "GL456383.1"               
[119] "GL456385.1"                "GL456387.1"               
[121] "GL456389.1"                "GL456390.1"               
[123] "GL456392.1"                "GL456393.1"               
[125] "GL456394.1"                "GL456396.1"               
[127] "JH584292.1"                "JH584293.1"               
[129] "JH584294.1"                "JH584295.1"               
[131] "JH584296.1"                "JH584297.1"               
[133] "JH584298.1"                "JH584299.1"               
[135] "JH584300.1"                "JH584301.1"               
[137] "JH584302.1"                "JH584303.1"               
[139] "JH584304.1"               
sigRegions <- keepSeqlevels(sigRegions, 
                            value = c(1:19,"X","Y"),
                            pruning.mode="tidy")
seqlevels(sigRegions)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"
[12] "12" "13" "14" "15" "16" "17" "18" "19" "X"  "Y" 

Add metadata to GRanges object

A useful propery of GenomicRanges is that we can attach metadata to each range using the mcols function. The metadata can be supplied in the form of a data frame.

mcols(sigRegions) <- sigGenes[match(names(sigRegions), sigGenes$GeneID), ]
sigRegions
GRanges object with 4261 ranges and 16 metadata columns:
                     seqnames            ranges strand |
                        <Rle>         <IRanges>  <Rle> |
  ENSMUSG00000025903        1   4807788-4848410      + |
  ENSMUSG00000103280        1   4905751-4906861      - |
  ENSMUSG00000033793        1   5070018-5162529      + |
  ENSMUSG00000051285        1   7088920-7173628      + |
  ENSMUSG00000103509        1   7148110-7152137      + |
                 ...      ...               ...    ... .
  ENSMUSG00000033478       19 57361009-57389594      + |
  ENSMUSG00000040022       19 59902884-59943654      - |
  ENSMUSG00000024993       19 60811585-60836227      + |
  ENSMUSG00000024997       19 60864051-60874556      - |
  ENSMUSG00000074733       19 61053840-61140840      - |
                                 GeneID         baseMean
                            <character>        <numeric>
  ENSMUSG00000025903 ENSMUSG00000025903 724.446609497753
  ENSMUSG00000103280 ENSMUSG00000103280 11.0727087099247
  ENSMUSG00000033793 ENSMUSG00000033793 1263.66334600512
  ENSMUSG00000051285 ENSMUSG00000051285  1483.9749407736
  ENSMUSG00000103509 ENSMUSG00000103509 25.8677212181917
                 ...                ...              ...
  ENSMUSG00000033478 ENSMUSG00000033478 604.940764140757
  ENSMUSG00000040022 ENSMUSG00000040022 420.277348654596
  ENSMUSG00000024993 ENSMUSG00000024993 273.706275359975
  ENSMUSG00000024997 ENSMUSG00000024997 1155.47226515427
  ENSMUSG00000074733 ENSMUSG00000074733 151.521609493818
                                 logFC             lfcSE
                             <numeric>         <numeric>
  ENSMUSG00000025903  0.64787137782662 0.144229304891253
  ENSMUSG00000103280 -1.58750612880118 0.434503203076042
  ENSMUSG00000033793 0.877213503228488 0.106454855140229
  ENSMUSG00000051285  1.29960059994037 0.176033709290278
  ENSMUSG00000103509  1.18134725817846 0.299145205180617
                 ...               ...               ...
  ENSMUSG00000033478 0.485457965666896 0.143244108885519
  ENSMUSG00000040022  1.04903357170258 0.177594139807896
  ENSMUSG00000024993 0.570208882336257 0.163411073403972
  ENSMUSG00000024997 0.896987748935108 0.184738597369257
  ENSMUSG00000074733 0.831352686057778 0.159578210641601
                                  stat               pvalue
                             <numeric>            <numeric>
  ENSMUSG00000025903  4.50342169142557 6.68680141243707e-06
  ENSMUSG00000103280 -3.55360486193175  0.00037998967372985
  ENSMUSG00000033793  8.25203716658962 1.55716974876409e-16
  ENSMUSG00000051285  7.35737302703237 1.87564671714221e-13
  ENSMUSG00000103509  3.84551924990226 0.000120297419390057
                 ...               ...                  ...
  ENSMUSG00000033478  3.38170660330777 0.000720370394442405
  ENSMUSG00000040022  5.87606192866585 4.20141204139746e-09
  ENSMUSG00000024993  3.45286302118763 0.000554670584402014
  ENSMUSG00000024997  4.88409037207611  1.0390741291339e-06
  ENSMUSG00000074733  5.14752752400069 2.63942285548567e-07
                                      FDR      Entrez
                                <numeric> <character>
  ENSMUSG00000025903 6.82491399525833e-05       18777
  ENSMUSG00000103280  0.00227338281229412        <NA>
  ENSMUSG00000033793 1.34050472716004e-14      108664
  ENSMUSG00000051285 9.64437264692718e-12      319263
  ENSMUSG00000103509 0.000849854587410265        <NA>
                 ...                  ...         ...
  ENSMUSG00000033478  0.00387098901394704      226252
  ENSMUSG00000040022 9.31606807547629e-08       74998
  ENSMUSG00000024993  0.00311011136700511       67894
  ENSMUSG00000024997 1.31438732092901e-05       11757
  ENSMUSG00000074733 3.88962198494305e-06      414758
                          Symbol
                     <character>
  ENSMUSG00000025903      Lypla1
  ENSMUSG00000103280     Gm37277
  ENSMUSG00000033793     Atp6v1h
  ENSMUSG00000051285      Pcmtd1
  ENSMUSG00000103509     Gm38372
                 ...         ...
  ENSMUSG00000033478    Fam160b1
  ENSMUSG00000040022   Rab11fip2
  ENSMUSG00000024993      Fam45a
  ENSMUSG00000024997       Prdx3
  ENSMUSG00000074733      Zfp950
                                                                                                                          Description
                                                                                                                          <character>
  ENSMUSG00000025903                                                          lysophospholipase 1 [Source:MGI Symbol;Acc:MGI:1344588]
  ENSMUSG00000103280                                                        predicted gene, 37277 [Source:MGI Symbol;Acc:MGI:5610505]
  ENSMUSG00000033793                              ATPase, H+ transporting, lysosomal V1 subunit H [Source:MGI Symbol;Acc:MGI:1914864]
  ENSMUSG00000051285 protein-L-isoaspartate (D-aspartate) O-methyltransferase domain containing 1 [Source:MGI Symbol;Acc:MGI:2441773]
  ENSMUSG00000103509                                                        predicted gene, 38372 [Source:MGI Symbol;Acc:MGI:5611600]
                 ...                                                                                                              ...
  ENSMUSG00000033478                               family with sequence similarity 160, member B1 [Source:MGI Symbol;Acc:MGI:2147545]
  ENSMUSG00000040022                                 RAB11 family interacting protein 2 (class I) [Source:MGI Symbol;Acc:MGI:1922248]
  ENSMUSG00000024993                                 family with sequence similarity 45, member A [Source:MGI Symbol;Acc:MGI:1915144]
  ENSMUSG00000024997                                                                peroxiredoxin 3 [Source:MGI Symbol;Acc:MGI:88034]
  ENSMUSG00000074733                                                      zinc finger protein 950 [Source:MGI Symbol;Acc:MGI:2652824]
                            Biotype         Chr     Start
                        <character> <character> <integer>
  ENSMUSG00000025903 protein_coding           1   4807788
  ENSMUSG00000103280            TEC           1   4905751
  ENSMUSG00000033793 protein_coding           1   5070018
  ENSMUSG00000051285 protein_coding           1   7088920
  ENSMUSG00000103509            TEC           1   7148110
                 ...            ...         ...       ...
  ENSMUSG00000033478 protein_coding          19  57361009
  ENSMUSG00000040022 protein_coding          19  59902884
  ENSMUSG00000024993 protein_coding          19  60811585
  ENSMUSG00000024997 protein_coding          19  60864051
  ENSMUSG00000074733 protein_coding          19  61053840
                           End    Strand medianTxLength
                     <integer> <integer>      <numeric>
  ENSMUSG00000025903   4848410         1          903.5
  ENSMUSG00000103280   4906861        -1           1111
  ENSMUSG00000033793   5162529         1           1930
  ENSMUSG00000051285   7173628         1           1906
  ENSMUSG00000103509   7152137         1           4028
                 ...       ...       ...            ...
  ENSMUSG00000033478  57389594         1           5630
  ENSMUSG00000040022  59943654        -1           4482
  ENSMUSG00000024993  60836227         1            826
  ENSMUSG00000024997  60874556        -1           1478
  ENSMUSG00000074733  61140840        -1          758.5
  -------
  seqinfo: 21 sequences from an unspecified genome

Scores and colour on exported tracks

The .bed file format is commonly used to store genomic locations for display in genome browsers (e.g. the UCSC browser or IGV) as tracks. Rather than just representing the genomic locations, the .bed format is also able to colour each range according to some property of the analysis (e.g. direction and magnitude of change) to help highlight particular regions of interest. A score can also be displayed when a particular region is clicked-on.

For the score we can use the \(-log_{10}\) of the adjusted p-value and colour scheme for the regions based on the fold-change

colorRampPalette is a useful function in base R for constructing a palette between two extremes. When choosing colour palettes, make sure they are colour blind friendly. The red / green colour scheme traditionally-applied to microarrays is a bad choice.

We will also truncate the fold-changes to between -5 and 5 to and divide this range into 10 equal bins

rbPal <- colorRampPalette(c("red", "blue"))
logFC <- pmax(sigRegions$logFC, -5)
logFC <- pmin(logFC , 5)
Cols <- rbPal(10)[as.numeric(cut(logFC, breaks = 10))]

The colours and score have to be saved in the GRanges object as score and itemRgb columns respectively, and will be used to construct the browser track. The rtracklayer package can be used to import and export browsers tracks.

Now we can export the signifcant results from the DE analysis as a .bed track using rtracklayer. You can load the resulting file in IGV, if you wish.

mcols(sigRegions)$score <- -log10(sigRegions$FDR)
mcols(sigRegions)$itemRgb <- Cols
sigRegions
GRanges object with 4261 ranges and 18 metadata columns:
                     seqnames            ranges strand |
                        <Rle>         <IRanges>  <Rle> |
  ENSMUSG00000025903        1   4807788-4848410      + |
  ENSMUSG00000103280        1   4905751-4906861      - |
  ENSMUSG00000033793        1   5070018-5162529      + |
  ENSMUSG00000051285        1   7088920-7173628      + |
  ENSMUSG00000103509        1   7148110-7152137      + |
                 ...      ...               ...    ... .
  ENSMUSG00000033478       19 57361009-57389594      + |
  ENSMUSG00000040022       19 59902884-59943654      - |
  ENSMUSG00000024993       19 60811585-60836227      + |
  ENSMUSG00000024997       19 60864051-60874556      - |
  ENSMUSG00000074733       19 61053840-61140840      - |
                                 GeneID         baseMean
                            <character>        <numeric>
  ENSMUSG00000025903 ENSMUSG00000025903 724.446609497753
  ENSMUSG00000103280 ENSMUSG00000103280 11.0727087099247
  ENSMUSG00000033793 ENSMUSG00000033793 1263.66334600512
  ENSMUSG00000051285 ENSMUSG00000051285  1483.9749407736
  ENSMUSG00000103509 ENSMUSG00000103509 25.8677212181917
                 ...                ...              ...
  ENSMUSG00000033478 ENSMUSG00000033478 604.940764140757
  ENSMUSG00000040022 ENSMUSG00000040022 420.277348654596
  ENSMUSG00000024993 ENSMUSG00000024993 273.706275359975
  ENSMUSG00000024997 ENSMUSG00000024997 1155.47226515427
  ENSMUSG00000074733 ENSMUSG00000074733 151.521609493818
                                 logFC             lfcSE
                             <numeric>         <numeric>
  ENSMUSG00000025903  0.64787137782662 0.144229304891253
  ENSMUSG00000103280 -1.58750612880118 0.434503203076042
  ENSMUSG00000033793 0.877213503228488 0.106454855140229
  ENSMUSG00000051285  1.29960059994037 0.176033709290278
  ENSMUSG00000103509  1.18134725817846 0.299145205180617
                 ...               ...               ...
  ENSMUSG00000033478 0.485457965666896 0.143244108885519
  ENSMUSG00000040022  1.04903357170258 0.177594139807896
  ENSMUSG00000024993 0.570208882336257 0.163411073403972
  ENSMUSG00000024997 0.896987748935108 0.184738597369257
  ENSMUSG00000074733 0.831352686057778 0.159578210641601
                                  stat               pvalue
                             <numeric>            <numeric>
  ENSMUSG00000025903  4.50342169142557 6.68680141243707e-06
  ENSMUSG00000103280 -3.55360486193175  0.00037998967372985
  ENSMUSG00000033793  8.25203716658962 1.55716974876409e-16
  ENSMUSG00000051285  7.35737302703237 1.87564671714221e-13
  ENSMUSG00000103509  3.84551924990226 0.000120297419390057
                 ...               ...                  ...
  ENSMUSG00000033478  3.38170660330777 0.000720370394442405
  ENSMUSG00000040022  5.87606192866585 4.20141204139746e-09
  ENSMUSG00000024993  3.45286302118763 0.000554670584402014
  ENSMUSG00000024997  4.88409037207611  1.0390741291339e-06
  ENSMUSG00000074733  5.14752752400069 2.63942285548567e-07
                                      FDR      Entrez
                                <numeric> <character>
  ENSMUSG00000025903 6.82491399525833e-05       18777
  ENSMUSG00000103280  0.00227338281229412        <NA>
  ENSMUSG00000033793 1.34050472716004e-14      108664
  ENSMUSG00000051285 9.64437264692718e-12      319263
  ENSMUSG00000103509 0.000849854587410265        <NA>
                 ...                  ...         ...
  ENSMUSG00000033478  0.00387098901394704      226252
  ENSMUSG00000040022 9.31606807547629e-08       74998
  ENSMUSG00000024993  0.00311011136700511       67894
  ENSMUSG00000024997 1.31438732092901e-05       11757
  ENSMUSG00000074733 3.88962198494305e-06      414758
                          Symbol
                     <character>
  ENSMUSG00000025903      Lypla1
  ENSMUSG00000103280     Gm37277
  ENSMUSG00000033793     Atp6v1h
  ENSMUSG00000051285      Pcmtd1
  ENSMUSG00000103509     Gm38372
                 ...         ...
  ENSMUSG00000033478    Fam160b1
  ENSMUSG00000040022   Rab11fip2
  ENSMUSG00000024993      Fam45a
  ENSMUSG00000024997       Prdx3
  ENSMUSG00000074733      Zfp950
                                                                                                                          Description
                                                                                                                          <character>
  ENSMUSG00000025903                                                          lysophospholipase 1 [Source:MGI Symbol;Acc:MGI:1344588]
  ENSMUSG00000103280                                                        predicted gene, 37277 [Source:MGI Symbol;Acc:MGI:5610505]
  ENSMUSG00000033793                              ATPase, H+ transporting, lysosomal V1 subunit H [Source:MGI Symbol;Acc:MGI:1914864]
  ENSMUSG00000051285 protein-L-isoaspartate (D-aspartate) O-methyltransferase domain containing 1 [Source:MGI Symbol;Acc:MGI:2441773]
  ENSMUSG00000103509                                                        predicted gene, 38372 [Source:MGI Symbol;Acc:MGI:5611600]
                 ...                                                                                                              ...
  ENSMUSG00000033478                               family with sequence similarity 160, member B1 [Source:MGI Symbol;Acc:MGI:2147545]
  ENSMUSG00000040022                                 RAB11 family interacting protein 2 (class I) [Source:MGI Symbol;Acc:MGI:1922248]
  ENSMUSG00000024993                                 family with sequence similarity 45, member A [Source:MGI Symbol;Acc:MGI:1915144]
  ENSMUSG00000024997                                                                peroxiredoxin 3 [Source:MGI Symbol;Acc:MGI:88034]
  ENSMUSG00000074733                                                      zinc finger protein 950 [Source:MGI Symbol;Acc:MGI:2652824]
                            Biotype         Chr     Start
                        <character> <character> <integer>
  ENSMUSG00000025903 protein_coding           1   4807788
  ENSMUSG00000103280            TEC           1   4905751
  ENSMUSG00000033793 protein_coding           1   5070018
  ENSMUSG00000051285 protein_coding           1   7088920
  ENSMUSG00000103509            TEC           1   7148110
                 ...            ...         ...       ...
  ENSMUSG00000033478 protein_coding          19  57361009
  ENSMUSG00000040022 protein_coding          19  59902884
  ENSMUSG00000024993 protein_coding          19  60811585
  ENSMUSG00000024997 protein_coding          19  60864051
  ENSMUSG00000074733 protein_coding          19  61053840
                           End    Strand medianTxLength
                     <integer> <integer>      <numeric>
  ENSMUSG00000025903   4848410         1          903.5
  ENSMUSG00000103280   4906861        -1           1111
  ENSMUSG00000033793   5162529         1           1930
  ENSMUSG00000051285   7173628         1           1906
  ENSMUSG00000103509   7152137         1           4028
                 ...       ...       ...            ...
  ENSMUSG00000033478  57389594         1           5630
  ENSMUSG00000040022  59943654        -1           4482
  ENSMUSG00000024993  60836227         1            826
  ENSMUSG00000024997  60874556        -1           1478
  ENSMUSG00000074733  61140840        -1          758.5
                                score     itemRgb
                            <numeric> <character>
  ENSMUSG00000025903 4.16590281705205     #71008D
  ENSMUSG00000103280 2.64332742777324     #AA0055
  ENSMUSG00000033793  13.872731650181     #71008D
  ENSMUSG00000051285 11.0157260173554     #5500AA
  ENSMUSG00000103509 3.07065537697718     #5500AA
                 ...              ...         ...
  ENSMUSG00000033478 2.41217806122792     #71008D
  ENSMUSG00000040022 7.03076734659767     #5500AA
  ENSMUSG00000024993 2.50722405945875     #71008D
  ENSMUSG00000024997 4.88127663891956     #71008D
  ENSMUSG00000074733 5.41009260377211     #71008D
  -------
  seqinfo: 21 sequences from an unspecified genome
library(rtracklayer)
export(sigRegions , con = "../Course_Materials/results/topHits.bed")

Extracting Reads

As we have been using counts as our starting point, we haven’t investigated the aligned reads from our experiment, and how they are represented. As you may be aware, aligned reads are usually stored in a bam file that can be manipulated with open-source command-line tools such as samtools and picard. Bioconductor provide a low-level interface to data/bam/sam files in the form of the Rsamtools package. The GenomicAlignments package can also be used to retrieve the reads mapping to a particular genomic region in an efficient manner.

library(GenomicAlignments)

In the directory small_bams there should be .bam files for some of the samples in the example study. The workflow to produce these files is described in a supplmentary page for the course. In brief, the raw reads (fastq) were downloaded from the Short Read Archive (SRA) and aligned with hisat2. Each bam file was named according to the file name in SRA, but we have renamed the files according to their name in the study. An index file (.bai) has been generated for each bam file. In order to reduce the size, the bam files used here only contain a subset of the reads that were aligned in the region chr15:101707000-101713000.

list.files("../Course_Materials/counts/small_bams/")
 [1] "MCL1.DG.15.sm.bam"               
 [2] "MCL1.DG.15.sm.bam.bai"           
 [3] "MCL1.DH.15.sm.bam"               
 [4] "MCL1.DH.15.sm.bam.bai"           
 [5] "MCL1.DI.15.sm.bam"               
 [6] "MCL1.DI.15.sm.bam.bai"           
 [7] "MCL1.DJ.15.sm.bam"               
 [8] "MCL1.DJ.15.sm.bam.bai"           
 [9] "MCL1.DK.15.sm.bam"               
[10] "MCL1.DK.15.sm.bam.bai"           
[11] "MCL1.DL.15.sm.bam"               
[12] "MCL1.DL.15.sm.bam.bai"           
[13] "MCL1.LA.15.sm.bam"               
[14] "MCL1.LA.15.sm.bam.bai"           
[15] "MCL1.LB.15.sm.bam"               
[16] "MCL1.LB.15.sm.bam.bai"           
[17] "MCL1.LC.15.sm.bam"               
[18] "MCL1.LC.15.sm.bam.bai"           
[19] "MCL1.LD.15.sm.bam"               
[20] "MCL1.LD.15.sm.bam.bai"           
[21] "MCL1.LE.15.sm.bam"               
[22] "MCL1.LE.15.sm.bam.bai"           
[23] "MCL1.LF.15.sm.bam"               
[24] "MCL1.LF.15.sm.bam.bai"           
[25] "Mus_musculus.GRCm38.80.chr15.gtf"

The readGAlignments function provides a simple interface to interrogate the aligned reads for a particular sample. It can also utilise the index file in order to retrieve only the reads that correspond to a specific region in an efficient manner. The output includes the genomic location of each aligned read and the CIGAR (Compact Idiosyncratic Gapped Alignment Report); where M denotes an match to the genome and I, D correspond to insertions and deletions.

exo <- exonsBy(txMm, "gene") 
generegion <- exo[["ENSMUSG00000022146"]] %>% 
    keepSeqlevels(value = 15, pruning.mode="tidy")
my.reads <- readGAlignments(file="../Course_Materials/counts/small_bams/MCL1.DG.15.sm.bam",
                       param=ScanBamParam(which=generegion))
my.reads
GAlignments object with 25419 alignments and 0 metadata columns:
          seqnames strand          cigar    qwidth     start
             <Rle>  <Rle>    <character> <integer> <integer>
      [1]       15      + 81M53311N11M8S       100   6799340
      [2]       15      +           100M       100   6813575
      [3]       15      +          3S97M       100   6813579
      [4]       15      +          6S94M       100   6813579
      [5]       15      +           100M       100   6813580
      ...      ...    ...            ...       ...       ...
  [25415]       15      -           100M       100   6874937
  [25416]       15      -           100M       100   6874941
  [25417]       15      -          99M1S       100   6874945
  [25418]       15      +           100M       100   6874962
  [25419]       15      -           100M       100   6874966
                end     width     njunc
          <integer> <integer> <integer>
      [1]   6852742     53403         1
      [2]   6813674       100         0
      [3]   6813675        97         0
      [4]   6813672        94         0
      [5]   6813679       100         0
      ...       ...       ...       ...
  [25415]   6875036       100         0
  [25416]   6875040       100         0
  [25417]   6875043        99         0
  [25418]   6875061       100         0
  [25419]   6875065       100         0
  -------
  seqinfo: 66 sequences from an unspecified genome

It is possible to tweak the function to retrieve other potentially-useful information from the bam file, such as the mapping quality and flag.

my.reads <- readGAlignments(file="../Course_Materials/counts/small_bams/MCL1.DG.15.sm.bam",
                       param=ScanBamParam(which=generegion,
                                          what=c("seq","mapq","flag")))
my.reads
GAlignments object with 25419 alignments and 3 metadata columns:
          seqnames strand          cigar    qwidth     start
             <Rle>  <Rle>    <character> <integer> <integer>
      [1]       15      + 81M53311N11M8S       100   6799340
      [2]       15      +           100M       100   6813575
      [3]       15      +          3S97M       100   6813579
      [4]       15      +          6S94M       100   6813579
      [5]       15      +           100M       100   6813580
      ...      ...    ...            ...       ...       ...
  [25415]       15      -           100M       100   6874937
  [25416]       15      -           100M       100   6874941
  [25417]       15      -          99M1S       100   6874945
  [25418]       15      +           100M       100   6874962
  [25419]       15      -           100M       100   6874966
                end     width     njunc |
          <integer> <integer> <integer> |
      [1]   6852742     53403         1 |
      [2]   6813674       100         0 |
      [3]   6813675        97         0 |
      [4]   6813672        94         0 |
      [5]   6813679       100         0 |
      ...       ...       ...       ... .
  [25415]   6875036       100         0 |
  [25416]   6875040       100         0 |
  [25417]   6875043        99         0 |
  [25418]   6875061       100         0 |
  [25419]   6875065       100         0 |
                              seq      mapq      flag
                   <DNAStringSet> <integer> <integer>
      [1] GTTTGGAAGT...TCTCCTAAAC        60         0
      [2] GAAATGTTTT...ATCAATGTCA        60         0
      [3] TTTTGTTTTA...TCAATGTCAT        60         0
      [4] TTTTTTTGTT...AAATCAATGT        60         0
      [5] GTTTTAATTT...TGTCATTAAC        60         0
      ...                     ...       ...       ...
  [25415] TCTCTTTATG...TTCCCACCAG        60        16
  [25416] TTTATGGCTG...CACCAGTCGC        60        16
  [25417] TGGCTGCATG...AGTCGCCAGA        60        16
  [25418] GTCCACAGCC...GCCTGGAGAA        60         0
  [25419] ACAGCCACGT...GGAGAACCGC        60        16
  -------
  seqinfo: 66 sequences from an unspecified genome

The flag can represent useful QC information. e.g.

The combination of any of these properties is used to derive a numeric value, as illustrated in this useful resource

Particular attributes of the reads can be extracted and visualised

hist(mcols(my.reads)$mapq, main="", xlab="MAPQ")

However, there are more-sophisticated visualisation options for aligned reads and range data. We will use the ggbio package, which first requires some discussion of the ggplot2 plotting package.

Composing plots with ggbio

We will now take a brief look at one of the visualisation packages in Bioconductor that takes advantage of the GenomicRanges and GenomicFeatures object-types. In this section we will show a worked example of how to combine several types of genomic data on the same plot. The documentation for ggbio is very extensive and contains lots of examples.

http://www.tengfei.name/ggbio/docs/

The Gviz package is another Bioconductor package that specialising in genomic visualisations, but we will not explore this package in the course.

The Manhattan plot is a common way of visualising genome-wide results, especially when one is concerned with the results of a GWAS study and identifying strongly-associated hits.

The profile is supposed to resemble the Manhattan skyline with particular skyscrapers towering about the lower level buildings.