Strip chart for gene expression
- Interactive StripChart with Glimma
Working with Genomic Locations - Transcript database packages

library(DESeq2)
library(tidyverse)

Load in the datasets from the Annotation and Visualisation main session

ddsObj <- readRDS("RObjects/DESeqDataSet.interaction.rds")
shrinkTab.11 <- readRDS("RObjects/Shrunk_Results.d11.rds")
sampleinfo <- read_tsv("data/samplesheet_corrected.tsv")

Strip chart for gene expression

Before following up on the DE genes with further lab work, a recommended sanity check is to have a look at the expression levels of the individual samples for the genes of interest. We can quickly look at grouped expression by using plotCounts function of DESeq2 to retrieve the normalised expression values from the ddsObj object and then plotting with ggplot2.

We are going investigate the following gene:

GeneID	Symbol	Description	logFC	FDR
ENSMUSG00000032089	Il10ra	interleukin 10 receptor, alpha	1.48	1.85e-10

geneID <- filter(shrinkTab.11, Symbol=="Il10ra") %>% pull(GeneID)

plotCounts(ddsObj, 
           gene = geneID, 
           intgroup = c("TimePoint", "Status", "Replicate"),
           returnData = T) %>% 
    ggplot(aes(x=Status, y=log2(count))) +
      geom_point(aes(fill=Replicate), shape=21, size=2) +
      facet_wrap(~TimePoint) +
      expand_limits(y=0) +
      labs(title = "Normalised counts - Interleukin 10 receptor, alpha")

Interactive StripChart with Glimma

An interactive version of the volcano plot above that includes the raw per sample values in a separate panel is possible via the glXYPlot function in the Glimma package.

library(Glimma)

group <- sampleinfo %>% 
  mutate(Group=str_c(Status, ".", TimePoint)) %>% 
  pull(Group)

de <- as.integer(shrinkTab.11$FDR <= 0.05 & !is.na(shrinkTab.11$FDR))

normCounts <- rlog(ddsObj) %>% 
  assay()

glXYPlot(
  x = shrinkTab.11$logFC,
  y = -log10(shrinkTab.11$pvalue),
  xlab = "logFC",
  ylab = "FDR",
  main = "Infected v Uninfected - day 11",
  counts = normCounts,
  groups = group,
  status = de,
  anno = shrinkTab.11[, c("GeneID", "Symbol", "Description")],
  folder = "volcano"
)

This function creates an html page (at ./volcano/XY-Plot.html) with a volcano plot on the left and a plot showing the log-CPM per sample for a selected gene on the right. A search bar is available to search for genes of interest.

Working with Genomic Locations - Transcript database packages

There is a whole suite of annotation packages that can be used to access and to perform advanced queries on information about the genomic location of genes, trancripts and exons. These are listed on the Bioconductor annotation page and have the prefix TxDb. (where “tx” is “transcript”). In addition there are a large number of packages that make use of these annotations for downstream analyses and visualizations.

Unfortunately, these TxDb packages do not cover all species and tend only to be available for UCSC genomes. Thankfully, there is a way to build your own database from either a GTF file or from various online resources such as Biomart using the package GenomicFeatures.

library(GenomicFeatures)

Creating a TxDb database from Ensembl

The created database is only loaded into the current R session.

Note: you may need to install the RMariaDB package in order to use this command.

txMm <- makeTxDbFromEnsembl(organism="Mus musculus", release=102)

## Fetch transcripts and genes from Ensembl ... OK
##   (fetched 144778 transcripts from 56305 genes)
## Fetch exons and CDS from Ensembl ... OK
## Fetch chromosome names and lengths from Ensembl ...OK
## Gather the metadata ... OK
## Make the TxDb object ... OK

In order to avoid having to query Ensembl each time we want to use the database we can save the database.

saveDb(txMm, file = "RObjects/TxDb.GRCm38.102.sqlite")

## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: Ensembl
## # Organism: Mus musculus
## # Ensembl release: 102
## # Ensembl database: mus_musculus_core_102_38
## # MySQL server: ensembldb.ensembl.org
## # Full dataset: yes
## # Nb of transcripts: 144778
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2021-04-19 12:02:51 +0100 (Mon, 19 Apr 2021)
## # GenomicFeatures version at creation time: 1.42.3
## # RSQLite version at creation time: 2.2.6
## # DBSCHEMAVERSION: 1.2

To reload the database:

txMm <- loadDb("RObjects/TxDb.GRCm38.102.sqlite")

Retrieving information from TxDb packages

Accessing the information in these TxDb databases is similar to the way in which we accessed information using biomaRt except that filters (the information we are filtering on) are now called keys and attributes (things we want to retrieve) are columns.

First we need to decide what information we want. In order to see what we can extract we can run the columns function on the annotation database.

columns(txMm)

##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSPHASE"  
##  [6] "CDSSTART"   "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"    
## [11] "EXONNAME"   "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"    
## [16] "TXCHROM"    "TXEND"      "TXID"       "TXNAME"     "TXSTART"   
## [21] "TXSTRAND"   "TXTYPE"

We are going to filter the database by a key or set of keys in order to extract the information we want. Valid names for the key can be retrieved with the keytypes function.

keytypes(txMm)

## [1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"

To extract information we use the select function. Let’s get transcript information for our most highly differentially expressed gene.

keyList <- filter(shrinkTab.11, Symbol=="Il10ra") %>% pull(GeneID)
select(txMm, 
       keys=keyList,
       keytype = "GENEID",
       columns=c("TXNAME", "TXCHROM", "TXSTART", "TXEND", "TXSTRAND", "TXTYPE")
      )

## 'select()' returned 1:many mapping between keys and columns

##               GENEID             TXNAME                  TXTYPE TXCHROM
## 1 ENSMUSG00000032089 ENSMUST00000034594          protein_coding       9
## 2 ENSMUSG00000032089 ENSMUST00000176808 nonsense_mediated_decay       9
## 3 ENSMUSG00000032089 ENSMUST00000176222          protein_coding       9
##   TXSTRAND  TXSTART    TXEND
## 1        - 45253837 45269149
## 2        - 45253840 45269149
## 3        - 45254527 45269146

Overview of GenomicRanges

One of the real strengths of the txdb.. databases is the ability to interface with GenomicRanges, which is the object type used throughout Bioconductor to manipulate Genomic Intervals.

These object types permit us to perform common operations on intervals such as overlapping and counting. We can define the chromosome, start and end position of each region (also strand too, but not shown here).

library(GenomicRanges)
simple_range <- GRanges(seqnames = "1", ranges = IRanges(start=1000, end=2000))
simple_range

## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        1 1000-2000      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

We don’t have to have all our ranges located on the same chromosome

chrs <- c("13", "15", "5")
start <- c(73000000, 6800000, 15000000)
end <- c(74000000, 6900000, 16000000)

my_ranges <- GRanges(seqnames = rep(chrs, 3),
                     ranges = IRanges(start = rep(start, each = 3),
                                      end = rep(end, each = 3))
                     )
my_ranges

## GRanges object with 9 ranges and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]       13 73000000-74000000      *
##   [2]       15 73000000-74000000      *
##   [3]        5 73000000-74000000      *
##   [4]       13   6800000-6900000      *
##   [5]       15   6800000-6900000      *
##   [6]        5   6800000-6900000      *
##   [7]       13 15000000-16000000      *
##   [8]       15 15000000-16000000      *
##   [9]        5 15000000-16000000      *
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

There are a number of useful functions for calculating properties of the data (such as coverage or sorting). Not so much for RNA-seq analysis, but GenomicRanges are used throughout Bioconductor for the analysis of NGS data.

For instance, we can quickly identify overlapping regions between two GenomicRanges.

keys <- c("ENSMUSG00000021604", "ENSMUSG00000022146", "ENSMUSG00000040118")
genePos <- select(txMm,
                  keys = keys,
                  keytype = "GENEID",
                  columns = c("EXONCHROM", "EXONSTART", "EXONEND")
                  )

## 'select()' returned 1:many mapping between keys and columns

geneRanges <- GRanges(genePos$EXONCHROM, 
                      ranges = IRanges(genePos$EXONSTART, genePos$EXONEND), 
                      GENEID = genePos$GENEID)
geneRanges

## GRanges object with 96 ranges and 1 metadata column:
##        seqnames            ranges strand |             GENEID
##           <Rle>         <IRanges>  <Rle> |        <character>
##    [1]       13 73260479-73260653      * | ENSMUSG00000021604
##    [2]       13 73264848-73264979      * | ENSMUSG00000021604
##    [3]       13 73265458-73265709      * | ENSMUSG00000021604
##    [4]       13 73266596-73266708      * | ENSMUSG00000021604
##    [5]       13 73267504-73267832      * | ENSMUSG00000021604
##    ...      ...               ...    ... .                ...
##   [92]        5 16327973-16329883      * | ENSMUSG00000040118
##   [93]        5 16326151-16326383      * | ENSMUSG00000040118
##   [94]        5 16340707-16341059      * | ENSMUSG00000040118
##   [95]        5 16361395-16361875      * | ENSMUSG00000040118
##   [96]        5 16362265-16362326      * | ENSMUSG00000040118
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

findOverlaps(my_ranges, geneRanges)

## Hits object with 40 hits and 0 metadata columns:
##        queryHits subjectHits
##        <integer>   <integer>
##    [1]         1           1
##    [2]         1           2
##    [3]         1           3
##    [4]         1           4
##    [5]         1           5
##    ...       ...         ...
##   [36]         9          36
##   [37]         9          75
##   [38]         9          84
##   [39]         9          85
##   [40]         9          87
##   -------
##   queryLength: 9 / subjectLength: 96

However, we have to pay attention to the naming convention used for each object. seqlevelsStyle can help.

seqlevelsStyle(simple_range)

## [1] "NCBI"    "Ensembl" "MSU6"    "AGPvF"

seqlevelsStyle(my_ranges)

## [1] "NCBI"    "Ensembl" "JGI2.F"

seqlevelsStyle(geneRanges)

## [1] "NCBI"    "Ensembl" "JGI2.F"

Exporting tracks

It is also possible to save the results of a Bioconductor analysis as a browser track for viewing in genome browsers such as IGV or the UCSC genome browser. This enables interactive analysis and integration with other data types, or sharing with collaborators.

For instance, we might want a browser track to indicate where our differentially-expressed genes are located. We shall use the bed format to display these locations. We will annotate the ranges with information from our analysis such as the fold-change and significance.

First we create a data frame for just the DE genes.

sigGenes <- filter(shrinkTab.11, FDR <= 0.01)
message("Number of significantly DE genes: ", nrow(sigGenes))

## Number of significantly DE genes: 778

head(sigGenes)

##               GeneID   baseMean     logFC     lfcSE       pvalue          FDR
## 1 ENSMUSG00000000078 1297.09024 0.3762144 0.2067458 3.589610e-04 8.418705e-03
## 2 ENSMUSG00000000204   91.58889 5.3942430 1.1113229 1.164827e-07 5.713459e-06
## 3 ENSMUSG00000000275  595.50222 1.9529231 0.2286427 1.144556e-18 1.397700e-16
## 4 ENSMUSG00000000290  654.15189 2.7878353 0.3636266 6.125412e-15 5.708562e-13
## 5 ENSMUSG00000000317   73.03682 1.3103677 0.4880660 7.577248e-05 2.167534e-03
## 6 ENSMUSG00000000386  145.26601 5.8158264 0.7299876 1.531434e-16 1.585796e-14
##   Entrez Symbol
## 1  23849   Klf6
## 2  20558  Slfn4
## 3 217069 Trim25
## 4  16414  Itgb2
## 5  12029  Bcl6b
## 6     NA    Mx1
##                                                           Description
## 1           Kruppel-like factor 6 [Source:MGI Symbol;Acc:MGI:1346318]
## 2                      schlafen 4 [Source:MGI Symbol;Acc:MGI:1329010]
## 3   tripartite motif-containing 25 [Source:MGI Symbol;Acc:MGI:102749]
## 4                   integrin beta 2 [Source:MGI Symbol;Acc:MGI:96611]
## 5 B cell CLL/lymphoma 6, member B [Source:MGI Symbol;Acc:MGI:1278332]
## 6          MX dynamin-like GTPase 1 [Source:MGI Symbol;Acc:MGI:97243]
##                  Biotype Chr    Start      End Strand medianTxLength
## 1         protein_coding  13  5861482  5870394      1           1115
## 2         protein_coding  11 83175186 83190216      1           3703
## 3         protein_coding  11 88999376 89020293      1           5566
## 4         protein_coding  10 77530252 77565708      1           2411
## 5         protein_coding  11 70224128 70229798     -1            652
## 6 polymorphic_pseudogene  16 97447035 97462907     -1           1163

Create a genomic ranges object

Several convenience functions exist to retrieve the structure of every gene from a given TxDb object in one list. The output of exonsBy is a list, where each item in the list is the exon co-ordinates of a particular gene, however, we do not need this level of granularity for the bed output, so we will collapse to a single region for each gene.

First we use the range function to obtain a single range for every gene and then transform to a more convenient object with unlist.

exoRanges <- exonsBy(txMm, "gene") %>% 
    range() %>% 
    unlist()

sigRegions <- exoRanges[na.omit(match(sigGenes$GeneID, names(exoRanges)))]
sigRegions

## GRanges object with 778 ranges and 0 metadata columns:
##                                    seqnames            ranges strand
##                                       <Rle>         <IRanges>  <Rle>
##   ENSMUSG00000000078                     13   5861482-5870394      +
##   ENSMUSG00000000204                     11 83175186-83190216      +
##   ENSMUSG00000000275                     11 88999376-89020293      +
##   ENSMUSG00000000290                     10 77530252-77565708      +
##   ENSMUSG00000000317                     11 70224128-70229798      -
##                  ...                    ...               ...    ...
##   ENSMUSG00000112148                     10 51490956-51496611      +
##   ENSMUSG00000115219                     16 20611593-20619011      +
##   ENSMUSG00000115338                     14 50931082-50965237      +
##   ENSMUSG00000115886 CHR_WSB_EIJ_MMCHR11_.. 49001537-49002781      -
##   ENSMUSG00000116526 CHR_CAST_EI_MMCHR11_.. 71207858-71242913      -
##   -------
##   seqinfo: 139 sequences (1 circular) from an unspecified genome

For visualization purposes, we are going to restrict the data to genes that are located on chromosomes 1 to 19 and the sex chromosomes. This can be done with the keepSeqLevels function.

seqlevels(sigRegions)

##   [1] "CHR_CAST_EI_MMCHR11_CTG4"  "CHR_CAST_EI_MMCHR11_CTG5" 
##   [3] "CHR_MG104_PATCH"           "CHR_MG117_PATCH"          
##   [5] "CHR_MG132_PATCH"           "CHR_MG153_PATCH"          
##   [7] "CHR_MG171_PATCH"           "CHR_MG184_PATCH"          
##   [9] "CHR_MG190_MG3751_PATCH"    "CHR_MG191_PATCH"          
##  [11] "CHR_MG209_PATCH"           "CHR_MG3172_PATCH"         
##  [13] "CHR_MG3231_PATCH"          "CHR_MG3251_PATCH"         
##  [15] "CHR_MG3490_PATCH"          "CHR_MG3496_PATCH"         
##  [17] "CHR_MG3530_PATCH"          "CHR_MG3561_PATCH"         
##  [19] "CHR_MG3562_PATCH"          "CHR_MG3609_PATCH"         
##  [21] "CHR_MG3618_PATCH"          "CHR_MG3627_PATCH"         
##  [23] "CHR_MG3648_PATCH"          "CHR_MG3656_PATCH"         
##  [25] "CHR_MG3683_PATCH"          "CHR_MG3686_PATCH"         
##  [27] "CHR_MG3699_PATCH"          "CHR_MG3700_PATCH"         
##  [29] "CHR_MG3712_PATCH"          "CHR_MG3714_PATCH"         
##  [31] "CHR_MG3829_PATCH"          "CHR_MG3833_MG4220_PATCH"  
##  [33] "CHR_MG3835_PATCH"          "CHR_MG3836_PATCH"         
##  [35] "CHR_MG3999_PATCH"          "CHR_MG4136_PATCH"         
##  [37] "CHR_MG4138_PATCH"          "CHR_MG4151_PATCH"         
##  [39] "CHR_MG4162_PATCH"          "CHR_MG4180_PATCH"         
##  [41] "CHR_MG4198_PATCH"          "CHR_MG4200_PATCH"         
##  [43] "CHR_MG4209_PATCH"          "CHR_MG4211_PATCH"         
##  [45] "CHR_MG4212_PATCH"          "CHR_MG4213_PATCH"         
##  [47] "CHR_MG4214_PATCH"          "CHR_MG4222_MG3908_PATCH"  
##  [49] "CHR_MG4243_PATCH"          "CHR_MG4248_PATCH"         
##  [51] "CHR_MG4249_PATCH"          "CHR_MG4254_PATCH"         
##  [53] "CHR_MG4255_PATCH"          "CHR_MG4259_PATCH"         
##  [55] "CHR_MG4261_PATCH"          "CHR_MG4264_PATCH"         
##  [57] "CHR_MG4265_PATCH"          "CHR_MG4266_PATCH"         
##  [59] "CHR_MG4281_PATCH"          "CHR_MG4288_PATCH"         
##  [61] "CHR_MG4308_PATCH"          "CHR_MG4310_MG4311_PATCH"  
##  [63] "CHR_MG51_PATCH"            "CHR_MG65_PATCH"           
##  [65] "CHR_MG74_PATCH"            "CHR_MG89_PATCH"           
##  [67] "CHR_MMCHR1_CHORI29_IDD5_1" "CHR_PWK_PHJ_MMCHR11_CTG1" 
##  [69] "CHR_PWK_PHJ_MMCHR11_CTG2"  "CHR_PWK_PHJ_MMCHR11_CTG3" 
##  [71] "CHR_WSB_EIJ_MMCHR11_CTG1"  "CHR_WSB_EIJ_MMCHR11_CTG2" 
##  [73] "CHR_WSB_EIJ_MMCHR11_CTG3"  "1"                        
##  [75] "2"                         "3"                        
##  [77] "4"                         "5"                        
##  [79] "6"                         "7"                        
##  [81] "8"                         "9"                        
##  [83] "10"                        "11"                       
##  [85] "12"                        "13"                       
##  [87] "14"                        "15"                       
##  [89] "16"                        "17"                       
##  [91] "18"                        "19"                       
##  [93] "X"                         "Y"                        
##  [95] "MT"                        "GL456210.1"               
##  [97] "GL456211.1"                "GL456212.1"               
##  [99] "GL456213.1"                "GL456216.1"               
## [101] "GL456219.1"                "GL456221.1"               
## [103] "GL456233.1"                "GL456239.1"               
## [105] "GL456350.1"                "GL456354.1"               
## [107] "GL456359.1"                "GL456360.1"               
## [109] "GL456366.1"                "GL456367.1"               
## [111] "GL456368.1"                "GL456370.1"               
## [113] "GL456372.1"                "GL456378.1"               
## [115] "GL456379.1"                "GL456381.1"               
## [117] "GL456382.1"                "GL456383.1"               
## [119] "GL456385.1"                "GL456387.1"               
## [121] "GL456389.1"                "GL456390.1"               
## [123] "GL456392.1"                "GL456393.1"               
## [125] "GL456394.1"                "GL456396.1"               
## [127] "JH584292.1"                "JH584293.1"               
## [129] "JH584294.1"                "JH584295.1"               
## [131] "JH584296.1"                "JH584297.1"               
## [133] "JH584298.1"                "JH584299.1"               
## [135] "JH584300.1"                "JH584301.1"               
## [137] "JH584302.1"                "JH584303.1"               
## [139] "JH584304.1"

sigRegions <- keepSeqlevels(sigRegions, 
                            value = c(1:19,"X","Y"),
                            pruning.mode="tidy")
seqlevels(sigRegions)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "X"  "Y"

Add metadata to GRanges object

A useful property of GenomicRanges is that we can attach metadata to each range using the mcols function. The metadata can be supplied in the form of a data frame.

mcols(sigRegions) <- sigGenes[match(names(sigRegions), sigGenes$GeneID), ]
sigRegions

## GRanges object with 775 ranges and 15 metadata columns:
##                      seqnames            ranges strand |             GeneID
##                         <Rle>         <IRanges>  <Rle> |        <character>
##   ENSMUSG00000000078       13   5861482-5870394      + | ENSMUSG00000000078
##   ENSMUSG00000000204       11 83175186-83190216      + | ENSMUSG00000000204
##   ENSMUSG00000000275       11 88999376-89020293      + | ENSMUSG00000000275
##   ENSMUSG00000000290       10 77530252-77565708      + | ENSMUSG00000000290
##   ENSMUSG00000000317       11 70224128-70229798      - | ENSMUSG00000000317
##                  ...      ...               ...    ... .                ...
##   ENSMUSG00000106379        5 22746059-23275597      + | ENSMUSG00000106379
##   ENSMUSG00000112023       10 51480632-51486703      + | ENSMUSG00000112023
##   ENSMUSG00000112148       10 51490956-51496611      + | ENSMUSG00000112148
##   ENSMUSG00000115219       16 20611593-20619011      + | ENSMUSG00000115219
##   ENSMUSG00000115338       14 50931082-50965237      + | ENSMUSG00000115338
##                       baseMean     logFC     lfcSE      pvalue         FDR
##                      <numeric> <numeric> <numeric>   <numeric>   <numeric>
##   ENSMUSG00000000078 1297.0902  0.376214  0.206746 3.58961e-04 8.41871e-03
##   ENSMUSG00000000204   91.5889  5.394243  1.111323 1.16483e-07 5.71346e-06
##   ENSMUSG00000000275  595.5022  1.952923  0.228643 1.14456e-18 1.39770e-16
##   ENSMUSG00000000290  654.1519  2.787835  0.363627 6.12541e-15 5.70856e-13
##   ENSMUSG00000000317   73.0368  1.310368  0.488066 7.57725e-05 2.16753e-03
##                  ...       ...       ...       ...         ...         ...
##   ENSMUSG00000106379 1030.4365 -0.723935  0.260638 6.55536e-05 1.90914e-03
##   ENSMUSG00000112023   88.1297  4.023570  0.701758 3.02281e-09 1.83304e-07
##   ENSMUSG00000112148  218.3858  3.602317  0.602734 8.15026e-10 5.26703e-08
##   ENSMUSG00000115219   38.1574  1.755783  0.448223 3.95427e-06 1.47097e-04
##   ENSMUSG00000115338  861.4575  0.935192  0.184634 8.34511e-08 4.15075e-06
##                         Entrez      Symbol            Description
##                      <integer> <character>            <character>
##   ENSMUSG00000000078     23849        Klf6 Kruppel-like factor ..
##   ENSMUSG00000000204     20558       Slfn4 schlafen 4 [Source:M..
##   ENSMUSG00000000275    217069      Trim25 tripartite motif-con..
##   ENSMUSG00000000290     16414       Itgb2 integrin beta 2 [Sou..
##   ENSMUSG00000000317     12029       Bcl6b B cell CLL/lymphoma ..
##                  ...       ...         ...                    ...
##   ENSMUSG00000106379    269629      Lhfpl3 lipoma HMGIC fusion ..
##   ENSMUSG00000112023     14727      Lilr4b leukocyte immunoglob..
##   ENSMUSG00000112148     14728     Lilrb4a leukocyte immunoglob..
##   ENSMUSG00000115219 110599566   Eef1akmt4 EEF1A lysine methylt..
##   ENSMUSG00000115338     18950         Pnp purine-nucleoside ph..
##                             Biotype         Chr     Start       End    Strand
##                         <character> <character> <integer> <integer> <integer>
##   ENSMUSG00000000078 protein_coding          13   5861482   5870394         1
##   ENSMUSG00000000204 protein_coding          11  83175186  83190216         1
##   ENSMUSG00000000275 protein_coding          11  88999376  89020293         1
##   ENSMUSG00000000290 protein_coding          10  77530252  77565708         1
##   ENSMUSG00000000317 protein_coding          11  70224128  70229798        -1
##                  ...            ...         ...       ...       ...       ...
##   ENSMUSG00000106379 protein_coding           5  22746059  23275597         1
##   ENSMUSG00000112023 protein_coding          10  51480632  51486703         1
##   ENSMUSG00000112148 protein_coding          10  51490956  51496611         1
##   ENSMUSG00000115219 protein_coding          16  20611593  20619011         1
##   ENSMUSG00000115338 protein_coding          14  50931082  50965237         1
##                      medianTxLength
##                           <numeric>
##   ENSMUSG00000000078           1115
##   ENSMUSG00000000204           3703
##   ENSMUSG00000000275           5566
##   ENSMUSG00000000290           2411
##   ENSMUSG00000000317            652
##                  ...            ...
##   ENSMUSG00000106379           2852
##   ENSMUSG00000112023           1804
##   ENSMUSG00000112148           1832
##   ENSMUSG00000115219           1299
##   ENSMUSG00000115338            928
##   -------
##   seqinfo: 21 sequences from an unspecified genome

Scores and colour on exported tracks

The .bed file format is commonly used to store genomic locations for display in genome browsers (e.g. the UCSC browser or IGV) as tracks. Rather than just representing the genomic locations, the .bed format is also able to colour each range according to some property of the analysis (e.g. direction and magnitude of change) to help highlight particular regions of interest. A score can also be displayed when a particular region is clicked-on.

For the score we can use the \(-log_{10}\) of the adjusted p-value and colour scheme for the regions based on the fold-change

colorRampPalette is a useful function in base R for constructing a palette between two extremes. When choosing colour palettes, make sure they are colour blind friendly. The red / green colour scheme traditionally-applied to microarrays is a bad choice.

We will also truncate the fold-changes to between -5 and 5 to and divide this range into 10 equal bins.

rbPal <- colorRampPalette(c("red", "blue"))
logFC <- pmax(sigRegions$logFC, -5)
logFC <- pmin(logFC , 5)

Cols <- rbPal(10)[as.numeric(cut(logFC, breaks = 10))]

The colours and score have to be saved in the GRanges object as score and itemRgb columns respectively, and will be used to construct the browser track. The rtracklayer package can be used to import and export browsers tracks.

Now we can export the signifcant results from the DE analysis as a .bed track using rtracklayer. You can load the resulting file in IGV, if you wish.

mcols(sigRegions)$score <- -log10(sigRegions$FDR)
mcols(sigRegions)$itemRgb <- Cols
sigRegions

## GRanges object with 775 ranges and 17 metadata columns:
##                      seqnames            ranges strand |             GeneID
##                         <Rle>         <IRanges>  <Rle> |        <character>
##   ENSMUSG00000000078       13   5861482-5870394      + | ENSMUSG00000000078
##   ENSMUSG00000000204       11 83175186-83190216      + | ENSMUSG00000000204
##   ENSMUSG00000000275       11 88999376-89020293      + | ENSMUSG00000000275
##   ENSMUSG00000000290       10 77530252-77565708      + | ENSMUSG00000000290
##   ENSMUSG00000000317       11 70224128-70229798      - | ENSMUSG00000000317
##                  ...      ...               ...    ... .                ...
##   ENSMUSG00000106379        5 22746059-23275597      + | ENSMUSG00000106379
##   ENSMUSG00000112023       10 51480632-51486703      + | ENSMUSG00000112023
##   ENSMUSG00000112148       10 51490956-51496611      + | ENSMUSG00000112148
##   ENSMUSG00000115219       16 20611593-20619011      + | ENSMUSG00000115219
##   ENSMUSG00000115338       14 50931082-50965237      + | ENSMUSG00000115338
##                       baseMean     logFC     lfcSE      pvalue         FDR
##                      <numeric> <numeric> <numeric>   <numeric>   <numeric>
##   ENSMUSG00000000078 1297.0902  0.376214  0.206746 3.58961e-04 8.41871e-03
##   ENSMUSG00000000204   91.5889  5.394243  1.111323 1.16483e-07 5.71346e-06
##   ENSMUSG00000000275  595.5022  1.952923  0.228643 1.14456e-18 1.39770e-16
##   ENSMUSG00000000290  654.1519  2.787835  0.363627 6.12541e-15 5.70856e-13
##   ENSMUSG00000000317   73.0368  1.310368  0.488066 7.57725e-05 2.16753e-03
##                  ...       ...       ...       ...         ...         ...
##   ENSMUSG00000106379 1030.4365 -0.723935  0.260638 6.55536e-05 1.90914e-03
##   ENSMUSG00000112023   88.1297  4.023570  0.701758 3.02281e-09 1.83304e-07
##   ENSMUSG00000112148  218.3858  3.602317  0.602734 8.15026e-10 5.26703e-08
##   ENSMUSG00000115219   38.1574  1.755783  0.448223 3.95427e-06 1.47097e-04
##   ENSMUSG00000115338  861.4575  0.935192  0.184634 8.34511e-08 4.15075e-06
##                         Entrez      Symbol            Description
##                      <integer> <character>            <character>
##   ENSMUSG00000000078     23849        Klf6 Kruppel-like factor ..
##   ENSMUSG00000000204     20558       Slfn4 schlafen 4 [Source:M..
##   ENSMUSG00000000275    217069      Trim25 tripartite motif-con..
##   ENSMUSG00000000290     16414       Itgb2 integrin beta 2 [Sou..
##   ENSMUSG00000000317     12029       Bcl6b B cell CLL/lymphoma ..
##                  ...       ...         ...                    ...
##   ENSMUSG00000106379    269629      Lhfpl3 lipoma HMGIC fusion ..
##   ENSMUSG00000112023     14727      Lilr4b leukocyte immunoglob..
##   ENSMUSG00000112148     14728     Lilrb4a leukocyte immunoglob..
##   ENSMUSG00000115219 110599566   Eef1akmt4 EEF1A lysine methylt..
##   ENSMUSG00000115338     18950         Pnp purine-nucleoside ph..
##                             Biotype         Chr     Start       End    Strand
##                         <character> <character> <integer> <integer> <integer>
##   ENSMUSG00000000078 protein_coding          13   5861482   5870394         1
##   ENSMUSG00000000204 protein_coding          11  83175186  83190216         1
##   ENSMUSG00000000275 protein_coding          11  88999376  89020293         1
##   ENSMUSG00000000290 protein_coding          10  77530252  77565708         1
##   ENSMUSG00000000317 protein_coding          11  70224128  70229798        -1
##                  ...            ...         ...       ...       ...       ...
##   ENSMUSG00000106379 protein_coding           5  22746059  23275597         1
##   ENSMUSG00000112023 protein_coding          10  51480632  51486703         1
##   ENSMUSG00000112148 protein_coding          10  51490956  51496611         1
##   ENSMUSG00000115219 protein_coding          16  20611593  20619011         1
##   ENSMUSG00000115338 protein_coding          14  50931082  50965237         1
##                      medianTxLength     score     itemRgb
##                           <numeric> <numeric> <character>
##   ENSMUSG00000000078           1115   2.07475     #71008D
##   ENSMUSG00000000204           3703   5.24310     #0000FF
##   ENSMUSG00000000275           5566  15.85459     #5500AA
##   ENSMUSG00000000290           2411  12.24347     #3800C6
##   ENSMUSG00000000317            652   2.66403     #5500AA
##                  ...            ...       ...         ...
##   ENSMUSG00000106379           2852   2.71916     #8D0071
##   ENSMUSG00000112023           1804   6.73683     #0000FF
##   ENSMUSG00000112148           1832   7.27843     #1C00E2
##   ENSMUSG00000115219           1299   3.83240     #5500AA
##   ENSMUSG00000115338            928   5.38187     #71008D
##   -------
##   seqinfo: 21 sequences from an unspecified genome

library(rtracklayer)
export(sigRegions , con = "results/topHits.bed")

Extracting Reads

As we have been using counts as our starting point, we haven’t investigated the aligned reads from our experiment, and how they are represented. The aligned reads are usually stored in a bam file that can be manipulated with open-source command-line tools such as samtools and picard.

Bioconductor provide a low-level interface to data/bam/sam files in the form of the Rsamtools package. The GenomicAlignments package can also be used to retrieve the reads mapping to a particular genomic region in an efficient manner.

library(GenomicAlignments)

In the directory small_bams there should be .bam files for some of the samples in the example study. The workflow to produce these files is described in a supplmentary page for the course. In brief, the raw reads (fastq) were downloaded from the Short Read Archive (SRA) and aligned with hisat2. Each bam file was named according to the file name in SRA, but we have renamed the files according to their name in the study. An index file (.bai) has been generated for each bam file. In order to reduce the size, the bam files used here only contain a subset of the reads that were aligned in the region chr15:101707000-101713000.

list.files("small_bams/")

##  [1] "SRR7657872.sorted.small.bam"     "SRR7657872.sorted.small.bam.bai"
##  [3] "SRR7657874.sorted.small.bam"     "SRR7657874.sorted.small.bam.bai"
##  [5] "SRR7657877.sorted.small.bam"     "SRR7657877.sorted.small.bam.bai"
##  [7] "SRR7657878.sorted.small.bam"     "SRR7657878.sorted.small.bam.bai"
##  [9] "SRR7657883.sorted.small.bam"     "SRR7657883.sorted.small.bam.bai"

The readGAlignments function provides a simple interface to interrogate the aligned reads for a particular sample. It can also utilise the index file in order to retrieve only the reads that correspond to a specific region in an efficient manner. The output includes the genomic location of each aligned read and the CIGAR (Compact Idiosyncratic Gapped Alignment Report); where M denotes an match to the genome and I, D correspond to insertions and deletions.

exo <- exonsBy(txMm, "gene") 
geneID  <- filter(shrinkTab.11, Symbol=="Il10ra") %>% pull(GeneID)
generegion <- exo[[geneID]] %>% 
    keepSeqlevels(value = 9, pruning.mode="tidy")

my.reads <- readGAlignments(file="small_bams/SRR7657872.sorted.small.bam",
                       param=ScanBamParam(which=generegion))
my.reads

## GAlignments object with 6520 alignments and 0 metadata columns:
##          seqnames strand          cigar    qwidth     start       end     width
##             <Rle>  <Rle>    <character> <integer> <integer> <integer> <integer>
##      [1]        9      + 5S9M47494N136M       150  45220421  45268059     47639
##      [2]        9      +           150M       150  45253715  45253864       150
##      [3]        9      +           150M       150  45253850  45253999       150
##      [4]        9      +           150M       150  45253850  45253999       150
##      [5]        9      +           150M       150  45253855  45254004       150
##      ...      ...    ...            ...       ...       ...       ...       ...
##   [6516]        9      -           150M       150  45268954  45269103       150
##   [6517]        9      -           150M       150  45268965  45269114       150
##   [6518]        9      -           150M       150  45268968  45269117       150
##   [6519]        9      -           150M       150  45268971  45269120       150
##   [6520]        9      -           150M       150  45268971  45269120       150
##              njunc
##          <integer>
##      [1]         1
##      [2]         0
##      [3]         0
##      [4]         0
##      [5]         0
##      ...       ...
##   [6516]         0
##   [6517]         0
##   [6518]         0
##   [6519]         0
##   [6520]         0
##   -------
##   seqinfo: 66 sequences from an unspecified genome

It is possible to tweak the function to retrieve other potentially-useful information from the bam file, such as the mapping quality and flag.

my.reads <- readGAlignments(file="small_bams/SRR7657872.sorted.small.bam",
                       param=ScanBamParam(which=generegion,
                                          what=c("seq","mapq","flag")))
my.reads

## GAlignments object with 6520 alignments and 3 metadata columns:
##          seqnames strand          cigar    qwidth     start       end     width
##             <Rle>  <Rle>    <character> <integer> <integer> <integer> <integer>
##      [1]        9      + 5S9M47494N136M       150  45220421  45268059     47639
##      [2]        9      +           150M       150  45253715  45253864       150
##      [3]        9      +           150M       150  45253850  45253999       150
##      [4]        9      +           150M       150  45253850  45253999       150
##      [5]        9      +           150M       150  45253855  45254004       150
##      ...      ...    ...            ...       ...       ...       ...       ...
##   [6516]        9      -           150M       150  45268954  45269103       150
##   [6517]        9      -           150M       150  45268965  45269114       150
##   [6518]        9      -           150M       150  45268968  45269117       150
##   [6519]        9      -           150M       150  45268971  45269120       150
##   [6520]        9      -           150M       150  45268971  45269120       150
##              njunc |                     seq      mapq      flag
##          <integer> |          <DNAStringSet> <integer> <integer>
##      [1]         1 | CAAGGGGGCC...TGAATTCTCC        60        99
##      [2]         0 | AGAGAATATG...TTTTTATTAG        60        99
##      [3]         0 | CTGACTTTTT...AGGAAAACTG        60       163
##      [4]         0 | CTGACTTTTT...AGGAAAACTG        60       163
##      [5]         0 | TTTTTATTAG...AACTGAGGCT        60       163
##      ...       ... .                     ...       ...       ...
##   [6516]         0 | TCTCCCGCGA...TCCACTGGAG        60        83
##   [6517]         0 | TGTGAACTTT...CGGCCTTTAC        60        83
##   [6518]         0 | GAACTTTAAA...CCTTTACGCC        60        83
##   [6519]         0 | CTTTAAAGCA...TTACGCCTCC        60        83
##   [6520]         0 | CTTTAAAGCA...TTACGCCTCC        60        83
##   -------
##   seqinfo: 66 sequences from an unspecified genome

Particular attributes of the reads can be extracted and visualised

hist(mcols(my.reads)$mapq, main="", xlab="MAPQ")

However, there are more-sophisticated visualisation options for aligned reads and range data. We will use the ggbio package, which first requires some discussion of the ggplot2 plotting package.

Composing plots with ggbio

We will now take a brief look at one of the visualisation packages in Bioconductor that takes advantage of the GenomicRanges and GenomicFeatures object-types. In this section we will show a worked example of how to combine several types of genomic data on the same plot. The documentation for ggbio is very extensive and contains lots of examples.

http://www.tengfei.name/ggbio/docs/

The Gviz package is another Bioconductor package that specialising in genomic visualisations, but we will not explore this package in the course.

The Manhattan plot is a common way of visualising genome-wide results, especially when one is concerned with the results of a GWAS study and identifying strongly-associated hits.

The profile is supposed to resemble the Manhattan skyline with particular skyscrapers towering about the lower level buildings.

This type of plot is implemented as the plotGrandLinear function. We have to supply a value to display on the y-axis using the aes function, which is inherited from ggplot2. The positioning of points on the x-axis is handled automatically by ggbio, using the ranges information to get the genomic coordinates of the ranges of interest.

To stop the plots from being too cluttered we will consider 200 random genes only.

library(ggbio)

set.seed(144032) 
sigRegions.200 <- sigRegions[sample(length(sigRegions), 200)] 

plotGrandLinear(sigRegions.200 , aes(y = logFC))

ggbio has alternated the colours of the chromosomes. However, an appealing feature of ggplot2 is the ability to map properties of your plot to variables present in your data. For example, we could create a variable to distinguish between up- and down-regulated genes. The variables used for aesthetic mapping must be present in the mcols section of your ranges object.

mcols(sigRegions.200)$UpRegulated <- mcols(sigRegions.200)$logFC > 0

plotGrandLinear(sigRegions.200,
                aes(y = logFC, shape = UpRegulated, fill = UpRegulated),
                size = 4) +
    scale_shape_manual(values=c(25, 24)) +
    scale_colour_manual(values=rep(c("white", "black"), 10))

## using coord:genome to parse x scale

## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

plotGrandLinear is a special function in ggbio with preset options for the manhattan style of plot. More often, users will call the autoplot function and ggbio will choose the most appropriate layout. One such layout is the karyogram.

autoplot(sigRegions.200, 
         layout="karyogram", 
         aes(color=UpRegulated, fill=UpRegulated))

ggbio is also able to plot the structure of genes according to a particular model represented by a GenomicFeatures object, such as the object we created earlier with the exon coordinates for each gene in the GRCm38 genome.

autoplot(txMm, which=exo[[geneID]])

We can even plot the location of sequencing reads if they have been imported using readGAlignments function (or similar).

myreg <- exo[[geneID]] %>% 
    GenomicRanges::reduce() %>% 
    flank(width = 1000, both = T) %>% 
    keepSeqlevels(value = 9, pruning.mode="tidy")

bam <- readGappedReads(file="small_bams/SRR7657872.sorted.small.bam",
                       param=ScanBamParam(which=myreg), use.names = TRUE)

autoplot(bam, geom = "rect") + 
    xlim(GRanges("9", IRanges(45253000, 45270000)))

## extracting information...

Like ggplot2, ggbio plots can be saved as objects that can later be modified, or combined together to form more complicated plots. If saved in this way, the plot will only be displayed on a plotting device when we query the object. The ggbio command tracks allows us to display multiple tracks in a single plot.

sampleinfo %>% 
  filter(Replicate==1)

## # A tibble: 4 x 4
##   SampleName Replicate Status     TimePoint
##   <chr>          <dbl> <chr>      <chr>    
## 1 SRR7657878         1 Infected   d11      
## 2 SRR7657874         1 Infected   d33      
## 3 SRR7657877         1 Uninfected d11      
## 4 SRR7657883         1 Uninfected d33

bam.78 <- readGappedReads(file="small_bams/SRR7657878.sorted.small.bam",
                       param=ScanBamParam(which=myreg), use.names = TRUE)
bam.74 <- readGappedReads(file="small_bams/SRR7657874.sorted.small.bam",
                       param=ScanBamParam(which=myreg), use.names = TRUE)
bam.77 <- readGappedReads(file="small_bams/SRR7657877.sorted.small.bam",
                       param=ScanBamParam(which=myreg), use.names = TRUE)
bam.83 <- readGappedReads(file="small_bams/SRR7657883.sorted.small.bam",
                       param=ScanBamParam(which=myreg), use.names = TRUE)

geneMod <- autoplot(txMm, which = myreg)  + 
    xlim(GRanges("9", IRanges(45253000, 45270000)))
reads.SRR7657878 <- autoplot(bam.78, stat = "coverage")  + 
    xlim(GRanges("9", IRanges(45253000, 45270000))) +
    scale_y_continuous(limits = c(0, 270)) +
    labs(title="SRR7657878")
reads.SRR7657874 <- autoplot(bam.74, stat = "coverage")  + 
    xlim(GRanges("9", IRanges(45253000, 45270000))) +
    scale_y_continuous(limits = c(0, 270)) +
    labs(title="SRR7657874")
reads.SRR7657877 <- autoplot(bam.77, stat = "coverage")  +
    xlim(GRanges("9", IRanges(45253000, 45270000))) +
    scale_y_continuous(limits = c(0, 270)) +
    labs(title="SRR7657877")
reads.SRR7657883 <- autoplot(bam.83, stat = "coverage")  +
    xlim(GRanges("9", IRanges(45253000, 45270000))) +
    scale_y_continuous(limits = c(0, 270)) +
    labs(title="SRR7657883")

tracks(GRCm38=geneMod, 
       Inftected.d11=reads.SRR7657878, 
       Inftected.d33=reads.SRR7657874, 
       Uninftected.d11=reads.SRR7657877,
       Uninftected.d33=reads.SRR7657883,
       heights=c(4, 2, 2, 2, 2),
       track.plot.color = c("darkgrey", "#D76280", "#D73280", "#62D770", "#32D770"),
       title = "Read Coverage - Interleukin 10 receptor, alpha")

Introduction to Bulk RNAseq data analysis

Additional Visualisation of RNA-seq results - Supplementary Materials

Last modified: 19 Apr 2021