library(DESeq2)
library(tidyverse)
Load in the datasets from the Annotation and Visualisation main session
load("~/Course_Materials/RNAseq/Robjects/Ensembl_annotations.RData")
load("~/Course_Materials/RNAseq/Robjects/Annotated_Results_LvV.RData")
load("~/Course_Materials/RNAseq/Robjects/DE.RData")
Before following up on the DE genes with further lab work, a recommended sanity check is to have a look at the expression levels of the individual samples for the genes of interest. We can quickly look at grouped expression by using plotCounts
function of DESeq2
to retrieve the normalised expression values from the ddsObj
object and then plotting with ggplot2
.
# Let's look at the most significantly differentially expressed gene
topgene <- filter(shrinkLvV, Symbol=="Wap")
geneID <- topgene$GeneID
plotCounts(ddsObj, gene = geneID, intgroup = c("CellType", "Status"),
returnData = T) %>%
ggplot(aes(x=Status, y=log2(count))) +
geom_point(aes(fill=Status), shape=21, size=2) +
facet_wrap(~CellType) +
expand_limits(y=0)
An interactive version of the volcano plot above that includes the raw per sample values in a separate panel is possible via the glXYPlot
function in the Glimma package.
library(Glimma)
group <- sampleinfo %>%
mutate(Group=str_c(CellType, "_", Status) %>%
str_remove_all("[aeiou]")) %>%
pull(Group)
de <- as.integer(shrinkLvV$FDR <= 0.05 & !is.na(shrinkLvV$FDR))
normCounts <- log2(counts(ddsObj))
glXYPlot(
x = shrinkLvV$logFC,
y = -log10(shrinkLvV$pvalue),
xlab = "logFC",
ylab = "FDR",
main = "Lactating v Virgin",
counts = normCounts,
groups = group,
status = de,
anno = shrinkLvV[, c("GeneID", "Symbol", "Description")],
folder = "volcano"
)
This function creates an html page (./volcano/XY-Plot.html) with a volcano plot on the left and a plot showing the log-CPM per sample for a selected gene on the right. A search bar is available to search for genes of interest.
There is a whole suite of annotation packages that can be used to access and for perform advanced queries on information about the genomic location of genes, trancripts and exons. These are listed on the Bioconductor annotation page and have the prefix TxDb.
(where "tx" is "transcript"). In addition there are a large number of packages that make use of these annotations for downstream analyses and visualisations.
Unfortunately, these packages do not cover all species and tend only to be available for UCSC genomes. Thankfully, there is a way to build your own database from either a GTF file or from various online resources such as Biomart using the package GenomicFeatures
.
The created database is only loaded into the current R session. You will need to run this command each time - it can be a little slow.
library(GenomicFeatures)
txMm <- makeTxDbFromGFF(file = "references/Mus_musculus.GRCm38.97.gtf", format = "gtf")
The created database is only loaded into the current R session. You will need to run this command each time.
library(biomaRt)
library(GenomicFeatures)
txMm <- makeTxDbFromBiomart(dataset="mmusculus_gene_ensembl")
This creates an R package that can be installed just like a package that you might download from Bioconductor or CRAN. This can then loaded as normal whenever it is needed, saving you having to build the database each time.
A little extra work is needed at the command line to build the package from the files produced by this method. Feel free to skip this section if you want - the previous two methods are adequate if you can tolerate the short wait each time you create the database.
library(GenomicFeatures)
makeTxDbPackageFromBiomart(version="0.01",
destDir = "~/Course_Materials/RNAseq/references/",
maintainer="Some One <so@someplace.org>",
author="Some One <so@someplace.com>",
dataset="mmusculus_gene_ensembl",
circ_seqs = "MT")
This creates a new folder in your home directory (~
is a shortcut for home).
We will not go into detail about how to contruct an R package, or the contents of the package directory. This method generates all the files you need. More information on contructing R packages can be found in Hadely Wickham's "R Packages" book.
The directory created will be something like TxDb.Mmusculus.BioMart.ENSEMBLMARTENSEMBL.GRCm38.p6
. This going to be the packaged name and is referenced in various files in the package directory. We recommend changing it to something more manageable such as TxDb.Mmusculus.Ens.GRCm38
. We need to change the directory name, the database file name and each reference in the DESCRIPTION
and man/package.Rd
files.
To do this run the following chunk of code. This code is actually run in 'bash', the language used in the terminal rather than R. This shows how versatile RStudio is, you can run code from languages other than R when it is more convenient for you to do so.
OldName=TxDb.Mmusculus.BioMart.ENSEMBLMARTENSEMBL.GRCm38.p6
NewName=TxDb.Mmusculus.Ens.GRCm38
cd ~/Course_Materials/RNAseq/references
# rename the package directory
mv ${OldName} ${NewName}
# rename the database file
mv inst/extdata/${OldName}.sqlite inst/extdata/${NewName}.sqlite
# replace the references in the old directory
sed -i s/${OldName}/${NewName}/ ${NewName}DESCRIPTION
sed -i s/${OldName}/${NewName}/ ${NewName}man/package.Rd
# Build the package from the directory of files created by the above command
R CMD build TxDb.Mmusculus.Ens.GRCm38
# Install the package from the tarball created
R CMD INSTALL TxDb.Mmusculus.Ens.GRCm38.p6_0.01.tar.gz
library(TxDb.Mmusculus.Ens.GRCm38)
## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
##
## select
txMm <- TxDb.Mmusculus.Ens.GRCm38
Accessing the information in these TxDb databases is similar to the way in which we accessed information using biomaRt
except that filters
(the information we are filtering on) are now called keys
and attributes
(things we want to retrieve) are columns
.
First we need to decide what information we want. In order to see what we can extract we can run the columns
function on the annotation database.
library(GenomicFeatures)
columns(txMm)
## [1] "CDSCHROM" "CDSEND" "CDSID" "CDSNAME" "CDSPHASE"
## [6] "CDSSTART" "CDSSTRAND" "EXONCHROM" "EXONEND" "EXONID"
## [11] "EXONNAME" "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID"
## [16] "TXCHROM" "TXEND" "TXID" "TXNAME" "TXSTART"
## [21] "TXSTRAND" "TXTYPE"
We are going to filter the database by a key or set of keys in order to extract the information we want. Valid names for the key can be retrieved with the keytypes
function.
keytypes(txMm)
## [1] "CDSID" "CDSNAME" "EXONID" "EXONNAME" "GENEID" "TXID" "TXNAME"
To extract information we use the select
function. Let's get transcript information for our most highly differentially expressed gene.
keyList <- ensemblAnnot$GeneID[ensemblAnnot$Symbol=="Wap"]
select(txMm,
keys=keyList,
keytype = "GENEID",
columns=c("TXNAME", "TXCHROM", "TXSTART", "TXEND", "TXSTRAND", "TXTYPE")
)
## 'select()' returned 1:many mapping between keys and columns
## GENEID TXNAME TXTYPE TXCHROM TXSTRAND
## 1 ENSMUSG00000000381 ENSMUST00000141868 processed_transcript 11 -
## 2 ENSMUSG00000000381 ENSMUST00000102910 protein_coding 11 -
## 3 ENSMUSG00000000381 ENSMUST00000127341 processed_transcript 11 -
## TXSTART TXEND
## 1 6635482 6637466
## 2 6635482 6638637
## 3 6635539 6637455
One of the real strengths of the txdb..
databases is the ability to interface with GenomicRanges
, which is the object type used throughout Bioconductor to manipulate Genomic Intervals.
These object types permit us to perform common operations on intervals such as overlapping and counting. We can define the chromosome, start and end position of each region (also strand too, but not shown here).
library(GenomicRanges)
simple_range <- GRanges(seqnames = "1", ranges = IRanges(start=1000, end=2000))
simple_range
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 1 1000-2000 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We don't have to have all our ranges located on the same chromosome
chrs <- c("13", "15", "5")
start <- c(73000000, 6800000, 15000000)
end <- c(74000000, 6900000, 16000000)
my_ranges <- GRanges(seqnames = rep(chrs, 3),
ranges = IRanges(start = rep(start, each = 3),
end = rep(end, each = 3))
)
my_ranges
## GRanges object with 9 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 13 73000000-74000000 *
## [2] 15 73000000-74000000 *
## [3] 5 73000000-74000000 *
## [4] 13 6800000-6900000 *
## [5] 15 6800000-6900000 *
## [6] 5 6800000-6900000 *
## [7] 13 15000000-16000000 *
## [8] 15 15000000-16000000 *
## [9] 5 15000000-16000000 *
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
There are a number of useful functions for calculating properties of the data (such as coverage or sorting). Not so much for RNA-seq analysis, but GenomicRanges
are used throughout Bioconductor for the analysis of NGS data.
For instance, we can quickly identify overlapping regions between two GenomicRanges
.
keys <- c("ENSMUSG00000021604", "ENSMUSG00000022146", "ENSMUSG00000040118")
genePos <- select(txMm,
keys = keys,
keytype = "GENEID",
columns = c("EXONCHROM", "EXONSTART", "EXONEND")
)
## 'select()' returned 1:many mapping between keys and columns
geneRanges <- GRanges(genePos$EXONCHROM,
ranges = IRanges(genePos$EXONSTART, genePos$EXONEND),
GENEID = genePos$GENEID)
geneRanges
## GRanges object with 96 ranges and 1 metadata column:
## seqnames ranges strand | GENEID
## <Rle> <IRanges> <Rle> | <character>
## [1] 13 73260479-73260653 * | ENSMUSG00000021604
## [2] 13 73264848-73264979 * | ENSMUSG00000021604
## [3] 13 73265458-73265709 * | ENSMUSG00000021604
## [4] 13 73266596-73266708 * | ENSMUSG00000021604
## [5] 13 73267504-73267832 * | ENSMUSG00000021604
## ... ... ... ... . ...
## [92] 5 16327973-16329883 * | ENSMUSG00000040118
## [93] 5 16326151-16326383 * | ENSMUSG00000040118
## [94] 5 16340707-16341059 * | ENSMUSG00000040118
## [95] 5 16361395-16361875 * | ENSMUSG00000040118
## [96] 5 16362265-16362326 * | ENSMUSG00000040118
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
findOverlaps(my_ranges, geneRanges)
## Hits object with 40 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 1
## [2] 1 2
## [3] 1 3
## [4] 1 4
## [5] 1 5
## ... ... ...
## [36] 9 36
## [37] 9 75
## [38] 9 84
## [39] 9 85
## [40] 9 87
## -------
## queryLength: 9 / subjectLength: 96
However, we have to pay attention to the naming convention used for each object. seqlevelsStyle
can help.
seqlevelsStyle(simple_range)
## [1] "NCBI" "Ensembl" "MSU6" "AGPvF"
seqlevelsStyle(my_ranges)
## [1] "NCBI" "Ensembl" "JGI2.F"
seqlevelsStyle(geneRanges)
## [1] "NCBI" "Ensembl" "JGI2.F"
It is also possible to save the results of a Bioconductor analysis in a browser to enable interactive analysis and integration with other data types, or sharing with collaborators. For instance, we might want a browser track to indicate where our differentially-expressed genes are located. We shall use the bed
format to display these locations. We will annotate the ranges with information from our analysis such as the fold-change and significance.
First we create a data frame for just the DE genes.
sigGenes <- filter(shrinkLvV, FDR <= 0.01)
message("Number of significantly DE genes: ", nrow(sigGenes))
## Number of significantly DE genes: 4818
head(sigGenes)
## GeneID baseMean logFC lfcSE stat pvalue
## 1 ENSMUSG00000103922 143.63201 1.3753350 0.4058885 3.497362 4.698831e-04
## 2 ENSMUSG00000025903 724.71653 0.6631350 0.1440313 4.614104 3.947951e-06
## 3 ENSMUSG00000103280 11.02652 -1.5758478 0.4360944 -3.519573 4.322416e-04
## 4 ENSMUSG00000033793 1271.51193 0.8946715 0.1049078 8.538601 1.358561e-17
## 5 ENSMUSG00000051285 1677.50683 1.3357189 0.1707512 7.797476 6.315776e-15
## 6 ENSMUSG00000103509 25.92348 1.1964589 0.3003958 3.876065 1.061592e-04
## FDR Entrez Symbol
## 1 2.724966e-03 NA Gm6123
## 2 4.366377e-05 18777 Lypla1
## 3 2.540353e-03 NA Gm37277
## 4 1.493302e-15 108664 Atp6v1h
## 5 4.527503e-13 319263 Pcmtd1
## 6 7.645888e-04 NA Gm38372
## Description
## 1 predicted gene 6123 [Source:MGI Symbol;Acc:MGI:3647047]
## 2 lysophospholipase 1 [Source:MGI Symbol;Acc:MGI:1344588]
## 3 predicted gene, 37277 [Source:MGI Symbol;Acc:MGI:5610505]
## 4 ATPase, H+ transporting, lysosomal V1 subunit H [Source:MGI Symbol;Acc:MGI:1914864]
## 5 protein-L-isoaspartate (D-aspartate) O-methyltransferase domain containing 1 [Source:MGI Symbol;Acc:MGI:2441773]
## 6 predicted gene, 38372 [Source:MGI Symbol;Acc:MGI:5611600]
## Biotype Chr Start End Strand medianTxLength TopGeneLabel
## 1 processed_pseudogene 1 4771131 4772199 1 1069.0
## 2 protein_coding 1 4807788 4848410 1 903.5
## 3 TEC 1 4905751 4906861 -1 1111.0
## 4 protein_coding 1 5070018 5162529 1 1930.0
## 5 protein_coding 1 7088920 7173628 1 1906.0
## 6 TEC 1 7148110 7152137 1 4028.0
Several convenience functions exist to retrieve the structure of every gene from a given TxDb object in one list. The output of exonsBy
is a list, where each item in the list is the exon co-ordinates of a particular gene, however, we do not need this level of granularity for the bed output, so we will collapse to a single region for each gene.
First we use the range
function to obtain a single range for every gene and tranform to a more convenient object with unlist
.
exoRanges <- exonsBy(txMm, "gene") %>%
range() %>%
unlist()
sigRegions <- exoRanges[na.omit(match(sigGenes$GeneID, names(exoRanges)))]
sigRegions
## GRanges object with 4812 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## ENSMUSG00000103922 1 4771131-4772199 +
## ENSMUSG00000025903 1 4807788-4848410 +
## ENSMUSG00000103280 1 4905751-4906861 -
## ENSMUSG00000033793 1 5070018-5162529 +
## ENSMUSG00000051285 1 7088920-7173628 +
## ... ... ... ...
## ENSMUSG00000065947 MT 9877-10173 +
## ENSMUSG00000064363 MT 10167-11544 +
## ENSMUSG00000064367 MT 11742-13565 +
## ENSMUSG00000064368 MT 13552-14070 -
## ENSMUSG00000095456 JH584293.1 10948-16486 +
## -------
## seqinfo: 139 sequences (1 circular) from an unspecified genome
For visualisation purposes, we are going to restrict the data to genes that are located on chromosomes 1 to 19 and the sex chromosomes. This can be done with the keepSeqLevels
function.
seqlevels(sigRegions)
## [1] "CHR_CAST_EI_MMCHR11_CTG4" "CHR_CAST_EI_MMCHR11_CTG5"
## [3] "CHR_MG104_PATCH" "CHR_MG117_PATCH"
## [5] "CHR_MG132_PATCH" "CHR_MG153_PATCH"
## [7] "CHR_MG171_PATCH" "CHR_MG184_PATCH"
## [9] "CHR_MG190_MG3751_PATCH" "CHR_MG191_PATCH"
## [11] "CHR_MG209_PATCH" "CHR_MG3172_PATCH"
## [13] "CHR_MG3231_PATCH" "CHR_MG3251_PATCH"
## [15] "CHR_MG3490_PATCH" "CHR_MG3496_PATCH"
## [17] "CHR_MG3530_PATCH" "CHR_MG3561_PATCH"
## [19] "CHR_MG3562_PATCH" "CHR_MG3609_PATCH"
## [21] "CHR_MG3618_PATCH" "CHR_MG3627_PATCH"
## [23] "CHR_MG3648_PATCH" "CHR_MG3656_PATCH"
## [25] "CHR_MG3683_PATCH" "CHR_MG3686_PATCH"
## [27] "CHR_MG3699_PATCH" "CHR_MG3700_PATCH"
## [29] "CHR_MG3712_PATCH" "CHR_MG3714_PATCH"
## [31] "CHR_MG3829_PATCH" "CHR_MG3833_MG4220_PATCH"
## [33] "CHR_MG3835_PATCH" "CHR_MG3836_PATCH"
## [35] "CHR_MG3999_PATCH" "CHR_MG4136_PATCH"
## [37] "CHR_MG4138_PATCH" "CHR_MG4151_PATCH"
## [39] "CHR_MG4162_PATCH" "CHR_MG4180_PATCH"
## [41] "CHR_MG4198_PATCH" "CHR_MG4200_PATCH"
## [43] "CHR_MG4209_PATCH" "CHR_MG4211_PATCH"
## [45] "CHR_MG4212_PATCH" "CHR_MG4213_PATCH"
## [47] "CHR_MG4214_PATCH" "CHR_MG4222_MG3908_PATCH"
## [49] "CHR_MG4243_PATCH" "CHR_MG4248_PATCH"
## [51] "CHR_MG4249_PATCH" "CHR_MG4254_PATCH"
## [53] "CHR_MG4255_PATCH" "CHR_MG4259_PATCH"
## [55] "CHR_MG4261_PATCH" "CHR_MG4264_PATCH"
## [57] "CHR_MG4265_PATCH" "CHR_MG4266_PATCH"
## [59] "CHR_MG4281_PATCH" "CHR_MG4288_PATCH"
## [61] "CHR_MG4308_PATCH" "CHR_MG4310_MG4311_PATCH"
## [63] "CHR_MG51_PATCH" "CHR_MG65_PATCH"
## [65] "CHR_MG74_PATCH" "CHR_MG89_PATCH"
## [67] "CHR_MMCHR1_CHORI29_IDD5_1" "CHR_PWK_PHJ_MMCHR11_CTG1"
## [69] "CHR_PWK_PHJ_MMCHR11_CTG2" "CHR_PWK_PHJ_MMCHR11_CTG3"
## [71] "CHR_WSB_EIJ_MMCHR11_CTG1" "CHR_WSB_EIJ_MMCHR11_CTG2"
## [73] "CHR_WSB_EIJ_MMCHR11_CTG3" "1"
## [75] "2" "3"
## [77] "4" "5"
## [79] "6" "7"
## [81] "8" "9"
## [83] "10" "11"
## [85] "12" "13"
## [87] "14" "15"
## [89] "16" "17"
## [91] "18" "19"
## [93] "X" "Y"
## [95] "MT" "GL456210.1"
## [97] "GL456211.1" "GL456212.1"
## [99] "GL456213.1" "GL456216.1"
## [101] "GL456219.1" "GL456221.1"
## [103] "GL456233.1" "GL456239.1"
## [105] "GL456350.1" "GL456354.1"
## [107] "GL456359.1" "GL456360.1"
## [109] "GL456366.1" "GL456367.1"
## [111] "GL456368.1" "GL456370.1"
## [113] "GL456372.1" "GL456378.1"
## [115] "GL456379.1" "GL456381.1"
## [117] "GL456382.1" "GL456383.1"
## [119] "GL456385.1" "GL456387.1"
## [121] "GL456389.1" "GL456390.1"
## [123] "GL456392.1" "GL456393.1"
## [125] "GL456394.1" "GL456396.1"
## [127] "JH584292.1" "JH584293.1"
## [129] "JH584294.1" "JH584295.1"
## [131] "JH584296.1" "JH584297.1"
## [133] "JH584298.1" "JH584299.1"
## [135] "JH584300.1" "JH584301.1"
## [137] "JH584302.1" "JH584303.1"
## [139] "JH584304.1"
sigRegions <- keepSeqlevels(sigRegions,
value = c(1:19,"X","Y"),
pruning.mode="tidy")
seqlevels(sigRegions)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "X" "Y"
A useful propery of GenomicRanges is that we can attach metadata to each range using the mcols
function. The metadata can be supplied in the form of a data frame.
mcols(sigRegions) <- sigGenes[match(names(sigRegions), sigGenes$GeneID), ]
sigRegions
## GRanges object with 4802 ranges and 17 metadata columns:
## seqnames ranges strand | GeneID
## <Rle> <IRanges> <Rle> | <character>
## ENSMUSG00000103922 1 4771131-4772199 + | ENSMUSG00000103922
## ENSMUSG00000025903 1 4807788-4848410 + | ENSMUSG00000025903
## ENSMUSG00000103280 1 4905751-4906861 - | ENSMUSG00000103280
## ENSMUSG00000033793 1 5070018-5162529 + | ENSMUSG00000033793
## ENSMUSG00000051285 1 7088920-7173628 + | ENSMUSG00000051285
## ... ... ... ... . ...
## ENSMUSG00000067038 19 59322371-59322766 + | ENSMUSG00000067038
## ENSMUSG00000040022 19 59902884-59943654 - | ENSMUSG00000040022
## ENSMUSG00000024993 19 60811585-60836227 + | ENSMUSG00000024993
## ENSMUSG00000024997 19 60864051-60874556 - | ENSMUSG00000024997
## ENSMUSG00000074733 19 61053840-61140840 - | ENSMUSG00000074733
## baseMean logFC lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000103922 143.6320 1.375335 0.405888 3.49736 4.69883e-04
## ENSMUSG00000025903 724.7165 0.663135 0.144031 4.61410 3.94795e-06
## ENSMUSG00000103280 11.0265 -1.575848 0.436094 -3.51957 4.32242e-04
## ENSMUSG00000033793 1271.5119 0.894671 0.104908 8.53860 1.35856e-17
## ENSMUSG00000051285 1677.5068 1.335719 0.170751 7.79748 6.31578e-15
## ... ... ... ... ... ...
## ENSMUSG00000067038 235.294 -0.776979 0.150176 -5.14540 2.66951e-07
## ENSMUSG00000040022 421.870 1.063631 0.177581 5.95704 2.56843e-09
## ENSMUSG00000024993 275.394 0.586893 0.162888 3.56435 3.64763e-04
## ENSMUSG00000024997 1154.906 0.912511 0.184537 4.97152 6.64317e-07
## ENSMUSG00000074733 178.169 0.875439 0.168880 5.11262 3.17720e-07
## FDR Entrez Symbol
## <numeric> <integer> <character>
## ENSMUSG00000103922 2.72497e-03 <NA> Gm6123
## ENSMUSG00000025903 4.36638e-05 18777 Lypla1
## ENSMUSG00000103280 2.54035e-03 <NA> Gm37277
## ENSMUSG00000033793 1.49330e-15 108664 Atp6v1h
## ENSMUSG00000051285 4.52750e-13 319263 Pcmtd1
## ... ... ... ...
## ENSMUSG00000067038 4.03229e-06 <NA> Rps12-ps3
## ENSMUSG00000040022 6.08307e-08 74998 Rab11fip2
## ENSMUSG00000024993 2.20742e-03 67894 Fam45a
## ENSMUSG00000024997 8.96506e-06 11757 Prdx3
## ENSMUSG00000074733 4.69009e-06 414758 Zfp950
## Description
## <character>
## ENSMUSG00000103922 predicted gene 6123 [Source:MGI Symbol;Acc:MGI:3647047]
## ENSMUSG00000025903 lysophospholipase 1 [Source:MGI Symbol;Acc:MGI:1344588]
## ENSMUSG00000103280 predicted gene, 37277 [Source:MGI Symbol;Acc:MGI:5610505]
## ENSMUSG00000033793 ATPase, H+ transporting, lysosomal V1 subunit H [Source:MGI Symbol;Acc:MGI:1914864]
## ENSMUSG00000051285 protein-L-isoaspartate (D-aspartate) O-methyltransferase domain containing 1 [Source:MGI Symbol;Acc:MGI:2441773]
## ... ...
## ENSMUSG00000067038 ribosomal protein S12, pseudogene 3 [Source:MGI Symbol;Acc:MGI:3704503]
## ENSMUSG00000040022 RAB11 family interacting protein 2 (class I) [Source:MGI Symbol;Acc:MGI:1922248]
## ENSMUSG00000024993 family with sequence similarity 45, member A [Source:MGI Symbol;Acc:MGI:1915144]
## ENSMUSG00000024997 peroxiredoxin 3 [Source:MGI Symbol;Acc:MGI:88034]
## ENSMUSG00000074733 zinc finger protein 950 [Source:MGI Symbol;Acc:MGI:2652824]
## Biotype Chr Start End
## <character> <character> <integer> <integer>
## ENSMUSG00000103922 processed_pseudogene 1 4771131 4772199
## ENSMUSG00000025903 protein_coding 1 4807788 4848410
## ENSMUSG00000103280 TEC 1 4905751 4906861
## ENSMUSG00000033793 protein_coding 1 5070018 5162529
## ENSMUSG00000051285 protein_coding 1 7088920 7173628
## ... ... ... ... ...
## ENSMUSG00000067038 protein_coding 19 59322290 59322783
## ENSMUSG00000040022 protein_coding 19 59902884 59943654
## ENSMUSG00000024993 protein_coding 19 60811585 60836227
## ENSMUSG00000024997 protein_coding 19 60864051 60874556
## ENSMUSG00000074733 protein_coding 19 61053840 61140840
## Strand medianTxLength TopGeneLabel
## <integer> <numeric> <character>
## ENSMUSG00000103922 1 1069.0
## ENSMUSG00000025903 1 903.5
## ENSMUSG00000103280 -1 1111.0
## ENSMUSG00000033793 1 1930.0
## ENSMUSG00000051285 1 1906.0
## ... ... ... ...
## ENSMUSG00000067038 1 494.0
## ENSMUSG00000040022 -1 4482.0
## ENSMUSG00000024993 1 826.0
## ENSMUSG00000024997 -1 1478.0
## ENSMUSG00000074733 -1 758.5
## -------
## seqinfo: 21 sequences from an unspecified genome
The .bed
file format is commonly used to store genomic locations for display in genome browsers (e.g. the UCSC browser or IGV) as tracks. Rather than just representing the genomic locations, the .bed
format is also able to colour each range according to some property of the analysis (e.g. direction and magnitude of change) to help highlight particular regions of interest. A score can also be displayed when a particular region is clicked-on.
For the score we can use the \(-log_{10}\) of the adjusted p-value and colour scheme for the regions based on the fold-change
colorRampPalette
is a useful function in base R for constructing a palette between two extremes. When choosing colour palettes, make sure they are colour blind friendly. The red / green colour scheme traditionally-applied to microarrays is a bad choice.
We will also truncate the fold-changes to between -5 and 5 to and divide this range into 10 equal bins
rbPal <- colorRampPalette(c("red", "blue"))
logFC <- pmax(sigRegions$logFC, -5)
logFC <- pmin(logFC , 5)
Cols <- rbPal(10)[as.numeric(cut(logFC, breaks = 10))]
The colours and score have to be saved in the GRanges object as score
and itemRgb
columns respectively, and will be used to construct the browser track. The rtracklayer package can be used to import and export browsers tracks.
Now we can export the signifcant results from the DE analysis as a .bed
track using rtracklayer
. You can load the resulting file in IGV, if you wish.
mcols(sigRegions)$score <- -log10(sigRegions$FDR)
mcols(sigRegions)$itemRgb <- Cols
sigRegions
## GRanges object with 4802 ranges and 19 metadata columns:
## seqnames ranges strand | GeneID
## <Rle> <IRanges> <Rle> | <character>
## ENSMUSG00000103922 1 4771131-4772199 + | ENSMUSG00000103922
## ENSMUSG00000025903 1 4807788-4848410 + | ENSMUSG00000025903
## ENSMUSG00000103280 1 4905751-4906861 - | ENSMUSG00000103280
## ENSMUSG00000033793 1 5070018-5162529 + | ENSMUSG00000033793
## ENSMUSG00000051285 1 7088920-7173628 + | ENSMUSG00000051285
## ... ... ... ... . ...
## ENSMUSG00000067038 19 59322371-59322766 + | ENSMUSG00000067038
## ENSMUSG00000040022 19 59902884-59943654 - | ENSMUSG00000040022
## ENSMUSG00000024993 19 60811585-60836227 + | ENSMUSG00000024993
## ENSMUSG00000024997 19 60864051-60874556 - | ENSMUSG00000024997
## ENSMUSG00000074733 19 61053840-61140840 - | ENSMUSG00000074733
## baseMean logFC lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000103922 143.6320 1.375335 0.405888 3.49736 4.69883e-04
## ENSMUSG00000025903 724.7165 0.663135 0.144031 4.61410 3.94795e-06
## ENSMUSG00000103280 11.0265 -1.575848 0.436094 -3.51957 4.32242e-04
## ENSMUSG00000033793 1271.5119 0.894671 0.104908 8.53860 1.35856e-17
## ENSMUSG00000051285 1677.5068 1.335719 0.170751 7.79748 6.31578e-15
## ... ... ... ... ... ...
## ENSMUSG00000067038 235.294 -0.776979 0.150176 -5.14540 2.66951e-07
## ENSMUSG00000040022 421.870 1.063631 0.177581 5.95704 2.56843e-09
## ENSMUSG00000024993 275.394 0.586893 0.162888 3.56435 3.64763e-04
## ENSMUSG00000024997 1154.906 0.912511 0.184537 4.97152 6.64317e-07
## ENSMUSG00000074733 178.169 0.875439 0.168880 5.11262 3.17720e-07
## FDR Entrez Symbol
## <numeric> <integer> <character>
## ENSMUSG00000103922 2.72497e-03 <NA> Gm6123
## ENSMUSG00000025903 4.36638e-05 18777 Lypla1
## ENSMUSG00000103280 2.54035e-03 <NA> Gm37277
## ENSMUSG00000033793 1.49330e-15 108664 Atp6v1h
## ENSMUSG00000051285 4.52750e-13 319263 Pcmtd1
## ... ... ... ...
## ENSMUSG00000067038 4.03229e-06 <NA> Rps12-ps3
## ENSMUSG00000040022 6.08307e-08 74998 Rab11fip2
## ENSMUSG00000024993 2.20742e-03 67894 Fam45a
## ENSMUSG00000024997 8.96506e-06 11757 Prdx3
## ENSMUSG00000074733 4.69009e-06 414758 Zfp950
## Description
## <character>
## ENSMUSG00000103922 predicted gene 6123 [Source:MGI Symbol;Acc:MGI:3647047]
## ENSMUSG00000025903 lysophospholipase 1 [Source:MGI Symbol;Acc:MGI:1344588]
## ENSMUSG00000103280 predicted gene, 37277 [Source:MGI Symbol;Acc:MGI:5610505]
## ENSMUSG00000033793 ATPase, H+ transporting, lysosomal V1 subunit H [Source:MGI Symbol;Acc:MGI:1914864]
## ENSMUSG00000051285 protein-L-isoaspartate (D-aspartate) O-methyltransferase domain containing 1 [Source:MGI Symbol;Acc:MGI:2441773]
## ... ...
## ENSMUSG00000067038 ribosomal protein S12, pseudogene 3 [Source:MGI Symbol;Acc:MGI:3704503]
## ENSMUSG00000040022 RAB11 family interacting protein 2 (class I) [Source:MGI Symbol;Acc:MGI:1922248]
## ENSMUSG00000024993 family with sequence similarity 45, member A [Source:MGI Symbol;Acc:MGI:1915144]
## ENSMUSG00000024997 peroxiredoxin 3 [Source:MGI Symbol;Acc:MGI:88034]
## ENSMUSG00000074733 zinc finger protein 950 [Source:MGI Symbol;Acc:MGI:2652824]
## Biotype Chr Start End
## <character> <character> <integer> <integer>
## ENSMUSG00000103922 processed_pseudogene 1 4771131 4772199
## ENSMUSG00000025903 protein_coding 1 4807788 4848410
## ENSMUSG00000103280 TEC 1 4905751 4906861
## ENSMUSG00000033793 protein_coding 1 5070018 5162529
## ENSMUSG00000051285 protein_coding 1 7088920 7173628
## ... ... ... ... ...
## ENSMUSG00000067038 protein_coding 19 59322290 59322783
## ENSMUSG00000040022 protein_coding 19 59902884 59943654
## ENSMUSG00000024993 protein_coding 19 60811585 60836227
## ENSMUSG00000024997 protein_coding 19 60864051 60874556
## ENSMUSG00000074733 protein_coding 19 61053840 61140840
## Strand medianTxLength TopGeneLabel score
## <integer> <numeric> <character> <numeric>
## ENSMUSG00000103922 1 1069.0 2.56464
## ENSMUSG00000025903 1 903.5 4.35988
## ENSMUSG00000103280 -1 1111.0 2.59511
## ENSMUSG00000033793 1 1930.0 14.82585
## ENSMUSG00000051285 1 1906.0 12.34414
## ... ... ... ... ...
## ENSMUSG00000067038 1 494.0 5.39445
## ENSMUSG00000040022 -1 4482.0 7.21588
## ENSMUSG00000024993 1 826.0 2.65612
## ENSMUSG00000024997 -1 1478.0 5.04745
## ENSMUSG00000074733 -1 758.5 5.32882
## itemRgb
## <character>
## ENSMUSG00000103922 #5500AA
## ENSMUSG00000025903 #71008D
## ENSMUSG00000103280 #AA0055
## ENSMUSG00000033793 #71008D
## ENSMUSG00000051285 #5500AA
## ... ...
## ENSMUSG00000067038 #8D0071
## ENSMUSG00000040022 #5500AA
## ENSMUSG00000024993 #71008D
## ENSMUSG00000024997 #71008D
## ENSMUSG00000074733 #71008D
## -------
## seqinfo: 21 sequences from an unspecified genome
library(rtracklayer)
export(sigRegions , con = "~/Course_Materials/RNAseq/results/topHits.bed")
As we have been using counts as our starting point, we haven't investigated the aligned reads from our experiment, and how they are represented. As you may be aware, aligned reads are usually stored in a bam file that can be manipulated with open-source command-line tools such as samtools and picard. Bioconductor provide a low-level interface to data/bam/sam files in the form of the Rsamtools
package. The GenomicAlignments
package can also be used to retrieve the reads mapping to a particular genomic region in an efficient manner.
library(GenomicAlignments)
In the directory small_bams
there should be .bam
files for some of the samples in the example study. The workflow to produce these files is described in a supplmentary page for the course. In brief, the raw reads (fastq
) were downloaded from the Short Read Archive (SRA) and aligned with hisat2
. Each bam file was named according to the file name in SRA, but we have renamed the files according to their name in the study. An index file (.bai
) has been generated for each bam file. In order to reduce the size, the bam files used here only contain a subset of the reads that were aligned in the region chr15:101707000-101713000.
list.files("~/Course_Materials/RNAseq/small_bams/")
## [1] "MCL1.DG.15.sm.bam" "MCL1.DG.15.sm.bam.bai"
## [3] "MCL1.DH.15.sm.bam" "MCL1.DH.15.sm.bam.bai"
## [5] "MCL1.DI.15.sm.bam" "MCL1.DI.15.sm.bam.bai"
## [7] "MCL1.DJ.15.sm.bam" "MCL1.DJ.15.sm.bam.bai"
## [9] "MCL1.DK.15.sm.bam" "MCL1.DK.15.sm.bam.bai"
## [11] "MCL1.DL.15.sm.bam" "MCL1.DL.15.sm.bam.bai"
## [13] "MCL1.LA.15.sm.bam" "MCL1.LA.15.sm.bam.bai"
## [15] "MCL1.LB.15.sm.bam" "MCL1.LB.15.sm.bam.bai"
## [17] "MCL1.LC.15.sm.bam" "MCL1.LC.15.sm.bam.bai"
## [19] "MCL1.LD.15.sm.bam" "MCL1.LD.15.sm.bam.bai"
## [21] "MCL1.LE.15.sm.bam" "MCL1.LE.15.sm.bam.bai"
## [23] "MCL1.LF.15.sm.bam" "MCL1.LF.15.sm.bam.bai"
## [25] "Mus_musculus.GRCm38.97.chr15.gtf"
The readGAlignments
function provides a simple interface to interrogate the aligned reads for a particular sample. It can also utilise the index file in order to retrieve only the reads that correspond to a specific region in an efficient manner. The output includes the genomic location of each aligned read and the CIGAR (Compact Idiosyncratic Gapped Alignment Report); where M denotes an match to the genome and I, D correspond to insertions and deletions.
exo <- exonsBy(txMm, "gene")
generegion <- exo[["ENSMUSG00000022146"]] %>%
keepSeqlevels(value = 15, pruning.mode="tidy")
my.reads <- readGAlignments(file="~/Course_Materials/RNAseq/small_bams/MCL1.DG.15.sm.bam",
param=ScanBamParam(which=generegion))
my.reads
## GAlignments object with 25419 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] 15 + 81M53311N11M8S 100 6799340 6852742
## [2] 15 + 100M 100 6813575 6813674
## [3] 15 + 3S97M 100 6813579 6813675
## [4] 15 + 6S94M 100 6813579 6813672
## [5] 15 + 100M 100 6813580 6813679
## ... ... ... ... ... ... ...
## [25415] 15 - 100M 100 6874937 6875036
## [25416] 15 - 100M 100 6874941 6875040
## [25417] 15 - 99M1S 100 6874945 6875043
## [25418] 15 + 100M 100 6874962 6875061
## [25419] 15 - 100M 100 6874966 6875065
## width njunc
## <integer> <integer>
## [1] 53403 1
## [2] 100 0
## [3] 97 0
## [4] 94 0
## [5] 100 0
## ... ... ...
## [25415] 100 0
## [25416] 100 0
## [25417] 99 0
## [25418] 100 0
## [25419] 100 0
## -------
## seqinfo: 66 sequences from an unspecified genome
It is possible to tweak the function to retrieve other potentially-useful information from the bam file, such as the mapping quality and flag.
my.reads <- readGAlignments(file="~/Course_Materials/RNAseq/small_bams/MCL1.DG.15.sm.bam",
param=ScanBamParam(which=generegion,
what=c("seq","mapq","flag")))
my.reads
## GAlignments object with 25419 alignments and 3 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] 15 + 81M53311N11M8S 100 6799340 6852742
## [2] 15 + 100M 100 6813575 6813674
## [3] 15 + 3S97M 100 6813579 6813675
## [4] 15 + 6S94M 100 6813579 6813672
## [5] 15 + 100M 100 6813580 6813679
## ... ... ... ... ... ... ...
## [25415] 15 - 100M 100 6874937 6875036
## [25416] 15 - 100M 100 6874941 6875040
## [25417] 15 - 99M1S 100 6874945 6875043
## [25418] 15 + 100M 100 6874962 6875061
## [25419] 15 - 100M 100 6874966 6875065
## width njunc | seq mapq flag
## <integer> <integer> | <DNAStringSet> <integer> <integer>
## [1] 53403 1 | GTTTGGAAGT...TCTCCTAAAC 60 0
## [2] 100 0 | GAAATGTTTT...ATCAATGTCA 60 0
## [3] 97 0 | TTTTGTTTTA...TCAATGTCAT 60 0
## [4] 94 0 | TTTTTTTGTT...AAATCAATGT 60 0
## [5] 100 0 | GTTTTAATTT...TGTCATTAAC 60 0
## ... ... ... . ... ... ...
## [25415] 100 0 | TCTCTTTATG...TTCCCACCAG 60 16
## [25416] 100 0 | TTTATGGCTG...CACCAGTCGC 60 16
## [25417] 99 0 | TGGCTGCATG...AGTCGCCAGA 60 16
## [25418] 100 0 | GTCCACAGCC...GCCTGGAGAA 60 0
## [25419] 100 0 | ACAGCCACGT...GGAGAACCGC 60 16
## -------
## seqinfo: 66 sequences from an unspecified genome
The flag can represent useful QC information. e.g.
The combination of any of these properties is used to derive a numeric value, as illustrated in this useful resource
Particular attributes of the reads can be extracted and visualised
hist(mcols(my.reads)$mapq, main="", xlab="MAPQ")
However, there are more-sophisticated visualisation options for aligned reads and range data. We will use the ggbio
package, which first requires some discussion of the ggplot2
plotting package.
We will now take a brief look at one of the visualisation packages in Bioconductor that takes advantage of the GenomicRanges and GenomicFeatures object-types. In this section we will show a worked example of how to combine several types of genomic data on the same plot. The documentation for ggbio is very extensive and contains lots of examples.
http://www.tengfei.name/ggbio/docs/
The Gviz
package is another Bioconductor package that specialising in genomic visualisations, but we will not explore this package in the course.
The Manhattan plot is a common way of visualising genome-wide results, especially when one is concerned with the results of a GWAS study and identifying strongly-associated hits.
The profile is supposed to resemble the Manhattan skyline with particular skyscrapers towering about the lower level buildings.
This type of plot is implemented as the plotGrandLinear
function. We have to supply a value to display on the y-axis using the aes
function, which is inherited from ggplot2. The positioning of points on the x-axis is handled automatically by ggbio, using the ranges information to get the genomic coordinates of the ranges of interest.
To stop the plots from being too cluttered we will consider the top 200 genes only.
library(ggbio)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## Need specific help about ggbio? try mailing
## the maintainer or visit http://tengfei.github.com/ggbio/
##
## Attaching package: 'ggbio'
## The following objects are masked from 'package:ggplot2':
##
## geom_bar, geom_rect, geom_segment, ggsave, stat_bin, stat_identity,
## xlim
top200 <- sigRegions[order(sigRegions$FDR)[1:200]]
plotGrandLinear(top200 , aes(y = logFC))
## using coord:genome to parse x scale
ggbio
has alternated the colours of the chromosomes. However, an appealing feature of ggplot2
is the ability to map properties of your plot to variables present in your data. For example, we could create a variable to distinguish between up- and down-regulated genes. The variables used for aesthetic mapping must be present in the mcols
section of your ranges object.
mcols(top200)$UpRegulated <- mcols(top200)$logFC > 0
plotGrandLinear(top200, aes(y = logFC, col = UpRegulated))
## using coord:genome to parse x scale
plotGrandLinear
is a special function in ggbio
with preset options for the manhattan style of plot. More often, users will call the autoplot
function and ggbio
will choose the most appropriate layout. One such layout is the karyogram.
autoplot(top200, layout="karyogram", aes(color=UpRegulated, fill=UpRegulated))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
ggbio
is also able to plot the structure of genes according to a particular model represented by a GenomicFeatures
object, such as the object we created earlier with the exon coordinates for each gene in the GRCm38 genome.
autoplot(txMm, which=exo[["ENSMUSG00000022146"]])
## Warning: `quo_expr()` is deprecated as of rlang 0.2.0.
## Please use `quo_squash()` instead.
## This warning is displayed once per session.
We can even plot the location of sequencing reads if they have been imported using readGAlignments function (or similar).
myreg <- exo[["ENSMUSG00000022146"]] %>%
GenomicRanges::reduce() %>%
flank(width = 1000, both = T) %>%
keepSeqlevels(value = 15, pruning.mode="tidy")
bam <- readGappedReads(file="~/Course_Materials/RNAseq/small_bams/MCL1.DG.15.sm.bam",
param=ScanBamParam(which=myreg), use.names = TRUE)
autoplot(bam, geom = "rect") +
xlim(GRanges("15", IRanges(6800000, 6900000)))
## extracting information...
Like ggplot2, ggbio plots can be saved as objects that can later be modified, or combined together to form more complicated plots. If saved in this way, the plot will only be displayed on a plotting device when we query the object. This strategy is useful when we want to add a common element (such as an ideogram) to a plot composition and don’t want to repeat the code to generate the plot every time.
geneMod <- autoplot(txMm, which = myreg) +
xlim(GRanges("15", IRanges(6810000, 6880000)))
reads.MCL1.DG <- autoplot(bam, stat = "coverage") +
xlim(GRanges("15", IRanges(6810000, 6880000))) +
labs(title="MCL1.DG")
tracks(GRCm38=geneMod, MCL1.DG=reads.MCL1.DG)