.bam
file into R.bam
file using BioconductorWe will now look at how we can represent and access genomic intervals in R and Bioconductor which fit best within the framework of the course. There are also tools outside of R that are extremely powerful; such as bedtools which are worth considering if you want to go further with NGS analysis.
We can import reads from a .bam
file into Bioconductor. However, to set expectations we are not going to be processing the entire set of reads from a whole-genome sequencing run in this manner. This can be a useful way of diving-into a particular region of interest and exploring the data.
A package that can be used to parse a .bam
file is GenomicAlignments
. You should notice that although the .bam
file is not particularly big (~ 12.5 Million reads), but already takes a little while to read.
library(GenomicAlignments)
mybam <- readGAlignments("/data/test/paired.bam")
mybam
GAlignments object with 12581680 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end width njunc
<Rle> <Rle> <character> <integer> <integer> <integer> <integer> <integer>
[1] 1 + 1S67M 68 10003 10069 67 0
[2] 1 + 68M 68 10036 10103 68 0
[3] 1 + 68M 68 10036 10103 68 0
[4] 1 + 68M 68 10041 10108 68 0
[5] 1 - 68M 68 10041 10108 68 0
... ... ... ... ... ... ... ... ...
[12581676] hs37d5 - 68M 68 35465836 35465903 68 0
[12581677] hs37d5 + 68M 68 35466096 35466163 68 0
[12581678] hs37d5 - 47M1D21M 68 35466133 35466201 69 0
[12581679] hs37d5 + 68M 68 35466270 35466337 68 0
[12581680] hs37d5 - 68M 68 35466357 35466424 68 0
-------
seqinfo: 86 sequences from an unspecified genome
readGAlignments has provided us with an object that can be manipulated using the standard vector conventions.
mybam[1:10]
GAlignments object with 10 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end width njunc
<Rle> <Rle> <character> <integer> <integer> <integer> <integer> <integer>
[1] 1 + 1S67M 68 10003 10069 67 0
[2] 1 + 68M 68 10036 10103 68 0
[3] 1 + 68M 68 10036 10103 68 0
[4] 1 + 68M 68 10041 10108 68 0
[5] 1 - 68M 68 10041 10108 68 0
[6] 1 + 68M 68 10041 10108 68 0
[7] 1 - 68M 68 10046 10113 68 0
[8] 1 + 6M1I61M 68 10103 10169 67 0
[9] 1 - 38M1D3M1D27M 68 10108 10177 70 0
[10] 1 - 36M1D32M 68 10110 10178 69 0
-------
seqinfo: 86 sequences from an unspecified genome
There are also a number of accessor functions that can get particular items from the object; cigar
to obtain the CIGAR strings, start
/ end
to get the start and end positions, width
to get the width of each read.
## space to try out some of the accessor functions
The object we have created, bam
, contains only a small amount of the information available in a .bam
file. If we wish we can import extra fields such as the read sequence, mapping quality and flag:-
bam <- readGAlignments("/data/test/paired.bam",param=ScanBamParam(what=c("seq","mapq","flag","isize","qual")))
bam[1:3]
GAlignments object with 3 alignments and 5 metadata columns:
seqnames strand cigar qwidth start end width njunc | seq mapq flag
<Rle> <Rle> <character> <integer> <integer> <integer> <integer> <integer> | <DNAStringSet> <integer> <integer>
[1] 1 + 1S67M 68 10003 10069 67 0 | GACCCTGACC...CTAACCCTAA 6 163
[2] 1 + 68M 68 10036 10103 68 0 | CTAAGCCTAA...CCCGAACGCT 5 99
[3] 1 + 68M 68 10036 10103 68 0 | CTAACCCTAA...CCCTAACCCT 14 99
isize qual
<integer> <PhredQuality>
[1] 105 S=<====<<>...?=<=>?>?=Q
[2] 77 ##########...##########
[3] 141 S=;?<>>><>...=>@@=?>@?Q
-------
seqinfo: 86 sequences from an unspecified genome
The command takes longer to run, but we get more detail on each of the reads. The extra fields make up the metadata for each read and can be accessed using the mcols
function. If we save this metadata as an object, we can treat it as a data frame
and therefore have the usual $
operator to access the columns
meta <- mcols(bam)
meta
DataFrame with 12581680 rows and 5 columns
seq mapq flag isize qual
<DNAStringSet> <integer> <integer> <integer> <PhredQuality>
1 GACCCTGACC...CTAACCCTAA 6 163 105 S=<====<<>...?=<=>?>?=Q
2 CTAAGCCTAA...CCCGAACGCT 5 99 77 ##########...##########
3 CTAACCCTAA...CCCTAACCCT 14 99 141 S=;?<>>><>...=>@@=?>@?Q
4 CCTAACCCTA...ACCCTAACCC 15 163 137 S==:=;=<=;...;=<>1;>>@P
5 CCTAACCCTA...ACCCTAACCC 18 83 -105 ##########...:>>>9<:>?N
... ... ... ... ... ...
12581676 ATTCCATTCC...AGTGCAGATT 29 147 -183 P?1??=><?;...=<<====9<S
12581677 CACTTCATCA...CCATTCGATT 24 99 105 S=<>=><<>=...??>=?@8?=P
12581678 TTCGAATCTT...TACCATTCGA 13 147 -105 O@9==<=>??...;<>=;==7=S
12581679 ATTCGATTCC...CCATTTGATT 32 163 154 S<==6<<==>...>?>=?>?>>S
12581680 TCCGTTTGAT...ATTCCATTTG 31 83 -154 S@8>??>?=@...<>=?><>>=S
The sequences are stored in Biostrings
format
meta$seq
A DNAStringSet instance of length 12581680
width seq
[1] 68 GACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTAACCCTAA
[2] 68 CTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCAGCCTAACCCGAACGCT
[3] 68 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
[4] 68 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
[5] 68 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
... ... ...
[12581676] 68 ATTCCATTCCCTTCCTTTCTTTCTTCAGGATCTCACTCTGTCACCCAGTATGGAGTGAAGTGCAGATT
[12581677] 68 CACTTCATCACATTCCATTCGATTTCATTCTACACCATTCGAATCTTTCAATTTCATTCCATTCGATT
[12581678] 68 TTCGAATCTTTCAATTTCATTCCATTCGATTCCATTAGATTCCATTTTTTCCATTCGATACCATTCGA
[12581679] 68 ATTCGATTCCTTTTGATTCCATATGATTCAATTTGATTCAATTGGATTTGATTCAAAACCATTTGATT
[12581680] 68 TCCGTTTGATTCCATTCCATTCGATTCCGTTCCATTCCATTCCACTCCGTTCCATTTCATTCCATTTG
The flags should be valid values as explained online
table(meta$flag)
83 99 147 163 1107 1123 1171 1187
3074883 3127385 3127385 3074312 43835 45023 45023 43834
Finally the mapping qualities are numeric quantities that will vary according to aligner. Mapping qualities of 0 are usually reserved for reads that map to multiple locations. Many calling algorithms will employ a filter on mapping quality; with values of 10 to 20 typically used to discard reads
summary(meta$mapq)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 30.00 39.00 32.87 40.00 53.00
Rather than taking a whole-genome view, we often want to view the reads for a particular gene or region of interest. This we can do using the functions we have already seen.
my.reads <- which(seqnames(mybam)=="17" & start(mybam) > 7577851 & end(mybam) < 7598063)
mybam[my.reads]
GAlignments object with 224 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end width njunc
<Rle> <Rle> <character> <integer> <integer> <integer> <integer> <integer>
[1] 17 + 76M 76 7577918 7577993 76 0
[2] 17 - 76M 76 7577930 7578005 76 0
[3] 17 + 76M 76 7578191 7578266 76 0
[4] 17 + 68M 68 7578195 7578262 68 0
[5] 17 + 68M 68 7578203 7578270 68 0
... ... ... ... ... ... ... ... ...
[220] 17 - 76M 76 7596411 7596486 76 0
[221] 17 + 68M 68 7596523 7596590 68 0
[222] 17 - 68M 68 7596639 7596706 68 0
[223] 17 + 67M1S 68 7597833 7597899 67 0
[224] 17 - 68M 68 7597888 7597955 68 0
-------
seqinfo: 86 sequences from an unspecified genome
However, there are much more efficient ways to do such queries in Bioconductor as we will see later in the course
GenomicAlignments
is part of a family of packages that provide object-types and functionality for dealing with genomic intervals; which are described in a PLoS Computational Biology paper from 2013.
The basic type of interval we can define is an IRanges
object. There is an extensive list of operations that we can perform on this object
We might want to focus on a subset of reads from the start, is when we want to analyse a particular gene. Provided that the .bam
file has been indexed (creating a .bam.bai
file in the same directory), we can very quickly jump to a particular genomic region.
First we need to define a region:-
mygene <- GRanges("17", ranges=IRanges(7577851, 7598063),strand="+")
mygene
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 17 [7577851, 7598063] +
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
The region filer can be used in conjuction with the what
argument to ScanBamParam
function to provide a detailed picture of the reads for your gene.
mygene.reads <- readGAlignments(file="/data/test/paired.bam",
param=ScanBamParam(which=mygene,
what=c("seq","mapq","flag","qual","isize")
)
)
mygene.reads
GAlignments object with 225 alignments and 5 metadata columns:
seqnames strand cigar qwidth start end width njunc | seq mapq flag
<Rle> <Rle> <character> <integer> <integer> <integer> <integer> <integer> | <DNAStringSet> <integer> <integer>
[1] 17 + 76M 76 7577918 7577993 76 0 | TCCGCCTGCC...GGGAGGCCCT 37 163
[2] 17 - 76M 76 7577930 7578005 76 0 | GGCCTCCCAA...GCCTCTGTAA 31 83
[3] 17 + 76M 76 7578191 7578266 76 0 | AGGGCACCAC...CCACTCGGAT 40 99
[4] 17 + 68M 68 7578195 7578262 68 0 | CACCACCACA...CCTTCCACTC 40 163
[5] 17 + 68M 68 7578203 7578270 68 0 | CACTATGTCG...TCGGATAAGA 40 99
... ... ... ... ... ... ... ... ... . ... ... ...
[221] 17 + 68M 68 7596523 7596590 68 0 | CAAAAAATCA...CTAGAGGTCT 30 163
[222] 17 - 68M 68 7596639 7596706 68 0 | CCTGTCAAAT...CCCCTGCCTC 25 83
[223] 17 + 67M1S 68 7597833 7597899 67 0 | TTACAGGCGT...CTGGGGGGGG 28 99
[224] 17 - 68M 68 7597888 7597955 68 0 | TGGCTGGTGT...GAATGCTTGA 35 147
[225] 17 + 68M 68 7598046 7598113 68 0 | TGAGGTGGGA...CCATTGTACT 38 163
qual isize
<PhredQuality> <integer>
[1] H;9476;:9:...@?9<085==J 87
[2] ##########...>?><><<<>S -87
[3] S=>>=<<><<...?>==??8??P 202
[4] Q<;=<;=<<=...@??>??=>?S 97
[5] M<;>;;?=>7...@@4;>:<??I 143
... ... ...
[221] S<<><=<;=<...?######### 183
[222] ##########...##==:>>><L -183
[223] S=;;<@==8;...==?;>2%?## 122
[224] Q>:?>6<><=...==:;<<=;=Q -122
[225] O===>;=>>=...########## 124
-------
seqinfo: 86 sequences from an unspecified genome
We will now look into the couple of ways that we can summarise the data that will motivate some of the methods to come in the rest of the course.
Firstly, we can compute how many bases are observed at each position using the pileupAsGRanges
convenience function from the biovizBase
package.
library(biovizBase)
baseSummary <- pileupAsGRanges("/data/test/paired.bam",regions = mygene)
baseSummary
GRanges object with 4482 ranges and 7 metadata columns:
seqnames ranges strand | A C G T N depth bam
<Rle> <IRanges> <Rle> | <integer> <integer> <integer> <integer> <integer> <numeric> <Rle>
[1] 17 [7577918, 7577918] + | 0 0 0 1 0 1 /data/test/paired.bam
[2] 17 [7577919, 7577919] + | 0 1 0 0 0 1 /data/test/paired.bam
[3] 17 [7577920, 7577920] + | 0 1 0 0 0 1 /data/test/paired.bam
[4] 17 [7577921, 7577921] + | 0 0 1 0 0 1 /data/test/paired.bam
[5] 17 [7577922, 7577922] + | 0 1 0 0 0 1 /data/test/paired.bam
... ... ... ... . ... ... ... ... ... ... ...
[4478] 17 [7598059, 7598059] + | 0 0 0 1 0 1 /data/test/paired.bam
[4479] 17 [7598060, 7598060] + | 0 1 0 0 0 1 /data/test/paired.bam
[4480] 17 [7598061, 7598061] + | 0 0 1 0 0 1 /data/test/paired.bam
[4481] 17 [7598062, 7598062] + | 0 1 0 0 0 1 /data/test/paired.bam
[4482] 17 [7598063, 7598063] + | 0 0 0 1 0 1 /data/test/paired.bam
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
GenomicRanges
object with metadata corresponding to the base counts and depthmcols
functionmeta <- mcols(baseSummary)
meta
data frame to plot the read coverage over this genomic regionchr17:7577851-7598063
in IGV
apply
is used to run the same function on each row of the data framewhich.max
returns the index of the highest value in each rowmycalls
is then a vector of “calls” for each base; the same length as the genomic regionmycalls <- apply(meta[,c("A","T","C","G")], 1,
function(x) names(x)[which.max(x)]
)
# A test to make sure that it's worked for the first few rows
mycalls[1:5]
[1] "T" "C" "C" "G" "C"
baseSummary[1:5]
GRanges object with 5 ranges and 7 metadata columns:
seqnames ranges strand | A C G T N depth bam
<Rle> <IRanges> <Rle> | <integer> <integer> <integer> <integer> <integer> <numeric> <Rle>
[1] 17 [7577918, 7577918] + | 0 0 0 1 0 1 /data/test/paired.bam
[2] 17 [7577919, 7577919] + | 0 1 0 0 0 1 /data/test/paired.bam
[3] 17 [7577920, 7577920] + | 0 1 0 0 0 1 /data/test/paired.bam
[4] 17 [7577921, 7577921] + | 0 0 1 0 0 1 /data/test/paired.bam
[5] 17 [7577922, 7577922] + | 0 1 0 0 0 1 /data/test/paired.bam
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Another piece of information that would be useful at this point is to determine what the reference base is at each position. This can be achieved using the one of the pre-built genome packages in Bioconductor
biocLite
### Don't run this during the course
source("https://bioconductor.org/biocLite.R")
biocLite("BSgenome.Hsapiens.UCSC.hg19")
Some basic information about the package can be obtained by typing the object name
library("BSgenome.Hsapiens.UCSC.hg19")
Loading required package: BSgenome
Loading required package: rtracklayer
hg19 <- BSgenome.Hsapiens.UCSC.hg19
hg19
Human genome:
# organism: Homo sapiens (Human)
# provider: UCSC
# provider version: hg19
# release date: Feb. 2009
# release name: Genome Reference Consortium GRCh37
# 93 sequences:
# chr1 chr2 chr3 chr4 chr5 chr6
# chr7 chr8 chr9 chr10 chr11 chr12
# chr13 chr14 chr15 chr16 chr17 chr18
# chr19 chr20 chr21 chr22 chrX chrY
# chrM chr1_gl000191_random chr1_gl000192_random chr4_ctg9_hap1 chr4_gl000193_random chr4_gl000194_random
# ... ... ... ... ... ...
# chrUn_gl000223 chrUn_gl000224 chrUn_gl000225 chrUn_gl000226 chrUn_gl000227 chrUn_gl000228
# chrUn_gl000229 chrUn_gl000230 chrUn_gl000231 chrUn_gl000232 chrUn_gl000233 chrUn_gl000234
# chrUn_gl000235 chrUn_gl000236 chrUn_gl000237 chrUn_gl000238 chrUn_gl000239 chrUn_gl000240
# chrUn_gl000241 chrUn_gl000242 chrUn_gl000243 chrUn_gl000244 chrUn_gl000245 chrUn_gl000246
# chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator to access a given sequence)
getSeq
function that can obtain the genome sequence for a given genomic regionrefSeq <- getSeq(hg19, baseSummary)
refSeq
renameSeqlevels
functionrefSeq <- getSeq(hg19, renameSeqlevels(baseSummary, c("17"="chr17")))
refSeq
A DNAStringSet instance of length 4482
width seq
[1] 1 T
[2] 1 C
[3] 1 C
[4] 1 G
[5] 1 C
... ... ...
[4478] 1 T
[4479] 1 C
[4480] 1 G
[4481] 1 C
[4482] 1 T
mycalls == refSeq
return?We have explored the properties of bam files using Bioconductor. The techniques and types of object we have learnt about will crop-up again-and-again in the course. The vast majority of NGS analysis tools in Bioconductor will use GenomicRanges
objects in some form.
Due to the high-volume of the dataset, some of the tools and pipelines we use will not be in R. However, you will still be able to interrogate the results you obtain and explore them in more detail using R.