In this session we:-
We will use the freebayes genotype caller for convenience. The main reasons for choosing this were:-
freebayes
is a command-line tool that you will need to run from the terminal. As with all command-line tools, we first need to make sure that we are located in the correct directory. Remember that we can change directory using the unix command cd
.
freebayes
has already been in the CRUK Docker. However, we can make sure that freebayes
is available by typing freebayes
at the terminal. This will give a basic description, a version number, and information about how to cite the package (very important if you end-up using the tool to generate results for your paper). Adding the parameter -h
will display more-detailed help about the tool and some examples of how to use it.
cd /home/participant/Course_Materials/Day2/
freebayes
freebayes -h
We are going to use freebayes
to call SNVs on some 1000 genomes samples. In order to make the SNV-calling run in a reasonable time, we are only considering reads aligned to chromosome 20 in this analysis. In the Appendix you will see the commands used to create the bam files.
The minimal requirements to run freebayes
(and why it is appealing for this practical!) are a reference genome and a .bam
file. The -f
parameter is used to specify the location of a reference genome in .fasta
format.
Please don’t run this next command
freebayes -f /reference_data/Homo_sapiens_assembly19.fasta /data/hapmap/NA12878.chr20.bam
If you did run that command, you would quickly see that the screen gets filled with lots of text. These are the calls that freebayes
is making being printed to the screen (the standard output for some unix commands). If you find yourself in this situation, a swift press of CTRL + C
should stop the tool from running.
What we need to do is direct the output to a file. We can call the output file anything we like, but it is advisable to make the name relatable to the name of the input. If we are in the situation of calling genotypes on many samples, with many different callers, then we want to be able to identify the processing used on each sample.
Although it is not mandatory, we give the output file the extension .vcf
. The .vcf
format is a commonly-adopted standard for variant calls, which we will look into detail now.
We also restrict the analysis to a particular region to speed things up
freebayes -f /reference_data/Homo_sapiens_assembly19.fasta --region 20:500000-800000 /data/hapmap/NA12878.chr20.bam > NA12878.chr20.subset.freebayes.vcf
The vcf format was initially developed by the 1000 Genomes Project, and ownership has been subsequently transferred to Global Alliance for Genomics and Health Data Working group file format team. The format can be used to represent information about all kinds of genomic variation. In this session we will just consider SNVs.
We don’t require any specialised software to look at the contents of a vcf file. They can be opened in a bog-standard text editor.For now, navigate to the folder and right-click on NA12878.chr20.subset.freebayes.vcf
. Select the option to open with gedit
(a common text editor for Ubuntu).
In a similar vein to the .bam
and .sam
files we saw earlier, the .vcf
files contains many lines of header information.
##fileformat=VCFv4.2
##fileDate=20170721
##source=freeBayes v1.1.0-44-gd784cf8
##reference=/reference_data/Homo_sapiens_assembly19.fasta
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
After many more lines of information, we finally get to the details of the actual calls themsevles. This part of the file is tab-delimited; with 10 columns for every call. The vcf specification page gives details of what should be contained in each column
Shown here is the information about the first five calls
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
20 504789 . C T 155.466 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=5;CIGAR=1X;DP=5;DPB=5;DPRA=0;EPP=3.44459;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=11.5366;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=189;QR=0;RO=0;RPL=4;RPP=6.91895;RPPR=0;RPR=1;RUN=1;SAF=3;SAP=3.44459;SAR=2;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1 GT:DP:AD:RO:QR:AO:QA:GL 1/1:5:0,5:0:0:5:189:-17.363,-1.50515,0
20 505298 . T C 48.5303 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=2;CIGAR=1X;DP=2;DPB=2;DPRA=0;EPP=3.0103;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=44.5;MQMR=0;NS=1;NUMALT=1;ODDS=7.37776;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=65;QR=0;RO=0;RPL=2;RPP=7.35324;RPPR=0;RPR=0;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1 GT:DP:AD:RO:QR:AO:QA:GL 1/1:2:0,2:0:0:2:65:-6.05687,-0.60206,0
20 506074 . A G 2.08497e-06 . AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;DP=7;DPB=7;DPRA=0;EPP=7.35324;EPPR=6.91895;GTI=0;LEN=1;MEANALT=1;MQM=44.5;MQMR=60;NS=1;NUMALT=1;ODDS=14.5493;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=4;QR=189;RO=5;RPL=2;RPP=7.35324;RPPR=6.91895;RPR=0;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=0;SRP=13.8677;SRR=5;TYPE=snp;technology.ILLUMINA=1 GT:DP:AD:RO:QR:AO:QA:GL 0/0:7:5,2:5:189:2:4:0,-1.72751,-16.9877
20 506882 . A G 1.15644 . AB=0.333333;ABP=4.45795;AC=1;AF=0.5;AN=2;AO=2;CIGAR=1X;DP=6;DPB=6;DPRA=0;EPP=3.0103;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=1.18712;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=54;QR=153;RO=4;RPL=2;RPP=7.35324;RPPR=5.18177;RPR=0;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=3;SRP=5.18177;SRR=1;TYPE=snp;technology.ILLUMINA=1 GT:DP:AD:RO:QR:AO:QA:GL 0/1:6:4,2:4:153:2:54:-3.31865,0,-12.3328
The first seven columns should look consistent across different genotype callers. In this exercise we have not annotated against known variants, or applied any filtering, so the ID
and FILTER
columns are blank. In some .vcf
files these columns might be populated with dbSNP IDs or flags such as PASS
/ FAIL
respectively.
The contents of the INFO
and FORMAT
columns will depend on what variant caller has been used. The INFO
column contains metrics and other information related to each variant call as a set of KEY=VALUE
pairs. Each pair is separated by a ;
character.
The INFO
for the first variant reads as:-
AB=0.454545;ABP=3.20771;AC=1;AF=0.5;AN=2;AO=5;CIGAR=1X;DP=11;DPB=11;DPRA=0;EPP=3.44459;EPPR=4.45795;GTI=0;.......
which we can interpret as:-
Key | Value | |
---|---|---|
AB=0.454545 | AB | 0.454545 |
ABP=3.20771 | ABP | 3.20771 |
AC=1 | AC | 1 |
AF=0.5 | AF | 0.5 |
AN=2 | AN | 2 |
AO=5 | AO | 5 |
CIGAR=1X | CIGAR | 1X |
DP=11 | DP | 11 |
DPB=11 | DPB | 11 |
DPRA=0 | DPRA | 0 |
The meaning of each KEY
is given in the header for the .vcf
file.
##INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">##INFO=<ID=ABP,Number=A,Type=Float,Description="Allele balance probability at heterozygous sites: Phred-scaled upper-bounds estimate of the probability of observing the deviation between ABR and ABA given E(ABR/ABA) ~ 0.5, derived using Hoeffding's inequality">##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">##INFO=<ID=AO,Number=A,Type=Integer,Description="Count of full observations of this alternate haplotype.">##INFO=<ID=CIGAR,Number=A,Type=String,Description="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing. Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">##INFO=<ID=DPB,Number=1,Type=Float,Description="Total read depth per bp at the locus; bases in reads overlapping / bases in haplotype">##INFO=<ID=DPRA,Number=A,Type=Float,Description="Alternate allele depth ratio. Ratio between depth in samples with each called alternate allele and those without.">
The final column in the file describes the calls for the sample NA12878
, which was the only sample that we called variants on. In the sample column (NA12878
) for the first variant we see the entry
(0/1:11:11,5:6:245:5:188:-13.9693,0,-19.1141
).
These are values separated by a :
character and they are interpreted in the same order as dictated by the FORMAT
column which is GT:DP:DPR:RO:QR:AO:QA:GL
Key | Value | Description |
---|---|---|
GT | 0/1 | Genotype |
DP | 11 | Read Depth |
DPR | 11,5 | Number of observation for each allele |
RO | 6 | Reference allele observation count |
QR | 245 | Sum of quality of the reference observations |
AO | 5 | Alternate allele observation count |
QA | 188 | Sum of quality of the alternate observations |
GL | -13.9693,0,-19.1141 | Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy |
So for this particular variant, in the sample NA12878
there is a genotype of 0\1
(heterozygous) and a depth of 11 etc.
To understand the vcf format better, we can use another mode of operation of freebayes
which is to call genotypes on multiple samples simultaneously. We can also specify an exact region of the genome to call genotypes to speed-up the processing.
freebayes -f /reference_data/Homo_sapiens_assembly19.fasta --region 20:500000-800000 /data/hapmap/NA12878.chr20.bam /data/hapmap/NA12873.chr20.bam /data/hapmap/NA12874.chr20.bam > combined.chr20.subset.freebayes.vcf
As before, we can look at the first five calls.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12873 NA12874 NA12878
20 502620 . G A 124.93 . AB=0;ABP=0;AC=2;AF=0.333333;AN=6;AO=5;CIGAR=1X;DP=24;DPB=24;DPRA=0.526316;EPP=3.44459;EPPR=5.8675;GTI=1;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=3;NUMALT=1;ODDS=1.91343;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=206;QR=587;RO=19;RPL=2;RPP=3.44459;RPPR=5.8675;RPR=3;RUN=1;SAF=2;SAP=3.44459;SAR=3;SRF=13;SRP=8.61041;SRR=6;TYPE=snp;technology.ILLUMINA=1 GT:DP:AD:RO:QR:AO:QA:GL 1/1:5:0,5:0:0:5:206:-18.9161,-1.50515,0 0/0:17:17,0:17:528:0:0:0,-5.11751,-47.8112 0/0:2:2,0:2:59:0:0:0,-0.60206,-5.60414
20 502793 . G A 508.485 . AB=0;ABP=0;AC=2;AF=0.333333;AN=6;AO=18;CIGAR=1X;DP=36;DPB=36;DPRA=2;EPP=3.49285;EPPR=3.49285;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=57;NS=3;NUMALT=1;ODDS=2.14415;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=659;QR=702;RO=18;RPL=10;RPP=3.49285;RPPR=3.0103;RPR=8;RUN=1;SAF=8;SAP=3.49285;SAR=10;SRF=15;SRP=20.3821;SRR=3;TYPE=snp;technology.ILLUMINA=1 GT:DP:AD:RO:QR:AO:QA:GL 0/0:12:12,0:12:479:0:0:0,-3.61236,-42.35 1/1:18:0,18:0:0:18:659:-59.6311,-5.41854,0 0/0:6:6,0:6:223:0:0:0,-1.80618,-20.4049
20 502937 . C T 144.588 . AB=0.5;ABP=3.0103;AC=2;AF=0.333333;AN=6;AO=7;CIGAR=1X;DP=27;DPB=27;DPRA=0.538462;EPP=3.32051;EPPR=3.44459;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=3;NUMALT=1;ODDS=5.35048;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=270;QR=750;RO=20;RPL=3;RPP=3.32051;RPPR=3.44459;RPR=4;RUN=1;SAF=4;SAP=3.32051;SAR=3;SRF=4;SRP=18.6449;SRR=16;TYPE=snp;technology.ILLUMINA=1 GT:DP:AD:RO:QR:AO:QA:GL 0/1:12:6,6:6:230:6:229:-17.3628,0,-17.4496 0/0:13:13,0:13:482:0:0:0,-3.91339,-43.7103 0/1:2:1,1:1:38:1:41:-3.49251,0,-3.19521
20 504482 . G A 41.2659 . AB=0;ABP=0;AC=2;AF=0.333333;AN=6;AO=3;CIGAR=1X;DP=7;DPB=7;DPRA=1.5;EPP=3.73412;EPPR=3.0103;GTI=1;LEN=1;MEANALT=1;MQM=60;MQMR=54;NS=3;NUMALT=1;ODDS=1.33886;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=112;QR=133;RO=4;RPL=3;RPP=9.52472;RPPR=5.18177;RPR=0;RUN=1;SAF=1;SAP=3.73412;SAR=2;SRF=3;SRP=5.18177;SRR=1;TYPE=snp;technology.ILLUMINA=1 GT:DP:AD:RO:QR:AO:QA:GL 0/0:2:2,0:2:65:0:0:0,-0.60206,-6.06068 1/1:3:0,3:0:0:3:112:-10.4467,-0.90309,0 0/0:2:2,0:2:68:0:0:0,-0.60206,-6.45793
You should notice that we now have columns NA12878
, NA12873
and NA12874
as we have called variants in three samples.
One easy way of being able to visualise the calls is to use the IGV browser. Before we do this however, we can do some extra processing to make it easier to process the files. This series of commands will compress and index the vcf files (similar to how bam files are indexed to produce a .bai
file). IGV would probably cope fine with reading such relatively-small files, but it is good to get into the habit of processing our files in this manner.
bgzip -c combined.chr20.subset.freebayes.vcf > combined.chr20.subset.freebayes.vcf.gz
tabix -p vcf combined.chr20.subset.freebayes.vcf.gz
bgzip -c NA12878.chr20.subset.freebayes.vcf > NA12878.chr20.subset.freebayes.vcf.gz
tabix -p vcf NA12878.chr20.subset.freebayes.vcf.gz
After running these commands, you should see that the files combined.chr20.subset.freebayes.vcf.gz
, combined.chr20.subset.freebayes.vcf.gz.tbi
, NA12878.chr20.subset.freebayes.vcf.gz
and NA12878.chr20.subset.freebayes.vcf.gz.tbi
have been created in your working directory.
These files can be viewing in IGV and as usual we can zoom-in and scroll along the genome. Each .vcf
introduces two tracks into the IGV data panel; the first gives information about the variant, and the second is for sample-specific information.
NA12878.chr20.subset.freebayes.vcf.gz
and combined.chr20.subset.freebayes.vcf.gz
from the File -> Open
menuchr20:500,900-505,630
.vcf
is shown in IGVbam
files to complement our understanding of the callsTo dig-into these files further, we will use tools within Bioconductor. Our goal will be to find calls that are in common between different samples, and what SNPs occur within coding regions. For completeness, we note that other command-line tools can be used to perform these operations (the bedtools suite for example). However, we should be reasonably comfortable with R by now and the package we use interacts nicely with the GenomicRanges
concepts we discussed yesterday.