What makes somatic SNV detection difficult?
CaVEMan and some other calling tools
Statistical approaches to calling somatic SNVs
How well should you expect a tool to perform?
July 2017
What makes somatic SNV detection difficult?
CaVEMan and some other calling tools
Statistical approaches to calling somatic SNVs
How well should you expect a tool to perform?
Somatic SNV callers typically set the expected mutation rate to be around 5 mutations per megabase, i.e. a total of 15,000 mutations across the genome.
Source: ICGC Data Portal
Low cellularity (tumour DNA content)
Intra-tumour heterogeneity in which multiple tumour cell populations (subclones) exist
Aneuploidy
Unbalanced structural variation (deletions, duplications, etc.)
Matched normal contaminated with cancer DNA
adjacent normal tissue may contain residual disease or early tumour-initiating somatic mutations
circulating tumour DNA in blood normals
Sequencing errors
Alignment artefacts
In this example the tumour was sequenced to an average depth of 50.
Is this sufficient?
Consider the 50 observations of our tumour which carries a mutation at this base
Tumour cellularity
In fact the 'tumour' sample has some normal contamination
40% of our reads could easily be from the normal sample
Tumour heterogeneity
The tumour may be heterogeneous, i.e. DNA may have been sampled from a number of subclones
The mutation may exist in one or more subclones but not in others
Copy number
The tumour may not be diploid
If it is triploid, then a mutation may be in only one in three of the tumour reads
Uneven coverage
50-fold coverage is the average — biases, e.g. GC content, mean that coverage is not uniform
For 10% of the genome we only reach a depth of around 40
Sampling
Finally, remember that we are taking a random sample
We may not get lucky!
Germline SNVs  AF = 0 (homozygous reference), 0.5 (heterozygous variant) or 1 (homozygous variant)
Somatic SNVs typically exist at a continuous range of variant allele frequencies.
CaVEMan (Cancer Variants through Expectation Maximization) is the somatic SNV caller in the Sanger CGP pipeline.
Bayesian probabilistic classifier
Can make use of copy number profiles and estimate of normal contamination if available
Considers base quality, lane and read position and orientation to provide more accurate estimates of sequencing error rates
The CaVEMan wrapper in the CGP pipeline contains several post-processing filters applied to increase specificity of calls
cgpCaVEManWrapper [Jones et al., 2016]
Tool | Approach | Source | Publication |
---|---|---|---|
SomaticSniper | Bayesian | Washington University | Larson et al., 2012 |
VarScan 2 | Fisher's exact | Washington University | Koboldt et al., 2012 |
JointSNVmix | Bayesian | British Columbia Cancer Agency | Roth et al., 2012 |
mutationSeq | Machine learning | British Columbia Cancer Agency | Ding et al., 2012 |
Strelka | Bayesian | Illumina | Saunders et al., 2012 |
MuTect | Bayesian | Broad Institute | Cibulskis et al., 2013 |
qSNP | Heuristic | University of Queensland | Kassahn et al., 2013 |
VarDict | Fisher's exact | AstraZeneca | Lai et al., 2016 |
MuSE | Markov substitution model (Bayesian) | MD Anderson Cancer Center | Fan et al., 2016 |
By no means an exhaustive list but includes some of the most popular somatic SNV callers.
Reference base
\(\boldsymbol{C}\)
Normal: 48 reads all supporting reference
\(\boldsymbol{\{\ 48C\ \}}\)
Tumour: 50 reads, 7 of which have an alternate
\(\boldsymbol{T}\) allele
\(\boldsymbol{\{\ 43C,\ 7T\ \}}\)
2 x 2 contingency table
Reference base
\(\boldsymbol{C}\)
Normal: 48 reads all supporting reference
\(\boldsymbol{\{\ 48C\ \}}\)
Tumour: 50 reads, 7 of which have an alternate
\(\boldsymbol{T}\) allele
\(\boldsymbol{\{\ 43C,\ 7T\ \}}\)
2 x 2 contingency table
Allele fraction of the alternate allele in the tumour, \(AF = 7/50 = 0.14\)
Perform a Fisher's exact test to determine whether a variant has a significant difference in AF between the two samples
Method used by VarScan and VarDict
contingency_table <- t(matrix(c(43, 7, 48, 0), nrow = 2, byrow = TRUE, dimnames = list(c("Tumour", "Normal"), c("Ref", "Alt")))) contingency_table
## Tumour Normal ## Ref 43 48 ## Alt 7 0
fisher.test(contingency_table)
## ## Fisher's Exact Test for Count Data ## ## data: contingency_table ## p-value = 0.01254 ## alternative hypothesis: true odds ratio is not equal to 1 ## 95 percent confidence interval: ## 0.0000000 0.6701136 ## sample estimates: ## odds ratio ## 0
Most probabilistic methods for variant calling are based on Bayes' Theorem
\[ P(G \mid D) = \frac{P(D \mid G)P(G)}{P(D)} \]
\(G\) = genotype, e.g. AA, AB, BB
\(D\) = data, i.e. set of sequence reads at the genomic position of interest
\(P(G \mid D)\) is the probability of a given genotype, \(G\), given the observed data.
The probability of observing the given set of sequence reads is the weighted average of the probabilities of observing those reads for each possible genotype:
\[ P(D) = \sum\limits_{i=1}^nP(D \mid G_i)P(G_i) \]
Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P_{error} = 10^{-\frac{30}{10}} = 0.001\right)\)
Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P_{error} = 10^{-\frac{30}{10}} = 0.001\right)\)
Likelihood of data
\[ \begin{eqnarray} \boldsymbol{P(D \mid CC)} &=& \textrm{Probability of two Q30 errors} &=& 10^{-3}\times10^{-3} &=&\ \ 10^{-6} \\[5pt] \boldsymbol{P(D \mid TT)} &=& \textrm{Probability of four Q30 errors} &=& \left(10^{-3}\right)^4 &=&\ \ 10^{-12} \\[5pt] \boldsymbol{P(D \mid CT)} &=& \textrm{Probability of }\{C,C,C,C,T,T\} &=& \left(\tfrac{1}{2}\right)^6 &=&\ \ 0.0156 \end{eqnarray} \]
Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P_{error} = 10^{-\frac{30}{10}} = 0.001\right)\)
Likelihood of data
\[ \begin{eqnarray} \boldsymbol{P(D \mid CC)} &=& \textrm{Probability of two Q30 errors} &=& 10^{-3}\times10^{-3} &=&\ \ 10^{-6} \\[5pt] \boldsymbol{P(D \mid TT)} &=& \textrm{Probability of four Q30 errors} &=& \left(10^{-3}\right)^4 &=&\ \ 10^{-12} \\[5pt] \boldsymbol{P(D \mid CT)} &=& \textrm{Probability of }\{C,C,C,C,T,T\} &=& \left(\tfrac{1}{2}\right)^6 &=&\ \ 0.0156 \end{eqnarray} \]
Priors                    \(\boldsymbol{P(CC)} = 0.9985,\ \ \boldsymbol{P(CT)} = 0.001,\ \ \boldsymbol{P(TT)} = 0.0005\)
Probability of observing \(\boldsymbol{\{ 4C, 2T \}}\)
\[ \begin{eqnarray} \boldsymbol{P(D)} &=& P(D \mid CC)(P(CC)\ +\ P(D \mid CT)P(CT)\ +\ P(D \mid TT)P(TT) \\[5pt] &=& 1.66\times 10^{-5} \end{eqnarray} \]
Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P_{error} = 10^{-\frac{30}{10}} = 0.001\right)\)
Likelihood of data
\[ \begin{eqnarray} \boldsymbol{P(D \mid CC)} &=& \textrm{Probability of two Q30 errors} &=& 10^{-3}\times10^{-3} &=&\ \ 10^{-6} \\[5pt] \boldsymbol{P(D \mid TT)} &=& \textrm{Probability of four Q30 errors} &=& \left(10^{-3}\right)^4 &=&\ \ 10^{-12} \\[5pt] \boldsymbol{P(D \mid CT)} &=& \textrm{Probability of }\{C,C,C,C,T,T\} &=& \left(\tfrac{1}{2}\right)^6 &=&\ \ 0.0156 \end{eqnarray} \]
Priors                    \(\boldsymbol{P(CC)} = 0.9985,\ \ \boldsymbol{P(CT)} = 0.001,\ \ \boldsymbol{P(TT)} = 0.0005\)
Posterior \[ \boldsymbol{P(CC \mid D)} = \frac{P(D \mid CC)P(CC)}{P(D)} = \frac{10^{-6}\times 0.9985}{1.66\times 10^{-5}} = 0.06 \]
Result\(\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(CC \mid D) = 0.06\), \(\ \ \boldsymbol{P(CT \mid D) = 0.94}\), \(\ \ P(TT \mid D) = 3 \times 10^{-11}\)
Joint genotypes
\(G^N, G^T\) = genotypes of normal and tumour
Posterior probability of the joint genotype
\[ \begin{eqnarray} \boldsymbol{P(G^N, G^T \mid D^N, D^T)}\ \ &\propto&\ \ P(D^N, D^T \mid G^N, G^T)P(G^T, G^N) \\[10pt] &\propto&\ \ P(D^N \mid D^T, G^N, G^T)P(D^T \mid G^N, G^T)P(G^N, G^T) \end{eqnarray} \]
\(P(D^N \mid D^T, G^N, G^T) = \boldsymbol{P(D^N \mid G^N)}\)
if we assume that the normal sample does not contain any reads sequenced from the tumour, i.e. \(D^N\) independent of \(G^T\) — treat as in the single sample case.
\(\boldsymbol{P(D^T \mid G^N, G^T)}\) — scope to incorporate estimates of tumour purity, copy number, etc.
\(\boldsymbol{P(G^N, G^T)}\) — prior for joint genotype; could treat as independent events, \(P(G^N, G^T) = P(G^N)P(G^T)\), but not realistic since \(T\) and \(N\) samples from same individual so share germline variants.
Sensitivity = ability to correctly identify the true mutations
Specificity = ability to only call true mutations, i.e. no false positives
Sensitivity and specificity is a function of the biology, the sample and data quality, and the calling method.
Each new tool promises better accuracy than all before, but how will it perform in your hands?
Need some representative datasets with established ground truth
ICGC benchmarking exercise
Medulloblastoma tumour/normal pair sequenced in 6 different centres to combined 300-fold coverage used to establish 'truth'
16 ICGC project teams ran their pipelines on data from one centre (40x)
Alioto et al., Nat Commun. 2015
ICGC-TCGA DREAM Somatic Mutation Calling challenge
6 synthetic datasets based on cell line sequenced to 80x, BAM randomnly split into 2 ('tumour' and 'normal'), mutations added to one computationally
\(TP\) – true positives, i.e. correct calls
\(FP\) – false positives, i.e. incorrect calls
\(FN\) – false negatives, i.e. number of missed calls
\[ \textrm{Precision} = \frac{TP}{TP + FP}\ \ \ \ \ \ \ \ \ \ \ \ \textrm{Recall} = \frac{TP}{TP + FN} \]
\[\textrm{Balanced accuracy} = \tfrac{1}{2}(\textrm{Precision} + \textrm{Recall})\ \ \ \ \ \ \ \ \ \ F\textrm{-score} = \frac{2 \times \textrm{Precision} \times \textrm{Recall}}{\textrm{Precision} + \textrm{Recall}}\ \ \ \ \ \ \ \ \ \ \textrm{Jaccard index} = \frac{TP}{TP + FP + FN}\]
Several factors complicate the detection of somatic SNVs (tumour purity, tumour heterogeneity, copy number changes, low sequencing depth, etc.)
Most somatic SNV callers employ sophisticated Bayesian statistical techniques
Some callers, including CaVEMan, incorporate copy number information and estimates of tumour purity
There is a trade-off between sensitivity and specificity
Additional filtering of SNV calls can be helpful