July 2017

Outline

  • What makes somatic SNV detection difficult?

  • CaVEMan and some other calling tools

  • Statistical approaches to calling somatic SNVs

  • How well should you expect a tool to perform?

Somatic vs Germline SNVs


Numbers of somatic SNVs in different cancer types

Somatic SNV callers typically set the expected mutation rate to be around 5 mutations per megabase, i.e. a total of 15,000 mutations across the genome.

Source: ICGC Data Portal

Several factors complicate somatic SNV calling

  • Low cellularity (tumour DNA content)

  • Intra-tumour heterogeneity in which multiple tumour cell populations (subclones) exist

  • Aneuploidy

  • Unbalanced structural variation (deletions, duplications, etc.)


  • Matched normal contaminated with cancer DNA

    • adjacent normal tissue may contain residual disease or early tumour-initiating somatic mutations

    • circulating tumour DNA in blood normals


  • Sequencing errors

  • Alignment artefacts

Mwenifumbo & Marra, Nat Rev Genet. 2013

Issues affecting mutation detection in cancer

In this example the tumour was sequenced to an average depth of 50.


  • Is this sufficient?

  • Consider the 50 observations of our tumour which carries a mutation at this base

Issues affecting mutation detection in cancer

Tumour cellularity


  • In fact the 'tumour' sample has some normal contamination

  • 40% of our reads could easily be from the normal sample

Issues affecting mutation detection in cancer

Tumour heterogeneity


  • The tumour may be heterogeneous, i.e. DNA may have been sampled from a number of subclones

  • The mutation may exist in one or more subclones but not in others

Issues affecting mutation detection in cancer

Copy number


  • The tumour may not be diploid

  • If it is triploid, then a mutation may be in only one in three of the tumour reads

Issues affecting mutation detection in cancer

Uneven coverage


  • 50-fold coverage is the average — biases, e.g. GC content, mean that coverage is not uniform

  • For 10% of the genome we only reach a depth of around 40

Issues affecting mutation detection in cancer

Sampling


  • Finally, remember that we are taking a random sample

  • We may not get lucky!


Credit: Andy Lynch

Allele frequencies

Germline SNVs  AF = 0 (homozygous reference), 0.5 (heterozygous variant) or 1 (homozygous variant)

Somatic SNVs typically exist at a continuous range of variant allele frequencies.

CaVEMan SNV caller

CaVEMan (Cancer Variants through Expectation Maximization) is the somatic SNV caller in the Sanger CGP pipeline.


  • Bayesian probabilistic classifier

  • Can make use of copy number profiles and estimate of normal contamination if available

  • Considers base quality, lane and read position and orientation to provide more accurate estimates of sequencing error rates

  • The CaVEMan wrapper in the CGP pipeline contains several post-processing filters applied to increase specificity of calls


cgpCaVEManWrapper [Jones et al., 2016]

Other somatic SNV callers

Tool Approach Source Publication
SomaticSniper Bayesian Washington University Larson et al., 2012
VarScan 2 Fisher's exact Washington University Koboldt et al., 2012
JointSNVmix Bayesian British Columbia Cancer Agency Roth et al., 2012
mutationSeq Machine learning British Columbia Cancer Agency Ding et al., 2012
Strelka Bayesian Illumina Saunders et al., 2012
MuTect Bayesian Broad Institute Cibulskis et al., 2013
qSNP Heuristic University of Queensland Kassahn et al., 2013
VarDict Fisher's exact AstraZeneca Lai et al., 2016
MuSE Markov substitution model (Bayesian) MD Anderson Cancer Center Fan et al., 2016

By no means an exhaustive list but includes some of the most popular somatic SNV callers.

Simple statistical approach


Reference base \(\boldsymbol{C}\)
Normal: 48 reads all supporting reference \(\boldsymbol{\{\ 48C\ \}}\)
Tumour: 50 reads, 7 of which have an alternate \(\boldsymbol{T}\) allele \(\boldsymbol{\{\ 43C,\ 7T\ \}}\)


2 x 2 contingency table

Simple statistical approach


Reference base \(\boldsymbol{C}\)
Normal: 48 reads all supporting reference \(\boldsymbol{\{\ 48C\ \}}\)
Tumour: 50 reads, 7 of which have an alternate \(\boldsymbol{T}\) allele \(\boldsymbol{\{\ 43C,\ 7T\ \}}\)


2 x 2 contingency table


  • Allele fraction of the alternate allele in the tumour, \(AF = 7/50 = 0.14\)

  • Perform a Fisher's exact test to determine whether a variant has a significant difference in AF between the two samples

  • Method used by VarScan and VarDict

Fisher's exact test in R

contingency_table <- t(matrix(c(43, 7, 48, 0), nrow = 2, byrow = TRUE,
                              dimnames = list(c("Tumour", "Normal"), c("Ref", "Alt"))))
contingency_table
##     Tumour Normal
## Ref     43     48
## Alt      7      0
fisher.test(contingency_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.01254
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.0000000 0.6701136
## sample estimates:
## odds ratio 
##          0

Bayesian statistical approach

Most probabilistic methods for variant calling are based on Bayes' Theorem


\[ P(G \mid D) = \frac{P(D \mid G)P(G)}{P(D)} \]

\(G\) = genotype, e.g. AA, AB, BB

\(D\) = data, i.e. set of sequence reads at the genomic position of interest

\(P(G \mid D)\) is the probability of a given genotype, \(G\), given the observed data.


The probability of observing the given set of sequence reads is the weighted average of the probabilities of observing those reads for each possible genotype:

\[ P(D) = \sum\limits_{i=1}^nP(D \mid G_i)P(G_i) \]

Germline SNV calling (single sample)

Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P_{error} = 10^{-\frac{30}{10}} = 0.001\right)\)

Germline SNV calling (single sample)

Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P_{error} = 10^{-\frac{30}{10}} = 0.001\right)\)

Likelihood of data

\[ \begin{eqnarray} \boldsymbol{P(D \mid CC)} &=& \textrm{Probability of two Q30 errors} &=& 10^{-3}\times10^{-3} &=&\ \ 10^{-6} \\[5pt] \boldsymbol{P(D \mid TT)} &=& \textrm{Probability of four Q30 errors} &=& \left(10^{-3}\right)^4 &=&\ \ 10^{-12} \\[5pt] \boldsymbol{P(D \mid CT)} &=& \textrm{Probability of }\{C,C,C,C,T,T\} &=& \left(\tfrac{1}{2}\right)^6 &=&\ \ 0.0156 \end{eqnarray} \]

Germline SNV calling (single sample)

Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P_{error} = 10^{-\frac{30}{10}} = 0.001\right)\)

Likelihood of data

\[ \begin{eqnarray} \boldsymbol{P(D \mid CC)} &=& \textrm{Probability of two Q30 errors} &=& 10^{-3}\times10^{-3} &=&\ \ 10^{-6} \\[5pt] \boldsymbol{P(D \mid TT)} &=& \textrm{Probability of four Q30 errors} &=& \left(10^{-3}\right)^4 &=&\ \ 10^{-12} \\[5pt] \boldsymbol{P(D \mid CT)} &=& \textrm{Probability of }\{C,C,C,C,T,T\} &=& \left(\tfrac{1}{2}\right)^6 &=&\ \ 0.0156 \end{eqnarray} \]


Priors                    \(\boldsymbol{P(CC)} = 0.9985,\ \ \boldsymbol{P(CT)} = 0.001,\ \ \boldsymbol{P(TT)} = 0.0005\)

Probability of observing \(\boldsymbol{\{ 4C, 2T \}}\)

\[ \begin{eqnarray} \boldsymbol{P(D)} &=& P(D \mid CC)(P(CC)\ +\ P(D \mid CT)P(CT)\ +\ P(D \mid TT)P(TT) \\[5pt] &=& 1.66\times 10^{-5} \end{eqnarray} \]

Germline SNV calling (single sample)

Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P_{error} = 10^{-\frac{30}{10}} = 0.001\right)\)

Likelihood of data

\[ \begin{eqnarray} \boldsymbol{P(D \mid CC)} &=& \textrm{Probability of two Q30 errors} &=& 10^{-3}\times10^{-3} &=&\ \ 10^{-6} \\[5pt] \boldsymbol{P(D \mid TT)} &=& \textrm{Probability of four Q30 errors} &=& \left(10^{-3}\right)^4 &=&\ \ 10^{-12} \\[5pt] \boldsymbol{P(D \mid CT)} &=& \textrm{Probability of }\{C,C,C,C,T,T\} &=& \left(\tfrac{1}{2}\right)^6 &=&\ \ 0.0156 \end{eqnarray} \]


Priors                    \(\boldsymbol{P(CC)} = 0.9985,\ \ \boldsymbol{P(CT)} = 0.001,\ \ \boldsymbol{P(TT)} = 0.0005\)

Posterior \[ \boldsymbol{P(CC \mid D)} = \frac{P(D \mid CC)P(CC)}{P(D)} = \frac{10^{-6}\times 0.9985}{1.66\times 10^{-5}} = 0.06 \]


Result\(\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(CC \mid D) = 0.06\), \(\ \ \boldsymbol{P(CT \mid D) = 0.94}\), \(\ \ P(TT \mid D) = 3 \times 10^{-11}\)


Source: Heng Li, Broad Institute MPG workshop 2011

Somatic SNV calling

Joint genotypes


\(G^N, G^T\) = genotypes of normal and tumour

Bayesian approach extended to somatic SNVs

Posterior probability of the joint genotype

\[ \begin{eqnarray} \boldsymbol{P(G^N, G^T \mid D^N, D^T)}\ \ &\propto&\ \ P(D^N, D^T \mid G^N, G^T)P(G^T, G^N) \\[10pt] &\propto&\ \ P(D^N \mid D^T, G^N, G^T)P(D^T \mid G^N, G^T)P(G^N, G^T) \end{eqnarray} \]


\(P(D^N \mid D^T, G^N, G^T) = \boldsymbol{P(D^N \mid G^N)}\)

if we assume that the normal sample does not contain any reads sequenced from the tumour, i.e. \(D^N\) independent of \(G^T\) — treat as in the single sample case.


\(\boldsymbol{P(D^T \mid G^N, G^T)}\) — scope to incorporate estimates of tumour purity, copy number, etc.


\(\boldsymbol{P(G^N, G^T)}\) — prior for joint genotype; could treat as independent events, \(P(G^N, G^T) = P(G^N)P(G^T)\), but not realistic since \(T\) and \(N\) samples from same individual so share germline variants.

VCF output file


How well can we expect a somatic SNV caller to perform?

Sensitivity = ability to correctly identify the true mutations

Specificity = ability to only call true mutations, i.e. no false positives


  • Sensitivity and specificity is a function of the biology, the sample and data quality, and the calling method.

  • Each new tool promises better accuracy than all before, but how will it perform in your hands?

  • Need some representative datasets with established ground truth

    • Recent benchmarking exercises/challenges (TCGA, ICGC, DREAM)

Benchmark datasets

ICGC benchmarking exercise


  • Medulloblastoma tumour/normal pair sequenced in 6 different centres to combined 300-fold coverage used to establish 'truth'

  • 16 ICGC project teams ran their pipelines on data from one centre (40x)

Alioto et al., Nat Commun. 2015


ICGC-TCGA DREAM Somatic Mutation Calling challenge


  • 6 synthetic datasets based on cell line sequenced to 80x, BAM randomnly split into 2 ('tumour' and 'normal'), mutations added to one computationally

    • Synthetic dataset 4: 80% cellularity; 50% and 35% subclone VAF (effectively 30% and 15%)

  • Real challenge: 5 pancreatic and 5 prostate cancer patients

Ewing et al., Nat Methods 2015 [leaderboards]

Assessing SNV calling


\(TP\) – true positives, i.e. correct calls

\(FP\) – false positives, i.e. incorrect calls

\(FN\) – false negatives, i.e. number of missed calls

\[ \textrm{Precision} = \frac{TP}{TP + FP}\ \ \ \ \ \ \ \ \ \ \ \ \textrm{Recall} = \frac{TP}{TP + FN} \]


\[\textrm{Balanced accuracy} = \tfrac{1}{2}(\textrm{Precision} + \textrm{Recall})\ \ \ \ \ \ \ \ \ \ F\textrm{-score} = \frac{2 \times \textrm{Precision} \times \textrm{Recall}}{\textrm{Precision} + \textrm{Recall}}\ \ \ \ \ \ \ \ \ \ \textrm{Jaccard index} = \frac{TP}{TP + FP + FN}\]

ICGC Benchmark – medulloblastoma MB99

Summary

  • Several factors complicate the detection of somatic SNVs (tumour purity, tumour heterogeneity, copy number changes, low sequencing depth, etc.)

  • Most somatic SNV callers employ sophisticated Bayesian statistical techniques

  • Some callers, including CaVEMan, incorporate copy number information and estimates of tumour purity

  • There is a trade-off between sensitivity and specificity

  • Additional filtering of SNV calls can be helpful