Somatic SNV Calling

July 2017

Outline

What makes somatic SNV detection difficult?
CaVEMan and some other calling tools
Statistical approaches to calling somatic SNVs
How well should you expect a tool to perform?

Somatic vs Germline SNVs

Numbers of somatic SNVs in different cancer types

Somatic SNV callers typically set the expected mutation rate to be around 5 mutations per megabase, i.e. a total of 15,000 mutations across the genome.

Source: ICGC Data Portal

Several factors complicate somatic SNV calling

Low cellularity (tumour DNA content)
Intra-tumour heterogeneity in which multiple tumour cell populations (subclones) exist
Aneuploidy
Unbalanced structural variation (deletions, duplications, etc.)

Matched normal contaminated with cancer DNA
- adjacent normal tissue may contain residual disease or early tumour-initiating somatic mutations
- circulating tumour DNA in blood normals

Sequencing errors
Alignment artefacts

Mwenifumbo & Marra, Nat Rev Genet. 2013

Issues affecting mutation detection in cancer

In this example the tumour was sequenced to an average depth of 50.

Is this sufficient?
Consider the 50 observations of our tumour which carries a mutation at this base

Issues affecting mutation detection in cancer

Tumour cellularity

In fact the 'tumour' sample has some normal contamination
40% of our reads could easily be from the normal sample

Issues affecting mutation detection in cancer

Tumour heterogeneity

The tumour may be heterogeneous, i.e. DNA may have been sampled from a number of subclones
The mutation may exist in one or more subclones but not in others

Issues affecting mutation detection in cancer

Copy number

The tumour may not be diploid
If it is triploid, then a mutation may be in only one in three of the tumour reads

Issues affecting mutation detection in cancer

Uneven coverage

50-fold coverage is the average — biases, e.g. GC content, mean that coverage is not uniform
For 10% of the genome we only reach a depth of around 40

Issues affecting mutation detection in cancer

Sampling

Finally, remember that we are taking a random sample
We may not get lucky!

Credit: Andy Lynch

Allele frequencies

Germline SNVs AF = 0 (homozygous reference), 0.5 (heterozygous variant) or 1 (homozygous variant)

Somatic SNVs typically exist at a continuous range of variant allele frequencies.

CaVEMan SNV caller

CaVEMan (Cancer Variants through Expectation Maximization) is the somatic SNV caller in the Sanger CGP pipeline.

Bayesian probabilistic classifier
Can make use of copy number profiles and estimate of normal contamination if available
Considers base quality, lane and read position and orientation to provide more accurate estimates of sequencing error rates
The CaVEMan wrapper in the CGP pipeline contains several post-processing filters applied to increase specificity of calls

cgpCaVEManWrapper [Jones et al., 2016]

Other somatic SNV callers

Tool	Approach	Source	Publication
SomaticSniper	Bayesian	Washington University	Larson et al., 2012
VarScan 2	Fisher's exact	Washington University	Koboldt et al., 2012
JointSNVmix	Bayesian	British Columbia Cancer Agency	Roth et al., 2012
mutationSeq	Machine learning	British Columbia Cancer Agency	Ding et al., 2012
Strelka	Bayesian	Illumina	Saunders et al., 2012
MuTect	Bayesian	Broad Institute	Cibulskis et al., 2013
qSNP	Heuristic	University of Queensland	Kassahn et al., 2013
VarDict	Fisher's exact	AstraZeneca	Lai et al., 2016
MuSE	Markov substitution model (Bayesian)	MD Anderson Cancer Center	Fan et al., 2016

By no means an exhaustive list but includes some of the most popular somatic SNV callers.

Simple statistical approach

Reference base \(\boldsymbol{C}\)
Normal: 48 reads all supporting reference \(\boldsymbol{\{\ 48C\ \}}\)
Tumour: 50 reads, 7 of which have an alternate \(\boldsymbol{T}\) allele \(\boldsymbol{\{\ 43C,\ 7T\ \}}\)

2 x 2 contingency table

Simple statistical approach

2 x 2 contingency table

Allele fraction of the alternate allele in the tumour, \(AF = 7/50 = 0.14\)
Perform a Fisher's exact test to determine whether a variant has a significant difference in AF between the two samples
Method used by VarScan and VarDict

Fisher's exact test in R

contingency_table <- t(matrix(c(43, 7, 48, 0), nrow = 2, byrow = TRUE,
                              dimnames = list(c("Tumour", "Normal"), c("Ref", "Alt"))))
contingency_table

##     Tumour Normal
## Ref     43     48
## Alt      7      0

fisher.test(contingency_table)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.01254
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.0000000 0.6701136
## sample estimates:
## odds ratio 
##          0

Bayesian statistical approach

Most probabilistic methods for variant calling are based on Bayes' Theorem

\[ P(G \mid D) = \frac{P(D \mid G)P(G)}{P(D)} \]

\(G\) = genotype, e.g. AA, AB, BB

\(D\) = data, i.e. set of sequence reads at the genomic position of interest

\(P(G \mid D)\) is the probability of a given genotype, \(G\), given the observed data.

The probability of observing the given set of sequence reads is the weighted average of the probabilities of observing those reads for each possible genotype:

\[ P(D) = \sum\limits_{i=1}^nP(D \mid G_i)P(G_i) \]

Germline SNV calling (single sample)

Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P_{error} = 10^{-\frac{30}{10}} = 0.001\right)\)

Germline SNV calling (single sample)

Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P_{error} = 10^{-\frac{30}{10}} = 0.001\right)\)

Likelihood of data

\[ \begin{eqnarray} \boldsymbol{P(D \mid CC)} &=& \textrm{Probability of two Q30 errors} &=& 10^{-3}\times10^{-3} &=&\ \ 10^{-6} \\[5pt] \boldsymbol{P(D \mid TT)} &=& \textrm{Probability of four Q30 errors} &=& \left(10^{-3}\right)^4 &=&\ \ 10^{-12} \\[5pt] \boldsymbol{P(D \mid CT)} &=& \textrm{Probability of }\{C,C,C,C,T,T\} &=& \left(\tfrac{1}{2}\right)^6 &=&\ \ 0.0156 \end{eqnarray} \]

Germline SNV calling (single sample)

Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P_{error} = 10^{-\frac{30}{10}} = 0.001\right)\)

Likelihood of data

Priors \(\boldsymbol{P(CC)} = 0.9985,\ \ \boldsymbol{P(CT)} = 0.001,\ \ \boldsymbol{P(TT)} = 0.0005\)

Probability of observing \(\boldsymbol{\{ 4C, 2T \}}\)

\[ \begin{eqnarray} \boldsymbol{P(D)} &=& P(D \mid CC)(P(CC)\ +\ P(D \mid CT)P(CT)\ +\ P(D \mid TT)P(TT) \\[5pt] &=& 1.66\times 10^{-5} \end{eqnarray} \]

Germline SNV calling (single sample)

Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P_{error} = 10^{-\frac{30}{10}} = 0.001\right)\)

Likelihood of data

Priors \(\boldsymbol{P(CC)} = 0.9985,\ \ \boldsymbol{P(CT)} = 0.001,\ \ \boldsymbol{P(TT)} = 0.0005\)

Posterior \[ \boldsymbol{P(CC \mid D)} = \frac{P(D \mid CC)P(CC)}{P(D)} = \frac{10^{-6}\times 0.9985}{1.66\times 10^{-5}} = 0.06 \]

Result\(\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(CC \mid D) = 0.06\), \(\ \ \boldsymbol{P(CT \mid D) = 0.94}\), \(\ \ P(TT \mid D) = 3 \times 10^{-11}\)

Source: Heng Li, Broad Institute MPG workshop 2011

Somatic SNV calling

Joint genotypes

\(G^N, G^T\) = genotypes of normal and tumour

Bayesian approach extended to somatic SNVs

Posterior probability of the joint genotype

\[ \begin{eqnarray} \boldsymbol{P(G^N, G^T \mid D^N, D^T)}\ \ &\propto&\ \ P(D^N, D^T \mid G^N, G^T)P(G^T, G^N) \\[10pt] &\propto&\ \ P(D^N \mid D^T, G^N, G^T)P(D^T \mid G^N, G^T)P(G^N, G^T) \end{eqnarray} \]

\(P(D^N \mid D^T, G^N, G^T) = \boldsymbol{P(D^N \mid G^N)}\)

if we assume that the normal sample does not contain any reads sequenced from the tumour, i.e. \(D^N\) independent of \(G^T\) — treat as in the single sample case.

\(\boldsymbol{P(D^T \mid G^N, G^T)}\) — scope to incorporate estimates of tumour purity, copy number, etc.

\(\boldsymbol{P(G^N, G^T)}\) — prior for joint genotype; could treat as independent events, \(P(G^N, G^T) = P(G^N)P(G^T)\), but not realistic since \(T\) and \(N\) samples from same individual so share germline variants.

VCF output file

How well can we expect a somatic SNV caller to perform?

Sensitivity = ability to correctly identify the true mutations

Specificity = ability to only call true mutations, i.e. no false positives

Sensitivity and specificity is a function of the biology, the sample and data quality, and the calling method.
Each new tool promises better accuracy than all before, but how will it perform in your hands?
Need some representative datasets with established ground truth
- Recent benchmarking exercises/challenges (TCGA, ICGC, DREAM)

Benchmark datasets

ICGC benchmarking exercise

Medulloblastoma tumour/normal pair sequenced in 6 different centres to combined 300-fold coverage used to establish 'truth'
16 ICGC project teams ran their pipelines on data from one centre (40x)

Alioto et al., Nat Commun. 2015

ICGC-TCGA DREAM Somatic Mutation Calling challenge

6 synthetic datasets based on cell line sequenced to 80x, BAM randomnly split into 2 ('tumour' and 'normal'), mutations added to one computationally
- Synthetic dataset 4: 80% cellularity; 50% and 35% subclone VAF (effectively 30% and 15%)

Real challenge: 5 pancreatic and 5 prostate cancer patients

Ewing et al., Nat Methods 2015 [leaderboards]

Assessing SNV calling

\(TP\) – true positives, i.e. correct calls

\(FP\) – false positives, i.e. incorrect calls

\(FN\) – false negatives, i.e. number of missed calls

\[ \textrm{Precision} = \frac{TP}{TP + FP}\ \ \ \ \ \ \ \ \ \ \ \ \textrm{Recall} = \frac{TP}{TP + FN} \]

\[\textrm{Balanced accuracy} = \tfrac{1}{2}(\textrm{Precision} + \textrm{Recall})\ \ \ \ \ \ \ \ \ \ F\textrm{-score} = \frac{2 \times \textrm{Precision} \times \textrm{Recall}}{\textrm{Precision} + \textrm{Recall}}\ \ \ \ \ \ \ \ \ \ \textrm{Jaccard index} = \frac{TP}{TP + FP + FN}\]

ICGC Benchmark – medulloblastoma MB99

Summary

Several factors complicate the detection of somatic SNVs (tumour purity, tumour heterogeneity, copy number changes, low sequencing depth, etc.)
Most somatic SNV callers employ sophisticated Bayesian statistical techniques
Some callers, including CaVEMan, incorporate copy number information and estimates of tumour purity
There is a trade-off between sensitivity and specificity
Additional filtering of SNV calls can be helpful