Somatic SNV Filtering

July 2017

Outline

What factors complicate somatic SNV calling and why is additional filtering necessary?
CaVEMan filters
Metric-based approach to filtering false positives
Other approaches to filtering (panel of normals, ensemble calling)
Benchmarks for assessing SNV calling and filtering

Recap: several factors complicate somatic SNV calling

Low cellularity (tumour DNA content)
Intra-tumour heterogeneity in which multiple tumour cell populations (subclones) exist
Aneuploidy
Unbalanced structural variation (deletions, duplications, etc.)

Matched normal contaminated with cancer DNA
- adjacent normal tissue may contain residual disease or early tumour-initiating somatic mutations
- circulating tumour DNA in blood normals

Sequencing errors
Alignment artefacts

Mwenifumbo & Marra, Nat Rev Genet. 2013

Sequencing error

Base qualities drop off toward the ends of reads on the Illumina sequencing platform, making errors in base calls more likely

Filter: minimum base quality for variant alleles [CaVEMan --min-base-qual parameter]

Filter: no variant alleles found in first 2/3 of a read [CaVEMan RP filter]

Low base qualities for variant alleles toward ends of reads
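
The two filters above can be sketched in a few lines. This is only an illustration of the idea, not CaVEMan's implementation: the VariantBase structure, the function names and the default threshold of 20 are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class VariantBase:
    """One read's observation of the variant allele (hypothetical structure)."""
    base_quality: int   # Phred-scaled quality of the variant base
    read_position: int  # 0-based offset of the variant base within the read
    read_length: int


def high_quality_observations(observations, min_base_qual=20):
    """Keep only variant observations at or above the base quality threshold."""
    return [o for o in observations if o.base_quality >= min_base_qual]


def fails_read_position(observations, fraction=2 / 3):
    """True if no variant base falls within the first `fraction` of its read."""
    return not any(o.read_position < o.read_length * fraction for o in observations)


# Three variant bases, all near the end of 150 bp reads, one of them low quality.
observations = [
    VariantBase(base_quality=35, read_position=140, read_length=150),
    VariantBase(base_quality=12, read_position=145, read_length=150),
    VariantBase(base_quality=30, read_position=148, read_length=150),
]
print(len(high_quality_observations(observations)))  # 2
print(fails_read_position(observations))             # True
```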

Technical bias

Duplicate reads
- PCR amplification during library construction introduces duplicate reads
- Bioinformatic tools identify likely PCR duplicates based on aligned start positions of both ends of a read pair

Post-alignment: mark duplicate reads and exclude from SNV calling

Strand bias
- Calls supported mainly by reads aligning to one strand may also be PCR artefacts

Filter: variant alleles in reads aligning in one direction only [SE filter]

Strand bias
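
A strand filter of this kind could look roughly as follows. The function names and the min_per_strand parameter are hypothetical; the sketch simply flags calls whose variant reads all align in one direction.

```python
def strand_counts(variant_reads_reverse):
    """Count forward and reverse variant reads.

    `variant_reads_reverse` holds one boolean per variant-supporting read,
    True if the read aligns to the reverse strand.
    """
    reverse = sum(1 for is_reverse in variant_reads_reverse if is_reverse)
    forward = len(variant_reads_reverse) - reverse
    return forward, reverse


def fails_strand_filter(variant_reads_reverse, min_per_strand=1):
    """True if either strand has fewer than `min_per_strand` variant reads."""
    forward, reverse = strand_counts(variant_reads_reverse)
    return min(forward, reverse) < min_per_strand


print(fails_strand_filter([False, False, False, False]))  # True: forward strand only
print(fails_strand_filter([False, True, False, True]))    # False: both strands
```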

Technical bias

GC bias
- Stretches of low GC content tend to be under-represented, leading to uneven coverage across the genome
- Poor confidence in calls at these low coverage regions, as germline SNVs are more likely to be mistaken for somatic SNVs due to inadequate sampling in the matched normal

Filter: minimum read depth at variant position particularly in the normal

Filter: minimum number of reads supporting variant call
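
As a minimal sketch, both depth filters reduce to simple comparisons. The filter labels and the default thresholds (10 reads in the normal, 3 variant reads) are placeholders chosen for the example, not recommended values.

```python
def depth_filters(normal_depth, variant_reads, min_normal_depth=10, min_variant_reads=3):
    """Return the names of the depth filters that a call fails (empty list if none)."""
    failed = []
    if normal_depth < min_normal_depth:
        # Too few reads in the normal to rule out a germline variant.
        failed.append("LowNormalDepth")
    if variant_reads < min_variant_reads:
        # Too few tumour reads supporting the variant allele.
        failed.append("LowVariantSupport")
    return failed


print(depth_filters(normal_depth=6, variant_reads=5))   # ['LowNormalDepth']
print(depth_filters(normal_depth=35, variant_reads=8))  # []
```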

Alignment issues

Alignment issues are a common source of false positive SNV calls

Missing sequence in the reference genome causes misalignments, usually with mismatches

Alignment: use decoy sequence included in reference genome used by 1000 Genomes project

Filter: variants supported by reads with many mismatches

Assembly in the region around a variant may differ from the reference sequence, causing incorrect alignments, e.g. indels, rearrangements

Post-alignment: Perform local realignment around indels [GATK IndelRealigner]

Filter: variants within or close to germline indels [GI filter]

Filter: variant position always toward beginning/end of alignment [RP filter]

Missing sequence from reference genome assembly

Problematic alignment around indels

Alignment issues

Repetitive regions in the genome cause difficulties for alignment tools
- Aligners assign a mapping quality of 0 if they cannot uniquely place a read
- Regions of low mappability are usually out of bounds for short read sequencing

Filter: minimum mapping quality of variant reads [MQ filter]

Filter: calls from low-complexity and low-mappability regions [SR, CR filters]

Low mapping quality
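
Mapping quality metrics for the variant reads could be summarised along these lines. The function name and the returned keys are assumptions for the sketch; a filter would then compare these values against chosen thresholds.

```python
from statistics import median


def mapping_quality_summary(variant_mapqs):
    """Summarise the mapping qualities of the variant-supporting reads."""
    if not variant_mapqs:
        return {"median": None, "fraction_mapq0": None}
    mapq0 = sum(1 for q in variant_mapqs if q == 0)  # reads the aligner could not place uniquely
    return {
        "median": median(variant_mapqs),
        "fraction_mapq0": mapq0 / len(variant_mapqs),
    }


print(mapping_quality_summary([60, 60, 0, 25]))
# {'median': 42.5, 'fraction_mapq0': 0.25}
```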

CaVEMan filters

HCC1143 dataset: 138042 of 159001 initial SNV calls (87%) made by CaVEMan are filtered

Filtering strategies

Hard filters

Based on summary statistics, or metrics, computed from the sequence reads covering the variant position, e.g. average mapping or base quality scores

Ensemble calling

Consensus call set from majority voting on candidate SNVs called by multiple callers

Blacklisting

Exclude list of problematic genomic positions and/or substitutions, e.g. based on panel of normals

Machine learning techniques

Computers can potentially be trained to recognize characteristics that distinguish true variants from erroneous calls [GATK Variant Quality Score Recalibration, mutationSeq]

Creating, tuning and testing filters

Benchmark datasets can be used to tune and assess filters

Ideally, filters should be tested on a dataset separate from the one used for training
Danger of overfitting

Approach:
- Plot distribution of true and false positive variants for variety of metrics
- Choose threshold to best distinguish between TP and FP

Here we use the ICGC medulloblastoma benchmark dataset to derive filters and the ICGC-TCGA DREAM Challenge synthetic dataset 4 to test these filters.
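
Choosing a threshold can also be automated once true and false positives are labelled. The sketch below assumes the filter removes calls whose metric falls below the threshold and scores each candidate by the fraction of true positives kept minus the fraction of false positives kept (Youden's J). That criterion is one possible choice made for the example, and the metric values are invented.

```python
def best_threshold(tp_values, fp_values):
    """Scan candidate thresholds for a 'filter if metric < threshold' rule.

    Returns the threshold maximising (fraction of TPs kept) - (fraction of FPs kept),
    together with that score. Ties go to the lowest threshold.
    """
    best, best_score = None, float("-inf")
    for threshold in sorted(set(tp_values) | set(fp_values)):
        tp_kept = sum(1 for v in tp_values if v >= threshold) / len(tp_values)
        fp_kept = sum(1 for v in fp_values if v >= threshold) / len(fp_values)
        score = tp_kept - fp_kept
        if score > best_score:
            best, best_score = threshold, score
    return best, best_score


# Invented median base qualities for labelled calls in a training dataset.
true_positives = [30, 32, 35, 36, 38, 40]
false_positives = [10, 15, 20, 25, 31]
threshold, score = best_threshold(true_positives, false_positives)
print(threshold, round(score, 3))  # 32 0.833
```

A threshold picked this way fits the training data, so its precision and sensitivity should still be checked on a separate dataset.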

Benchmark datasets

ICGC benchmarking exercise

Medulloblastoma tumour/normal pair sequenced in 6 different centres to combined 300-fold coverage used to establish 'truth'
16 ICGC project teams ran their pipelines on data from one centre (40x)

Alioto et al., Nat Commun. 2015

ICGC-TCGA DREAM Somatic Mutation Calling challenge

6 synthetic datasets based on a cell line sequenced to 80x, BAM randomly split into 2 ('tumour' and 'normal'), mutations added to one computationally
Synthetic dataset 4: 80% cellularity; 50% and 35% subclone VAF (effectively 30% and 15%)

Ewing et al., Nat Methods 2015 [leaderboards]

Average base quality of variant supporting reads

Average mapping quality of variant reads

Difference in mapping quality between variant and reference reads

Calculating metrics and applying filters

CalculateSNVMetrics tool available in CRUK-CI gatk-tools package
VariantFiltration tool in the Genome Analysis Toolkit (GATK)

# filter set 1 (complements MuTect2's own in-built hard filters)

java -jar GenomeAnalysisTK.jar \
  --analysis_type VariantFiltration \
  --reference_sequence reference.fasta \
  --variant input.vcf \
  --out output.vcf \
  --filterName VariantAlleleCount    --filterExpression "VariantAlleleCount < 3" \
  --filterName VariantCountControl   --filterExpression "VariantAlleleCountControl > 1" \
  --filterName VariantBaseQualMedian --filterExpression "VariantBaseQualMedian < 25.0" \
  --filterName VariantMapQualMedian  --filterExpression "VariantMapQualMedian < 40.0" \
  --filterName MapQualDiffMedian     --filterExpression "MapQualDiffMedian < -5.0 || MapQualDiffMedian > 5.0" \
  --filterName LowMapQual            --filterExpression "LowMapQual > 0.05"
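
The logic of filter set 1 can be restated in Python to make the thresholds easier to read and test. This is only an illustration: GATK evaluates the expressions itself as JEXL, and the dictionary layout and function name here are hypothetical. The metric names and thresholds are copied from the command above.

```python
# Each entry maps a filter name to a test that is True when the call should be filtered.
FILTER_SET_1 = {
    "VariantAlleleCount":    lambda m: m["VariantAlleleCount"] < 3,
    "VariantCountControl":   lambda m: m["VariantAlleleCountControl"] > 1,
    "VariantBaseQualMedian": lambda m: m["VariantBaseQualMedian"] < 25.0,
    "VariantMapQualMedian":  lambda m: m["VariantMapQualMedian"] < 40.0,
    # Same as "MapQualDiffMedian < -5.0 || MapQualDiffMedian > 5.0"
    "MapQualDiffMedian":     lambda m: abs(m["MapQualDiffMedian"]) > 5.0,
    "LowMapQual":            lambda m: m["LowMapQual"] > 0.05,
}


def failed_filters(metrics):
    """Return the names of the filters a call fails; an empty list means it passes."""
    return [name for name, is_filtered in FILTER_SET_1.items() if is_filtered(metrics)]


# Invented metrics for a single call.
call = {
    "VariantAlleleCount": 2,
    "VariantAlleleCountControl": 0,
    "VariantBaseQualMedian": 32.0,
    "VariantMapQualMedian": 60.0,
    "MapQualDiffMedian": 0.0,
    "LowMapQual": 0.1,
}
print(failed_filters(call))  # ['VariantAlleleCount', 'LowMapQual']
```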

ICGC MB99 benchmark – applying filters

DREAM challenge synthetic 4 dataset – testing filters

DREAM challenge synthetic 4 dataset

SNVs called by MuTect2 (12432 true, 665 false, 3836 not called)
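
These counts translate directly into precision and sensitivity (recall). The sketch assumes the 3836 "not called" SNVs are true variants missed by the caller, i.e. false negatives; the function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall (sensitivity) and F1 from call counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# MuTect2 on DREAM synthetic dataset 4: 12432 true, 665 false, 3836 not called.
precision, recall, f1 = precision_recall_f1(tp=12432, fp=665, fn=3836)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
# precision=0.949 recall=0.764 F1=0.847
```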

Panel of Normals filter (PoN)

Artefacts usually cancel out in the tumour-normal comparison, but this depends on adequate sampling
- Low depth in normal can cause germline variants to appear as somatic

An approach to detecting likely artefacts is to look for the variant in a panel of unrelated normal samples
- Filters both polymorphisms and locations prone to aberrant mapping or systematic sequencing artefacts

cgpCaVEManWrapper [Jones et al., 2016]

CRUK-CI blacklist

Based on 50x sequence data for 149 blood normal samples from the UK oesophageal cancer ICGC project
Variants appearing in at least 5 normals (minimum 3 reads, 5% allele fraction in each sample)
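
The blacklist criteria above could be applied roughly like this. The input layout (sample name mapped to per-variant alt read count and depth) and the function name are assumptions for the sketch; only the thresholds come from the description.

```python
from collections import Counter


def build_blacklist(normals, min_normals=5, min_reads=3, min_allele_fraction=0.05):
    """Return variants seen in at least `min_normals` normal samples.

    `normals` maps sample name -> {variant: (alt_reads, depth)}.
    A variant counts for a sample if it has at least `min_reads` supporting
    reads and an allele fraction of at least `min_allele_fraction`.
    """
    sample_counts = Counter()
    for calls in normals.values():
        for variant, (alt_reads, depth) in calls.items():
            if depth > 0 and alt_reads >= min_reads and alt_reads / depth >= min_allele_fraction:
                sample_counts[variant] += 1
    return {variant for variant, n in sample_counts.items() if n >= min_normals}


# Invented panel of 6 normals; variants are (chromosome, position, ref, alt) tuples.
recurrent = ("1", 1000, "C", "T")    # 4/50 reads in every normal
weak = ("2", 2000, "G", "A")         # only 2 reads in each normal
rare = ("3", 3000, "A", "G")         # well supported, but only in 4 normals
normals = {}
for i in range(6):
    calls = {recurrent: (4, 50), weak: (2, 50)}
    if i < 4:
        calls[rare] = (10, 50)
    normals[f"normal{i}"] = calls

print(build_blacklist(normals))  # {('1', 1000, 'C', 'T')}
```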

DREAM challenge synthetic 4 dataset – PoN filter

Combining results from multiple SNV callers

Can combining the results from multiple callers improve accuracy?

Ensemble calling

Majority voting on candidate SNVs from multiple callers to produce consensus call set
ICGC PanCancer project uses SNV calls made by at least 2 out of 4 callers (CaVEMan, MuTect2, MuSE, samtools)
bcbio cancer variant calling pipeline – also see Brad Chapman's blog
SomaticSeq [Fang et al, 2014]
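
A simple voting scheme can be sketched as below, keeping calls made by at least a minimum number of callers. The function name and the invented call sets are for illustration only; real pipelines also need to normalise variant representation before comparing callers.

```python
from collections import Counter


def consensus_calls(call_sets, min_callers=2):
    """Return variants reported by at least `min_callers` of the callers.

    `call_sets` maps caller name -> iterable of variants, where each variant
    is a hashable key such as (chromosome, position, ref, alt).
    """
    votes = Counter()
    for calls in call_sets.values():
        votes.update(set(calls))  # each caller votes at most once per variant
    return {variant for variant, n in votes.items() if n >= min_callers}


# Invented call sets for four callers.
a = ("1", 100, "C", "T")
b = ("1", 200, "G", "A")
c = ("2", 300, "T", "C")
call_sets = {
    "caller1": [a, b],
    "caller2": [a, c],
    "caller3": [a, b],
    "caller4": [a],
}
print(sorted(consensus_calls(call_sets)))
# [('1', 100, 'C', 'T'), ('1', 200, 'G', 'A')]
```

Raising min_callers trades sensitivity for precision, since fewer calls reach the required number of votes.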

ICGC MB99 Benchmark – Ensemble call sets

Summary

SNV calling in cancer genomes is difficult for many reasons
Using an SNV caller out-of-the-box may give a reasonable set of calls but is likely to result in call sets with higher sensitivity at the expense of precision
Simple filtering strategies can improve precision but there is a trade-off between sensitivity and precision
The cancer genome sequencing community has been active in establishing benchmark datasets that can be used to assess and improve somatic mutation calling pipelines