July 2017

Outline

  • What factors complicate somatic SNV calling and why is additional filtering necessary?

  • CaVEMan filters

  • Metric-based approach to filtering false positives

  • Other approaches to filtering (panel of normals, ensemble calling)

  • Benchmarks for assessing SNV calling and filtering

Recap: several factors complicate somatic SNV calling

  • Low cellularity (tumour DNA content)

  • Intra-tumour heterogeneity in which multiple tumour cell populations (subclones) exist

  • Aneuploidy

  • Unbalanced structural variation (deletions, duplications, etc.)


  • Matched normal contaminated with cancer DNA

    • adjacent normal tissue may contain residual disease or early tumour-initiating somatic mutations

    • circulating tumour DNA in blood normals


  • Sequencing errors

  • Alignment artefacts

Mwenifumbo & Marra, Nat Rev Genet. 2013

Sequencing error

  • Base qualities drop off toward the ends of reads on the Illumina sequencing platform, making errors in base calls more likely

Filter: minimum base quality for variant alleles [CaVEMan --min-base-qual parameter]

Filter: no variant alleles found in the first 2/3 of a read [CaVEMan RP filter]

Low base qualities for variant alleles toward ends of reads
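A quick way to check for this drop-off in a given library, as a sketch assuming samtools and its companion plot-bamstats script are installed (file names illustrative):

# Summarise per-cycle base qualities; the generated plots include
# quality-vs-cycle curves showing any drop-off toward read ends
samtools stats tumour.bam > tumour.stats
plot-bamstats -p qc/tumour tumour.stats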

Technical bias

  • Duplicate reads

    • PCR amplification during library construction introduces duplicate reads

    • Bioinformatic tools identify likely PCR duplicates based on aligned start positions of both ends of a read pair

Post-alignment: mark duplicate reads and exclude from SNV calling
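A typical invocation, as a sketch assuming Picard is available as picard.jar (file names illustrative):

# Mark (rather than remove) PCR duplicates so downstream callers can skip them
java -jar picard.jar MarkDuplicates \
  INPUT=tumour.bam \
  OUTPUT=tumour.markdup.bam \
  METRICS_FILE=tumour.markdup_metrics.txt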


  • Strand bias

    • Calls supported mainly by reads aligning to one strand may also be PCR artefacts

Filter: variant alleles in reads aligning in one direction only [SE filter]

Strand bias
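Where calls carry GATK's standard FisherStrand annotation (FS), an equivalent strand-bias hard filter can be written in the same style as the VariantFiltration command shown later; the threshold here is illustrative, not the SE filter's own rule:

# Flag calls with strong strand bias (FS is a phred-scaled Fisher's exact test p-value)
java -jar GenomeAnalysisTK.jar \
  --analysis_type VariantFiltration \
  --reference_sequence reference.fasta \
  --variant input.vcf \
  --out output.vcf \
  --filterName StrandBias --filterExpression "FS > 60.0"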

Technical bias

  • GC bias

    • Stretches of low GC content tend to be under-represented, leading to uneven coverage across the genome

    • Calls in these low-coverage regions are less reliable: with inadequate sampling of the matched normal, germline SNVs are more likely to be mistaken for somatic SNVs

Filter: minimum read depth at the variant position, particularly in the normal

Filter: minimum number of reads supporting variant call
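As a sketch, a minimum-depth soft filter can be applied with bcftools, assuming the standard DP FORMAT field is present and that the normal is the first sample in the VCF (threshold illustrative):

# Tag calls where the normal sample has fewer than 10 covering reads
bcftools filter -m + -s LowDepthNormal -e 'FORMAT/DP[0] < 10' input.vcf > output.vcf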

Alignment issues

Alignment issues are a common source of false-positive SNV calls


  • Missing sequence in the reference genome causes misalignments, usually with mismatches

Alignment: use the decoy sequence included in the reference genome used by the 1000 Genomes Project

Filter: variants supported by reads with many mismatches


  • The sample's sequence around a variant may differ from the reference sequence, causing incorrect alignments, e.g. at indels and rearrangements

Post-alignment: perform local realignment around indels [GATK IndelRealigner] (see sketch below)

Filter: variants within or close to germline indels [GI filter]

Filter: variant position always toward beginning/end of alignment [RP filter]
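The local realignment mentioned above runs in two stages; a sketch using GATK 3 (file names illustrative):

# 1. Identify intervals likely to need realignment
java -jar GenomeAnalysisTK.jar \
  --analysis_type RealignerTargetCreator \
  --reference_sequence reference.fasta \
  --input_file tumour.bam \
  --out realign.intervals

# 2. Realign reads within those intervals
java -jar GenomeAnalysisTK.jar \
  --analysis_type IndelRealigner \
  --reference_sequence reference.fasta \
  --input_file tumour.bam \
  --targetIntervals realign.intervals \
  --out tumour.realigned.bam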

Missing sequence from reference genome assembly

Problematic alignment around indels

Alignment issues

  • Repetitive regions in the genome cause difficulties for alignment tools

    • Aligners assign a mapping quality of 0 if they cannot uniquely place a read

    • Regions of low mappability are usually out of bounds for short read sequencing

Filter: minimum mapping quality of variant reads [MQ filter]

Filter: calls from low-complexity and low-mappability regions [SR, CR filters]
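Both filters can be approximated with standard tools, as a sketch assuming a BED file of low-mappability regions is available (file names and threshold illustrative):

# Drop reads below a minimum mapping quality
samtools view -b -q 30 tumour.bam > tumour.mq30.bam

# Remove calls that fall in low-complexity/low-mappability regions
bedtools intersect -v -header -a calls.vcf -b low_mappability.bed > calls.filtered.vcf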

Low mapping quality

CaVEMan filters

HCC1143 dataset: 138042 of 159001 (87%) initial SNV calls made by CaVEMan are filtered out

Filtering strategies

Hard filters

  • Based on summary statistics, or metrics, computed from the sequence reads covering the variant position, e.g. average mapping or base quality scores

Ensemble calling

  • Consensus call set from majority voting on candidate SNVs called by multiple callers

Blacklisting

  • Exclude a list of problematic genomic positions and/or substitutions, e.g. based on a panel of normals

Machine learning techniques

Creating, tuning and testing filters

  • Benchmark datasets can be used to tune and assess filters

  • Ideally, filters should be tested on a separate dataset from the one used for training

  • Danger of overfitting


  • Approach:

    • Plot the distribution of true and false positive variants for a variety of metrics

    • Choose thresholds that best distinguish between TPs and FPs


Here we use the ICGC medulloblastoma benchmark dataset to derive filters and the ICGC-TCGA DREAM Challenge synthetic dataset 4 to test them.
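The per-call metrics can be extracted from an annotated VCF with bcftools for plotting; a sketch assuming the metric annotations used in the filtering command below (e.g. VariantBaseQualMedian) are present as INFO fields:

# Extract selected metrics to a table for plotting TP vs FP distributions
bcftools query \
  -f '%CHROM\t%POS\t%INFO/VariantBaseQualMedian\t%INFO/VariantMapQualMedian\n' \
  annotated.vcf > metrics.tsv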

Benchmark datasets

ICGC benchmarking exercise


  • Medulloblastoma tumour/normal pair sequenced at 6 different centres to a combined 300-fold coverage, used to establish the 'truth' set

  • 16 ICGC project teams ran their pipelines on data from one centre (40x)

Alioto et al., Nat Commun. 2015


ICGC-TCGA DREAM Somatic Mutation Calling challenge


  • 6 synthetic datasets based on a cell line sequenced to 80x; the BAM was randomly split in two ('tumour' and 'normal') and mutations were added computationally to one

  • Synthetic dataset 4: 80% cellularity; 50% and 35% subclone VAF (effectively 30% and 15%)

Ewing et al., Nat Methods 2015 [leaderboards]

Average base quality of variant-supporting reads

Average mapping quality of variant reads

Difference in mapping quality between variant and reference reads

Calculating metrics and applying filters

# filter set 1 (complements MuTect2's own in-built hard filters)

java -jar GenomeAnalysisTK.jar \
  --analysis_type VariantFiltration \
  --reference_sequence reference.fasta \
  --variant input.vcf \
  --out output.vcf \
  --filterName VariantAlleleCount    --filterExpression "VariantAlleleCount < 3" \
  --filterName VariantCountControl   --filterExpression "VariantAlleleCountControl > 1" \
  --filterName VariantBaseQualMedian --filterExpression "VariantBaseQualMedian < 25.0" \
  --filterName VariantMapQualMedian  --filterExpression "VariantMapQualMedian < 40.0" \
  --filterName MapQualDiffMedian     --filterExpression "MapQualDiffMedian < -5.0 || MapQualDiffMedian > 5.0" \
  --filterName LowMapQual            --filterExpression "LowMapQual > 0.05"
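Note that VariantFiltration tags failing records in the FILTER column rather than removing them; passing calls can then be extracted, for example with bcftools:

# Keep only records whose FILTER field is PASS
bcftools view -f PASS output.vcf > output.pass.vcf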

ICGC MB99 benchmark – applying filters

DREAM challenge synthetic 4 dataset – testing filters

DREAM challenge synthetic 4 dataset

SNVs called by MuTect2 (12432 true, 665 false, 3836 not called)

Panel of Normals filter (PoN)

  • Artefacts usually cancel out in the tumour-normal comparison, but this depends on adequate sampling

    • Low depth in the normal can cause germline variants to appear as somatic

  • An approach to detecting likely artefacts is to look for the variant in a panel of unrelated normal samples

    • Filters both polymorphisms and locations prone to aberrant mapping or systematic sequencing artefacts

cgpCaVEManWrapper [Jones et al., 2016]


CRUK-CI blacklist

  • Based on 50x sequence data for 149 blood normal samples from the UK oesophageal cancer ICGC project

  • Variants appearing in at least 5 normals (minimum of 3 reads and 5% allele fraction in each sample)
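Such a blacklist, if held as a VCF, can be subtracted from a call set with bcftools; a sketch assuming both files are bgzipped and tabix-indexed (file names illustrative):

# Keep only calls absent from the panel-of-normals VCF
bcftools isec -C -w1 -Oz -o calls.pon_filtered.vcf.gz calls.vcf.gz pon.vcf.gz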

DREAM challenge synthetic 4 dataset – PoN filter

Combining results from multiple SNV callers

Can combining the results from multiple callers improve accuracy?


Ensemble calling

  • Majority voting on candidate SNVs from multiple callers to produce consensus call set

  • ICGC PanCancer project uses SNV calls made by at least 2 of 4 callers (CaVEMan, MuTect2, MuSE, samtools)

  • bcbio cancer variant calling pipeline – also see Brad Chapman's blog

  • SomaticSeq [Fang et al., 2014]
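A minimal sketch of site-level majority voting with bcftools, assuming each caller's VCF is bgzipped and tabix-indexed (file names illustrative):

# List sites called by at least 2 of the 3 callers
bcftools isec -n+2 caveman.vcf.gz mutect2.vcf.gz muse.vcf.gz > consensus_sites.txt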

ICGC MB99 Benchmark – Ensemble call sets

Summary

  • SNV calling in cancer genomes is difficult for many reasons

  • Using an SNV caller out-of-the-box may give a reasonable set of calls but is likely to result in call sets with higher sensitivity at the expense of precision

  • Simple filtering strategies can improve precision, but there is a trade-off between sensitivity and precision

  • The cancer genome sequencing community has been active in establishing benchmark datasets that can be used to assess and improve somatic mutation calling pipelines