July 2017

Outline

  • Functional impact of indels in cancer

  • Short exercise: exploring TGF-beta receptor 2 indels in COSMIC

  • Indel calling tools

  • How well do somatic indel callers perform?

  • Some characteristics of indels called in HCC1143

Indels

  • Range from 1 to 10,000 bases, but here we are mostly considering small indels of 1-50 bp

  • Lower frequency than SNPs except near highly repetitive regions, including homopolymers and microsatellites

Somatic frameshift deletion

Functional impact

  • 1000 Genomes Project loss of function indel variants in each individual [Nature 2010]

    • 340 - 400 premature stop codons, splice site disruptions and frameshifts

    • 250 - 300 genes affected


  • Frameshift mutations have an impact in several diseases (cystic fibrosis, HIV, Crohn's disease, Tay-Sachs)

  • Colorectal cancer

    • 15% of colorectal tumours characterized by microsatellite instability (MSI), caused by defective mismatch repair

    • DNA slippage within coding sequences induces frameshift mutations resulting in truncated, functionally inactive proteins (TGFBR2 and BAX frequently targeted genes)


  • COSMIC database of somatic mutations in cancer has several examples of frameshift mutations in tumour suppressors, such as TP53, PTEN, BRCA1/2 and TGFBR2
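
The slippage mechanism above can be illustrated with a minimal Python sketch. It assumes a made-up toy coding sequence containing a run of ten A's (not the real TGFBR2 sequence); the names REF, MUT and translate are hypothetical helpers for illustration only. Deleting a single A from the run shifts the reading frame and introduces a premature stop codon.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence up to the first stop codon.

    Any trailing incomplete codon is ignored.
    """
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def is_frameshift(indel_length):
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


# Toy sequence (an assumption, for illustration): start codon, A10 run, tail.
REF = "ATGG" + "A" * 10 + "GCTGACCGGTTCTTAA"
# Same sequence after slippage removes one A from the homopolymer run.
MUT = "ATGG" + "A" * 9 + "GCTGACCGGTTCTTAA"

print(translate(REF))    # MEKKKLTGS
print(translate(MUT))    # MEKKS (truncated by a premature stop)
print(is_frameshift(1))  # True
```

In this toy example the 1-bp deletion cuts the translated product from nine residues to five.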

Indels in cancer genome sequencing projects

  • In-depth analyses in large cohort studies typically focus on substitutions or copy number aberrations/rearrangements

    • Indels often just catalogued

  • Recent Sanger Institute paper on 560 breast cancers searched for novel indel drivers in non-coding regions with significant recurrence (functional regulatory elements) [Nik-Zainal et al., Nature 2016]


  • Application of 1000 Genomes Project data to cancer genomics showed that genes linked with cancer are under stronger selection against indels [Science 2013]

Exercise – Exploring indels in TGF-beta receptor 2

  • Search for TGFBR2 in the COSMIC website

  • Explore the breakdown of mutation types for samples with TGFBR2 mutations (Distribution tab)

    • What proportion of samples have frameshift indels?

    • What size are most of these indels?

  • Look at the location of the indels within the gene (Gene View tab)

    • What is immediately striking about this?

    • Zoom in on the most frequently observed indel and turn on the DNA sequence display

    • What is the sequence context at this indel?

  • Click on the c.374delA deletion to access details of the cancers in which this mutation was observed

    • Which cancer type is this mutation most commonly seen in?

Indel calling tools

  • Several tools call both SNVs and indels, particularly those that perform local reassembly or realignment around likely indel sites

    • VarScan2
    • Strelka
    • MuTect2
    • VarDict

  • Pindel is the somatic indel caller used in the Sanger CGP pipeline

    • Uses a pattern growth algorithm to place unmapped reads from read pairs in which the mate is mapped (the mapped mate acts as an anchor)

    • Can identify short and medium-sized indels up to 10 kb

    • Modified version used in CGP pipeline that takes advantage of additional alignment information available for longer reads (>100 bases) and can make use of split read alignments

Identifying indels can be tricky

  • Many indels exist in long homopolymers and short sequence repeats (di-nucleotides, tri-nucleotides, etc.)

  • Not easy to distinguish between true variants caused by replication slippage and sequencing errors

    • Strelka filters out indels within repeats above certain length
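
A repeat-length filter of the kind mentioned above might be sketched as follows. This is not Strelka's actual implementation: the function names and the max_run threshold of 8 are assumptions chosen for illustration, and only homopolymers are handled (not di- or tri-nucleotide repeats).

```python
def homopolymer_run_length(seq, pos):
    """Length of the run of identical bases that contains 0-based index pos."""
    base = seq[pos]
    start = pos
    while start > 0 and seq[start - 1] == base:
        start -= 1
    end = pos
    while end < len(seq) - 1 and seq[end + 1] == base:
        end += 1
    return end - start + 1


def passes_repeat_filter(seq, pos, max_run=8):
    """Keep an indel call only if its homopolymer context is short enough.

    max_run is a hypothetical threshold, not a value taken from any caller.
    """
    return homopolymer_run_length(seq, pos) <= max_run


# Toy reference context: a run of ten A's flanked by GCT on each side.
context = "GCT" + "A" * 10 + "GCT"
print(homopolymer_run_length(context, 5))  # 10
print(passes_repeat_filter(context, 5))    # False
print(passes_repeat_filter(context, 0))    # True
```

A filter like this trades sensitivity for specificity: it removes likely sequencing artefacts in long repeats, but also discards any true slippage events that occur there.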


ICGC MB99 benchmark

Indels in HCC1143

Total of 208 somatic insertions and 564 somatic deletions called by Pindel.

Short sequence repeat expansions/contractions in HCC1143

272 homopolymer A/T indels (35% of all indels), almost all of which are 1-bp insertions or deletions.
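
The 35% figure follows from the Pindel counts quoted for HCC1143. A short check in Python, using only the numbers given in these slides (the variable names are arbitrary):

```python
insertions = 208       # somatic insertions called by Pindel
deletions = 564        # somatic deletions called by Pindel
homopolymer_at = 272   # homopolymer A/T indels

total_indels = insertions + deletions
fraction = homopolymer_at / total_indels

print(total_indels)        # 772
print(f"{fraction:.1%}")   # 35.2%
```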