Detecting complex indels with wide length-spectrum from the third generation sequencing data

Structural variations are a complex collection of mutations, many of which are reported to be associated with complex traits. Recent research reports a rare class of structural variants, complex indels, which may contribute to carcinogenesis. A complex indel typically presents multiple inserted nucleotides within a deleted region. Owing to limitations in both data and algorithms, existing approaches can only detect complex indels shorter than 80 bp; however, longer ones are considered to have a stronger impact. In this paper, we propose a novel algorithm, SVseq3, which handles PacBio data and identifies long complex indels. The algorithm takes the BLASR alignment results and locates suspicious areas of complex indels by clustering. An improved similarity hash-based framework is then constructed. For each suspicious area, a continuing-seed strategy is adopted to split the inserted fragments and recover their original locations. The mapped segments, each consisting of a series of seeds, are used to further narrow down the intermediate breakpoints and identify the forms of the complex indels. SVseq3 is able to detect long complex indels as well as complex indels with multiple sources of inserted fragments. We test SVseq3 on multiple datasets with different simulation configurations and compare it with existing methods. The experimental results demonstrate that SVseq3 outperforms the existing approaches. The sensitivity and positive predictive rate reach around 70% and 85%, respectively, in some common simulation settings.
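
The clustering step mentioned above can be illustrated with a minimal sketch. This is not the SVseq3 implementation: the abstract does not specify the clustering method, so the code below assumes a simple one-dimensional gap-based grouping of candidate breakpoint coordinates. The function name `cluster_positions` and the parameters `max_gap` and `min_support` are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def cluster_positions(
    positions: List[int],
    max_gap: int = 50,      # hypothetical: largest distance between neighbours in one cluster
    min_support: int = 3,   # hypothetical: minimum number of signals to keep a cluster
) -> List[Tuple[int, int, int]]:
    """Group candidate breakpoint coordinates into (start, end, support) regions.

    Assumption: each position is a reference coordinate where an alignment
    suggests a breakpoint (for example, the edge of a large gap).
    """
    regions: List[Tuple[int, int, int]] = []
    if not positions:
        return regions

    ordered = sorted(positions)
    start = prev = ordered[0]
    count = 1
    for pos in ordered[1:]:
        if pos - prev <= max_gap:
            # Close enough to the previous signal: extend the current cluster.
            prev = pos
            count += 1
        else:
            # Gap too large: close the current cluster and start a new one.
            if count >= min_support:
                regions.append((start, prev, count))
            start = prev = pos
            count = 1
    # Close the final cluster.
    if count >= min_support:
        regions.append((start, prev, count))
    return regions


if __name__ == "__main__":
    signals = [1000, 1010, 1025, 5000, 9000, 9020, 9030, 9045]
    print(cluster_positions(signals))
```

In this toy input, the isolated signal at 5000 is discarded because it lacks support, while the two dense groups are reported as suspicious regions. A real pipeline would derive the coordinates from the BLASR alignments and would likely tune both thresholds to the read error profile.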
