rCANID: Read Clustering and Assembly-based Novel Insertion Detection tool

Novel sequence insertions (NSIs) are a class of genome structural variations (SVs) that have important biological functions and correlate strongly with phenotypes and diseases. The rapid development of long-read sequencing technologies provides an opportunity to study NSIs more comprehensively, since much longer reads help to assemble and locate novel sequences. However, state-of-the-art long-read-based SV detection approaches are designed generically to detect many kinds of SVs, and they rely either only on the signals of chimerically aligned reads or on the contigs of de novo assembly; as a result, they are not well suited to NSI detection and/or are computationally expensive. Herein, we propose the Read Clustering and Assembly-based Novel Insertion Detection tool (rCANID), a novel long-read-based NSI detection approach. rCANID takes full advantage of chimerically aligned and unaligned reads through its specifically designed read clustering and lightweight local read assembly methods, which effectively reconstruct inserted sequences at relatively low computational cost. Benchmarking on both simulated and real datasets demonstrates that rCANID can sensitively discover NSIs, especially those with large inserted novel sequences, which can be hard for state-of-the-art approaches to detect. rCANID is well suited for integration into computational pipelines and can play an important role in many genomic studies.
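To make the idea of clustering chimerically aligned reads more concrete, the following is a minimal sketch and not rCANID's actual implementation. It assumes each alignment is given as a hypothetical `Alignment` record with a 0-based reference start and a SAM-style CIGAR string, treats a large soft or hard clip as the chimeric signal, and groups reads whose clip breakpoints lie close together on the reference. The names `clip_breakpoints` and `cluster_reads` and the thresholds `min_clip`, `max_gap`, and `min_support` are illustrative assumptions.

```python
import re
from collections import namedtuple

# Hypothetical minimal alignment record (not rCANID's data model).
# pos is assumed to be a 0-based reference start coordinate.
Alignment = namedtuple("Alignment", "read chrom pos cigar")

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clip_breakpoints(aln, min_clip=200):
    """Return reference positions adjacent to large clipped read ends."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(aln.cigar)]
    # Only M, D, N, = and X operations consume reference bases.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    points = []
    if ops and ops[0][1] in "SH" and ops[0][0] >= min_clip:
        points.append(aln.pos)            # clip at the left end
    if len(ops) > 1 and ops[-1][1] in "SH" and ops[-1][0] >= min_clip:
        points.append(aln.pos + ref_len)  # clip at the right end
    return points


def cluster_reads(alignments, max_gap=100, min_support=2):
    """Group clipped reads whose breakpoints are within max_gap of each other.

    Returns (chrom, first_breakpoint, last_breakpoint, read_names) tuples
    for clusters supported by at least min_support distinct reads.
    """
    signals = sorted(
        (aln.chrom, bp, aln.read)
        for aln in alignments
        for bp in clip_breakpoints(aln)
    )
    clusters, current = [], []
    for chrom, bp, read in signals:
        # Start a new cluster on a chromosome change or a large gap.
        if current and (chrom != current[-1][0] or bp - current[-1][1] > max_gap):
            clusters.append(current)
            current = []
        current.append((chrom, bp, read))
    if current:
        clusters.append(current)

    result = []
    for c in clusters:
        reads = sorted({read for _, _, read in c})
        if len(reads) >= min_support:
            result.append((c[0][0], c[0][1], c[-1][1], reads))
    return result


example = [
    Alignment("r1", "chr1", 1000, "500M300S"),  # right clip, breakpoint 1500
    Alignment("r2", "chr1", 1520, "250S400M"),  # left clip, breakpoint 1520
    Alignment("r3", "chr1", 5000, "600M"),      # fully aligned, no signal
    Alignment("r4", "chr2", 1500, "300S100M"),  # lone signal on chr2
]
candidate_sites = cluster_reads(example)
```

In this toy input, `r1` and `r2` support one candidate site on `chr1`, while the lone clipped read on `chr2` is filtered out by `min_support`. In a fuller pipeline, the clipped and unaligned reads gathered for each site would then be passed to a local assembler to reconstruct the inserted sequence, a step this sketch does not attempt.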
