Discovery and genotyping of novel sequence insertions in many sequenced individuals

Motivation: Despite recent advances in algorithm design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities that genomic repeats impose on generating assemblies reliable enough to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. Only a handful of algorithms have been developed specifically for novel sequence insertion discovery that can bypass the need for whole-genome de novo assembly. Still, most such algorithms rely on high depth of coverage, and to our knowledge there is only one method (PopIns) that can use multi-sample data to “collectively” obtain a very high coverage dataset to accurately find insertions common in a given population.
Results: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects.
Availability and implementation: Pamir is available at https://github.com/vpc-ccg/pamir.
Contact: fhach@{sfu.ca, prostatecentre.com} or calkan@cs.bilkent.edu.tr
Supplementary information: Supplementary data are available at Bioinformatics online.
