SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming

BackgroundThe goal of haplotype assembly is to infer haplotypes of an individual from a mixture of sequenced chromosome fragments. Limited lengths of paired-end sequencing reads and inserts render haplotype assembly computationally challenging; in fact, most of the problem formulations are known to be NP-hard. Dimensions (and, therefore, difficulty) of the haplotype assembly problems keep increasing as the sequencing technology advances and the length of reads and inserts grow. The computational challenges are even more pronounced in the case of polyploid haplotypes, whose assembly is considerably more difficult than in the case of diploids. Fast, accurate, and scalable methods for haplotype assembly of diploid and polyploid organisms are needed.ResultsWe develop a novel framework for diploid/polyploid haplotype assembly from high-throughput sequencing data. The method formulates the haplotype assembly problem as a semi-definite program and exploits its special structure – namely, the low rank of the underlying solution – to solve it rapidly and with high accuracy. The developed framework is applicable to both diploid and polyploid species. The code for SDhaP is freely available at https://sourceforge.net/projects/sdhap.ConclusionExtensive benchmarking tests on both real and simulated data show that the proposed algorithms outperform several well-known haplotype assembly methods in terms of either accuracy or speed or both. Useful recommendations for coverages needed to achieve near-optimal solutions are also provided.

[1]  Alexander I. Barvinok,et al.  Problems of distance geometry and convex properties of quadratic maps , 1995, Discret. Comput. Geom..

[2]  David P. Williamson,et al.  Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming , 1995, JACM.

[3]  Alan M. Frieze,et al.  Improved Approximation Algorithms for MAX k-CUT and MAX BISECTION , 1995, IPCO.

[4]  Gábor Pataki,et al.  On the Rank of Extreme Matrices in Semidefinite Programs and the Multiplicity of Optimal Eigenvalues , 1998, Math. Oper. Res..

[5]  Friedhelm Meyer auf der Heide,et al.  Algorithms — ESA 2001 , 2001, Lecture Notes in Computer Science.

[6]  Russell Schwartz,et al.  SNPs Problems, Complexity, and Algorithms , 2001, ESA.

[7]  Pardis C Sabeti,et al.  Detecting recent positive selection in the human genome from haplotype structure , 2002, Nature.

[8]  Toshihiro Tanaka The International HapMap Project , 2003, Nature.

[9]  Moses Charikar,et al.  Maximizing quadratic programs: extending Grothendieck's inequality , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[10]  A. Clark,et al.  The role of haplotypes in candidate gene studies , 2004, Genetic epidemiology.

[11]  Noga Alon,et al.  Approximating the cut-norm via Grothendieck's inequality , 2004, STOC '04.

[12]  Xiang-Sun Zhang,et al.  Haplotype reconstruction from SNP fragments by minimum error correction , 2005, Bioinform..

[13]  Leo van Iersel,et al.  On the Complexity of Several Haplotyping Problems , 2005, WABI.

[14]  Rita Casadio,et al.  Algorithms in Bioinformatics, 5th International Workshop, WABI 2005, Mallorca, Spain, October 3-6, 2005, Proceedings , 2005, WABI.

[15]  Sanjeev Arora,et al.  Fast algorithms for approximate semidefinite programming using the multiplicative weights update method , 2005, 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05).

[16]  Venkatesan Guruswami,et al.  Clustering with qualitative information , 2005, 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings..

[17]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[18]  Kenneth Ward Church,et al.  Very sparse random projections , 2006, KDD '06.

[19]  Timothy B. Stockwell,et al.  The Diploid Genome Sequence of an Individual Human , 2007, PLoS biology.

[20]  Michael S Waterman,et al.  Diploid genome reconstruction of Ciona intestinalis and comparative analysis with Ciona savignyi. , 2007, Genome research.

[21]  Vineet Bafna,et al.  HapCUT: an efficient and accurate algorithm for the haplotype assembly problem , 2008, ECCB.

[22]  A. Halpern,et al.  An MCMC algorithm for haplotype assembly from whole-genome sequence data. , 2008, Genome research.

[23]  Eleazar Eskin,et al.  Optimal algorithms for haplotype assembly from whole-genome sequence data , 2010, Bioinform..

[24]  Anthony Wirth,et al.  Correlation Clustering , 2010, Encyclopedia of Machine Learning and Data Mining.

[25]  M. DePristo,et al.  A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.

[26]  Joshua S. Paul,et al.  Genotype and SNP calling from next-generation sequencing data , 2011, Nature Reviews Genetics.

[27]  Haris Vikalo,et al.  OnlineCall: fast online parameter estimation and base calling for illumina's next-generation sequencing , 2012, Bioinform..

[28]  Sorin Istrail,et al.  HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data , 2012, J. Comput. Biol..

[29]  K. Verstrepen,et al.  Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques , 2011, Nucleic acids research.

[30]  K. Robasky,et al.  On the design of clone-based haplotyping , 2013, Genome Biology.

[31]  Zhi-Zhong Chen,et al.  Exact algorithms for haplotype assembly from whole-genome sequence data , 2013, Bioinform..

[32]  Sorin Istrail,et al.  Haplotype assembly in polyploid genomes and identical by descent shared tracts , 2013, Bioinform..

[33]  Bonnie Berger,et al.  HapTree: A Novel Bayesian Framework for Single Individual Polyplotyping Using NGS Data , 2014, RECOMB.

[34]  Bonnie Berger,et al.  HapTree: A Novel Bayesian Framework for Single Individual Polyplotyping Using NGS Data , 2014, PLoS Comput. Biol..