Sequence analysis

H-PoP and H-PoPG: heuristic partitioning algorithms for single individual haplotyping of polyploids

MOTIVATION: Some economically important plants, including wheat and cotton, have more than two copies of each chromosome. With the decreasing cost and increasing read length of next-generation sequencing technologies, reconstructing the multiple haplotypes of a polyploid genome from its sequence reads is becoming practical. However, polyploid haplotyping is computationally much harder than diploid haplotyping, and few methods address it.

RESULTS: This article models the polyploid haplotyping problem as an optimal poly-partition problem of the reads, called the Polyploid Balanced Optimal Partition model. For reads sequenced from a k-ploid genome, the model tries to divide the reads into k groups such that the difference between reads in the same group is minimized while the difference between reads in different groups is maximized. When genotype information is available, the model is extended to the Polyploid Balanced Optimal Partition with Genotype constraint problem. Both models are NP-hard. We propose two heuristic algorithms, H-PoP and H-PoPG, to solve the two models, respectively. Both are based on dynamic programming and a strategy of limiting the number of intermediate solutions kept at each iteration. Extensive experiments on simulated and real data show that our algorithms solve the models effectively, and are much faster and more accurate than recent state-of-the-art polyploid haplotyping algorithms. The experiments also show that our algorithms handle long reads and deep read coverage effectively and accurately. Furthermore, H-PoP might be applied to help determine the ploidy of an organism.

AVAILABILITY AND IMPLEMENTATION: https://github.com/MinzhuXie/H-PoPG

CONTACT: xieminzhu@hotmail.com

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
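The partition idea described in the abstract can be illustrated with a small Python sketch. This is not the H-PoP implementation, and the abstract does not give the exact objective. The sketch assumes reads are strings over '0', '1' and '-' (gap), scores a partition as the summed between-group difference minus the summed within-group difference, and keeps only a fixed number of partial assignments per step. The names `hamming`, `partition_score`, `beam_partition` and `beam_width` are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Count SNP positions covered by both reads ('-' marks a gap) where they disagree."""
    return sum(1 for x, y in zip(a, b) if x != '-' and y != '-' and x != y)


def partition_score(reads, groups):
    """Illustrative objective (an assumption): between-group minus within-group difference."""
    within = between = 0
    for i, j in combinations(range(len(reads)), 2):
        d = hamming(reads[i], reads[j])
        if groups[i] == groups[j]:
            within += d
        else:
            between += d
    return between - within


def beam_partition(reads, k, beam_width=4):
    """Assign reads to k groups one read at a time.

    After each read, only the beam_width best partial assignments are kept,
    which mimics the idea of limiting intermediate solutions per iteration.
    """
    partial = [()]
    for idx in range(len(reads)):
        candidates = []
        for assign in partial:
            # Symmetry breaking: reuse an existing label or open the next unused one.
            max_label = min(k, max(assign) + 2 if assign else 1)
            for g in range(max_label):
                new = assign + (g,)
                candidates.append((partition_score(reads[:idx + 1], new), new))
        # Stable sort by descending score; ties keep generation order.
        candidates.sort(key=lambda t: -t[0])
        partial = [a for _, a in candidates[:beam_width]]
    return partial[0]


# Toy triploid example: two error-free reads from each of three haplotypes.
reads = ["0000", "0000", "1111", "1111", "0011", "0011"]
groups = beam_partition(reads, k=3)
print(groups)
```

On this toy input, reads with identical content end up sharing a group label. A wider beam explores more partial assignments at higher cost, and a beam that is too narrow can discard the assignment that would have led to the best final partition.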
