PolyCluster: Minimum Fragment Disagreement Clustering for Polyploid Phasing

Phasing is an emerging area in computational biology with important applications in clinical decision making and the biomedical sciences. While machine learning techniques have shown tremendous potential in many biomedical applications, their utility in phasing is not yet fully understood. In this paper, we investigate the development of clustering-based techniques for phasing in polyploid organisms, where more than two copies of each chromosome exist in the cells of the organism under study. We develop a novel framework, called PolyCluster, based on correlation clustering followed by an effective cluster-merging mechanism that minimizes the amount of disagreement among the short reads residing in each cluster. We first introduce a graph model to quantify the similarity between each pair of DNA reads. We then present a combination of linear programming, rounding, region growing, and cluster merging to group similar reads and reconstruct haplotypes. Our extensive analysis demonstrates the effectiveness of PolyCluster in accurate and scalable phasing. In particular, we show that PolyCluster reduces the switching error of H-PoP, HapColor, and HapTree by 44.4, 51.2, and 48.3 percent, respectively. In addition, the running time of PolyCluster is several orders of magnitude lower than that of HapTree and comparable to that of the other algorithms.
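As a rough illustration of the kind of graph model the abstract describes, the sketch below builds a signed similarity graph over reads, where a positive edge suggests two reads come from the same haplotype and a negative edge suggests they do not. The scoring rule (agreements minus disagreements over shared SNP sites), the dict-based read representation, and the function names `read_similarity` and `build_similarity_graph` are assumptions made for illustration; the abstract does not specify the weights PolyCluster actually uses.

```python
from itertools import combinations


def read_similarity(read_a, read_b):
    """Signed similarity between two reads (hypothetical scoring rule).

    Each read is a dict mapping SNP position -> observed allele.
    Returns agreements minus disagreements over the shared positions,
    or None when the reads do not overlap at any SNP site.
    """
    shared = read_a.keys() & read_b.keys()
    if not shared:
        return None
    agree = sum(1 for pos in shared if read_a[pos] == read_b[pos])
    return agree - (len(shared) - agree)


def build_similarity_graph(reads):
    """Return {(i, j): weight} for every pair of overlapping reads, i < j."""
    edges = {}
    for i, j in combinations(range(len(reads)), 2):
        weight = read_similarity(reads[i], reads[j])
        if weight is not None:
            edges[(i, j)] = weight
    return edges


# Toy input: four reads covering SNP positions 0..4 (made-up data).
reads = [
    {0: 0, 1: 1, 2: 0},
    {1: 1, 2: 0, 3: 1},
    {0: 1, 1: 0, 2: 1},
    {3: 0, 4: 1},
]
graph = build_similarity_graph(reads)
```

A correlation clustering step would then try to partition the reads so that positive edges fall inside clusters and negative edges fall between them; that optimization (linear programming, rounding, region growing, and merging) is not shown here.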
