Sparse Tensor Decomposition for Haplotype Assembly of Diploids and Polyploids

Background: Haplotype assembly is the task of reconstructing the haplotypes of an individual from a mixture of sequenced chromosome fragments. Haplotype information enables studies of the effects of genetic variations on an organism's phenotype. Most mathematical formulations of haplotype assembly are known to be NP-hard, and the problem becomes even more challenging as sequencing technology advances and the lengths of paired-end reads and inserts increase. Assembly of haplotypes of polyploid organisms is considerably more difficult than in the case of diploids. Hence, scalable and accurate schemes with provable performance are desired for haplotype assembly of both diploid and polyploid organisms.

Results: We propose a framework that formulates haplotype assembly from sequencing data as a sparse tensor decomposition. We cast the problem as that of decomposing a tensor that has special structural constraints and is missing a large fraction of its entries into a product of two factors, $\mathbf{U}$ and $\underline{\mathbf{V}}$; the tensor $\underline{\mathbf{V}}$ reveals haplotype information, while $\mathbf{U}$ is a sparse matrix encoding the origin of erroneous sequencing reads. We propose an algorithm, AltHap, which reconstructs haplotypes of either diploid or polyploid organisms by iteratively solving this decomposition problem. The performance and convergence properties of AltHap are analyzed theoretically and, in doing so, guarantees on the achievable minimum error correction scores and correct phasing rate are established. The developed framework is applicable to diploid species and to both biallelic and polyallelic polyploid species. The code for AltHap is freely available from https://github.com/realabolfazl/AltHap.

Conclusion: AltHap was tested in a number of different scenarios and was shown to compare favorably with state-of-the-art methods on haplotype assembly of diploids, and to significantly outperform existing techniques on haplotype assembly of polyploids.
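To make the minimum error correction (MEC) score mentioned in the abstract concrete, the sketch below computes it for a given set of candidate haplotypes under the standard definition: each read is assigned to the haplotype it disagrees with least, and the disagreements are summed. This is an illustration only, not code from AltHap; the function name `mec_score`, the use of `None` for positions a read does not cover, and the toy data are all assumptions made for this example.

```python
from typing import Optional, Sequence


def mec_score(
    reads: Sequence[Sequence[Optional[int]]],
    haplotypes: Sequence[Sequence[int]],
) -> int:
    """Sum, over all reads, of the fewest allele mismatches to any haplotype.

    Positions a read does not cover are marked None and are ignored.
    (Hypothetical helper for illustration; not part of AltHap.)
    """
    total = 0
    for read in reads:
        # Mismatches between this read and each candidate haplotype,
        # counted only at covered positions.
        mismatches = [
            sum(1 for r, h in zip(read, hap) if r is not None and r != h)
            for hap in haplotypes
        ]
        # The read is attributed to its closest haplotype.
        total += min(mismatches)
    return total


# Toy diploid example: four reads over four SNP positions.
reads = [
    [0, 0, None, None],
    [None, 1, 1, None],  # disagrees with either haplotype at one position
    [1, 1, None, 0],
    [0, None, 1, 1],
]
haplotypes = [[0, 0, 1, 1], [1, 1, 0, 0]]

score = mec_score(reads, haplotypes)
```

The same function applies unchanged to polyploids (more than two rows in `haplotypes`) and to polyallelic sites (allele labels beyond 0 and 1), since it only compares labels for equality.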
