Structured Low-Rank Matrix Factorization for Haplotype Assembly

In matrix factorization problems, one seeks to decompose a data matrix into a product of two matrices; frequently, one factor captures meaningful information contained in the data, and the other specifies how this information is combined to generate the data matrix. This paper studies the matrix factorization problem that arises in haplotype assembly, an important NP-hard problem in genomics. Haplotypes are sequences of chromosomal variations in an individual's genome, and they are of critical importance for understanding the individual's susceptibility to various diseases. A novel formulation of haplotype assembly as a partially observed low-rank matrix factorization problem is proposed and efficiently solved via a modified gradient descent method that exploits salient structural properties of sequencing data. In particular, the observed matrix in the problem at hand contains noisy samples of the product of an informative matrix whose rows have entries from a finite alphabet and a matrix whose rows are standard unit basis vectors. Convergence of the proposed algorithm is analyzed, and its performance is tested on both synthetic and experimental data. The results demonstrate superior accuracy and speed of the proposed method as compared to state-of-the-art haplotype assembly techniques.
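To make the structure described above concrete, the following is a minimal sketch of the data model only, not the paper's modified gradient descent algorithm. It assumes a diploid setting with the alphabet {-1, +1} and uses 0 to mark unobserved entries; the function names (`simulate_reads`, `observed_loss`, `assign_reads`) and all parameter values are hypothetical and chosen for illustration.

```python
import numpy as np

def simulate_reads(n_reads=40, n_snps=12, k=2, coverage=0.5,
                   flip_prob=0.0, seed=0):
    """Build a toy partially observed read matrix M ~ U @ H (assumed model)."""
    rng = np.random.default_rng(seed)
    # Informative factor: k haplotypes with entries from the alphabet {-1, +1}.
    H = rng.choice([-1, 1], size=(k, n_snps))
    # Membership factor: every row is a standard unit basis vector that
    # selects the haplotype a read was sampled from.
    labels = rng.integers(0, k, size=n_reads)
    U = np.eye(k, dtype=int)[labels]
    # Each read covers only some SNP positions; covered entries may be flipped.
    mask = rng.random((n_reads, n_snps)) < coverage
    flips = rng.random((n_reads, n_snps)) < flip_prob
    clean = U @ H
    noisy = np.where(flips, -clean, clean)
    M = np.where(mask, noisy, 0)  # 0 marks an unobserved entry
    return M, mask, U, H

def observed_loss(M, mask, U, H):
    """Squared error between M and U @ H, restricted to observed entries."""
    residual = (M - U @ H) * mask
    return float(np.sum(residual ** 2))

def assign_reads(M, mask, H):
    """For fixed H, pick the unit basis vector per read with the lowest cost."""
    diff = M[:, None, :] - H[None, :, :]                   # (reads, k, snps)
    costs = np.sum((diff ** 2) * mask[:, None, :], axis=2)  # (reads, k)
    return np.eye(H.shape[0], dtype=int)[np.argmin(costs, axis=1)]

if __name__ == "__main__":
    M, mask, U, H = simulate_reads(flip_prob=0.05)
    print("loss at the generating factors:", observed_loss(M, mask, U, H))
    U_hat = assign_reads(M, mask, H)
    print("loss after reassigning reads:  ", observed_loss(M, mask, U_hat, H))
```

The loss is evaluated only on observed entries, which is what "partially observed" means here. Because each row of the membership factor is constrained to be a unit basis vector, updating it for a fixed haplotype matrix reduces to a per-read choice among k candidates; this is one way to read the structural properties the abstract refers to, under the assumptions stated above.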
