Decoding Genetic Variations: Communications-Inspired Haplotype Assembly

High-throughput DNA sequencing technologies allow fast and affordable sequencing of individual genomes and thus enable unprecedented studies of genetic variations. Information about variations in the genome of an individual is provided by haplotypes, ordered collections of single nucleotide polymorphisms. Knowledge of haplotypes is instrumental in finding genes associated with diseases, in drug development, and in evolutionary studies. Haplotype assembly from high-throughput sequencing data is challenging because sequencing reads are short and contain errors. The key observation made in this paper is that the minimum error-correction formulation of the haplotype assembly problem is identical to the task of deciphering a coded message received over a noisy channel, a classical problem in the mature field of communication theory. Exploiting this connection, we develop novel haplotype assembly schemes that rely on the bit-flipping and belief propagation algorithms often used in communication systems. The latter algorithm is then adapted to the haplotype assembly of polyploids. We demonstrate on both simulated and experimental data that the proposed algorithms compare favorably with state-of-the-art haplotype assembly methods in terms of accuracy, while being scalable and computationally efficient.
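
To make the minimum error-correction (MEC) view concrete, the sketch below scores a candidate diploid haplotype against a set of reads and then greedily flips single bits while the score improves. It is an illustration under stated assumptions, not the paper's algorithm: the names `read_cost`, `mec_score`, and `bit_flip_assemble` are hypothetical, SNPs are assumed biallelic and coded 0/1, the second haplotype is assumed to be the bitwise complement of the first, and `None` marks positions a read does not cover. The greedy flip rule is a simple stand-in for the bit-flipping decoders used in communication systems.

```python
from typing import List, Optional, Sequence, Tuple

# A read holds one allele call (0 or 1) per SNP position,
# or None where the read does not cover that position.
Read = Sequence[Optional[int]]


def read_cost(read: Read, hap: Sequence[int]) -> int:
    """Mismatches between a read and the closer of hap or its complement."""
    covered = sum(1 for r in read if r is not None)
    mismatches = sum(1 for r, h in zip(read, hap) if r is not None and r != h)
    # Mismatches against the complement are exactly the matches against hap.
    return min(mismatches, covered - mismatches)


def mec_score(reads: Sequence[Read], hap: Sequence[int]) -> int:
    """Total number of read entries that must be corrected (MEC objective)."""
    return sum(read_cost(read, hap) for read in reads)


def bit_flip_assemble(
    reads: Sequence[Read], init: Sequence[int], max_iters: int = 100
) -> Tuple[List[int], int]:
    """Greedy single-bit flipping: keep a flip only if it lowers the MEC score."""
    hap = list(init)
    best = mec_score(reads, hap)
    for _ in range(max_iters):
        improved = False
        for j in range(len(hap)):
            hap[j] ^= 1  # tentatively flip SNP j
            score = mec_score(reads, hap)
            if score < best:
                best = score
                improved = True
            else:
                hap[j] ^= 1  # undo the flip
        if not improved:
            break  # local minimum of the MEC score
    return hap, best


if __name__ == "__main__":
    # Toy data (made up for illustration): reads drawn from 0110 and its
    # complement 1001, with a single erroneous call in the last read.
    reads = [
        [0, 1, 1, None],
        [None, 1, 1, 0],
        [1, 0, 0, None],
        [None, 0, 0, 1],
        [0, 1, 0, None],
    ]
    print(bit_flip_assemble(reads, [0, 0, 0, 0]))
```

On this toy input the search settles on the pair 1001/0110 with an MEC score of 1, which corresponds to the single injected error. Like any greedy local search, it can stall in a local minimum on harder instances, so the result depends on the starting haplotype.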
