Computational Haplotyping: Theory and Practice

Genomics has opened new ways to understand life and its evolution, and to investigate the causes of diseases and their treatment. One important problem in genomic analysis is haplotype assembly. Constructing complete and accurate haplotypes plays an essential role in understanding population genetics and how species evolve. In this thesis, we focus on computational approaches to haplotype assembly from third-generation sequencing technologies. This involves huge amounts of sequencing data, and such data contain errors due to the single-molecule sequencing protocols employed. Combinatorial formulations help to correct for these errors when solving the haplotyping problem. Various computational techniques, such as dynamic programming, parameterized algorithms, and graph algorithms, are used to solve this problem. This thesis presents several contributions to the area of haplotyping. First, a novel algorithm based on dynamic programming is proposed that provides approximation guarantees for phasing a single individual. Second, an integrative approach is introduced for combining multiple sequencing datasets to generate complete and accurate haplotypes. The effectiveness of this integrative approach is demonstrated on a real human genome. Third, we provide a novel, efficient approach to phasing pedigrees and demonstrate its advantages in comparison to phasing a single individual. Fourth, we present a generalized graph-based framework for performing haplotype-aware de novo assembly. Specifically, this generalized framework consists of a hybrid pipeline for generating accurate and complete haplotypes from data stemming from multiple sequencing technologies, one that provides accurate reads and another that provides long reads.
