GenHap: a novel computational method based on genetic algorithms for haplotype assembly

The computational problem of inferring the full haplotype of a cell starting from read sequencing data is known as haplotype assembly, and consists in assigning all heterozygous Single Nucleotide Polymorphisms (SNPs) to exactly one of the two chromosomes. Indeed, the knowledge of complete haplotypes is generally more informative than analyzing single SNPs and plays a fundamental role in many medical applications. To reconstruct the two haplotypes, we addressed the weighted Minimum Error Correction (wMEC) problem, which is a successful approach for haplotype assembly. This NP-hard problem consists in computing the two haplotypes that partition the sequencing reads into two disjoint sub-sets, with the least number of corrections to the SNP values. To this aim, we propose here GenHap, a novel computational method for haplotype assembly based on Genetic Algorithms, yielding optimal solutions by means of a global search process. In order to evaluate the effectiveness of our approach, we run GenHap on two synthetic (yet realistic) datasets, based on the Roche/454 and PacBio RS II sequencing technologies. We compared the performance of GenHap against HapCol, an efficient state-of-the-art algorithm for haplotype phasing. Our results show that GenHap always obtains high accuracy solutions (in terms of haplotype error rate), and is up to 4x faster than HapCol in the case of Roche/454 instances and up to 20x faster when compared on the PacBio RS II dataset. Finally, we assessed the performance of GenHap on two different real datasets. Future-generation sequencing technologies, producing longer reads with higher coverage, can highly benefit from GenHap, thanks to its capability of efficiently solving large instances of the haplotype assembly problem.

[1]  Kalyanmoy Deb,et al.  An Evolutionary Many-Objective Optimization Algorithm Using Reference-Point-Based Nondominated Sorting Approach, Part I: Solving Problems With Box Constraints , 2014, IEEE Transactions on Evolutionary Computation.

[2]  J. Zook,et al.  Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls , 2013, Nature Biotechnology.

[3]  Zhi-Zhong Chen,et al.  Exact algorithms for haplotype assembly from whole-genome sequence data , 2013, Bioinform..

[4]  Dmitry Pushkarev,et al.  Whole-genome haplotyping using long reads and statistical methods , 2014, Nature Biotechnology.

[5]  Eleazar Eskin,et al.  Optimal algorithms for haplotype assembly from whole-genome sequence data , 2010, Bioinform..

[6]  Pietro Liò,et al.  NuChart: An R Package to Study Gene Spatial Neighbourhoods with Multi-Omics Annotations , 2013, PloS one.

[7]  Benedict Paten,et al.  Improved data analysis for the MinION nanopore sequencer , 2015, Nature Methods.

[8]  Fengzhu Sun,et al.  Haplotype block structure and its applications to association studies: power and study designs. , 2002, American journal of human genetics.

[9]  Leo van Iersel,et al.  WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads , 2015, J. Comput. Biol..

[10]  David E. Goldberg,et al.  Genetic Algorithms in Search Optimization and Machine Learning , 1988 .

[11]  Timothy B. Stockwell,et al.  The Diploid Genome Sequence of an Individual Human , 2007, PLoS biology.

[12]  C. Ponting,et al.  Sequencing depth and coverage: key considerations in genomic analyses , 2014, Nature Reviews Genetics.

[13]  Joong Chae Na,et al.  PEATH: single-individual haplotyping by a probabilistic evolutionary algorithm with toggling , 2018, Bioinform..

[14]  Paola Bonizzoni,et al.  HapCol: accurate and memory-efficient haplotype assembly from long reads , 2016, Bioinform..

[15]  Richard Durbin,et al.  A large genome center's improvements to the Illumina sequencing system , 2008, Nature Methods.

[16]  Mauricio O. Carneiro,et al.  Pacific biosciences sequencing technology for genotyping and variation discovery in human data , 2012, BMC Genomics.

[17]  Gonçalo R. Abecasis,et al.  The variant call format and VCFtools , 2011, Bioinform..

[18]  Kang Zhang,et al.  Identification of methylation haplotype blocks aids in deconvolution of heterogeneous tissue samples and tumor tissue-of-origin mapping from plasma DNA , 2017, Nature Genetics.

[19]  Eric Boerwinkle,et al.  Understanding the accuracy of statistical haplotype inference with sequence data of known phase , 2007, Genetic epidemiology.

[20]  Zohar Yakhini,et al.  Extending partial haplotypes to full genome haplotypes using chromosome conformation capture data , 2016 .

[21]  James E. Baker,et al.  Adaptive Selection Methods for Genetic Algorithms , 1985, International Conference on Genetic Algorithms.

[22]  Paola Bonizzoni,et al.  HapCHAT: Adaptive haplotype assembly for efficiently leveraging high coverage in long reads , 2017 .

[23]  David E. Goldberg,et al.  Genetic Algorithms, Tournament Selection, and the Effects of Noise , 1995, Complex Syst..

[24]  Paola Bonizzoni,et al.  On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction Problem , 2015, CPM.

[25]  Volodymyr Kuleshov,et al.  Probabilistic single-individual haplotyping , 2014, Bioinform..

[26]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[27]  S. Gabriel,et al.  The Structure of Haplotype Blocks in the Human Genome , 2002, Science.

[28]  Bonnie Berger,et al.  HapTree: A Novel Bayesian Framework for Single Individual Polyplotyping Using NGS Data , 2014, PLoS Comput. Biol..

[29]  Ivan Merelli,et al.  PWHATSHAP: efficient haplotyping for future generation sequencing , 2016, BMC Bioinformatics.

[30]  Sorin Istrail,et al.  Haplotype assembly in polyploid genomes and identical by descent shared tracts , 2013, Bioinform..

[31]  Kin-Fan Au,et al.  PacBio Sequencing and Its Applications , 2015, Genom. Proteom. Bioinform..

[32]  Albert Y. Zomaya,et al.  Using genetic algorithm in reconstructing single individual haplotype with minimum error correction , 2012, J. Biomed. Informatics.

[33]  Giovanni Pezzulo,et al.  Divide et impera: subgoaling reduces the complexity of probabilistic inference and problem solving , 2015, Journal of The Royal Society Interface.

[34]  M. Daly,et al.  Genome-wide association studies for common diseases and complex traits , 2005, Nature Reviews Genetics.

[35]  Jorge Duitama,et al.  ReFHap: a reliable and fast algorithm for single individual haplotyping , 2010, BCB '10.

[36]  Richard J. Roberts,et al.  The advantages of SMRT sequencing , 2013, Genome Biology.

[37]  T. Thomas,et al.  GemSIM: general, error-model based simulator of next-generation sequencing data , 2012, BMC Genomics.

[38]  Peter M. Rice,et al.  The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants , 2009, Nucleic acids research.

[39]  Russell Schwartz,et al.  Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem , 2002, Briefings Bioinform..

[40]  Matthew W. Snyder,et al.  Haplotype-resolved genome sequencing: experimental methods and applications , 2015, Nature Reviews Genetics.

[41]  G. McVean,et al.  Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications , 2014, Nature Genetics.

[42]  M. Daly,et al.  High-resolution haplotype structure in the human genome , 2001, Nature Genetics.

[43]  Xiang-Sun Zhang,et al.  Haplotype reconstruction from SNP fragments by minimum error correction , 2005, Bioinform..

[44]  M. Nachman,et al.  Single nucleotide polymorphisms and recombination rate in humans. , 2001, Trends in genetics : TIG.

[45]  Hugh E. Olsen,et al.  The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community , 2016, Genome Biology.

[46]  Harvey J. Greenberg,et al.  Opportunities for Combinatorial Optimization in Computational Biology , 2004, INFORMS J. Comput..

[47]  Marco S. Nobile,et al.  GPU-Powered Multi-Swarm Parameter Estimation of Biological Systems: A Master-Slave Approach , 2018, 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).

[48]  David Haussler,et al.  The UCSC Genome Browser database: 2018 update , 2017, Nucleic Acids Res..