HapCol: accurate and memory-efficient haplotype assembly from long reads

MOTIVATION Haplotype assembly is the computational problem of reconstructing haplotypes in diploid organisms and is of fundamental importance for characterizing the effects of single-nucleotide polymorphisms on the expression of phenotypic traits. Haplotype assembly highly benefits from the advent of 'future-generation' sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods are not able to deal with such data in a fully satisfactory way, either because accuracy or performances degrade as read length and sequencing coverage increase or because they are based on restrictive assumptions. RESULTS By exploiting a feature of future-generation technologies-the uniform distribution of sequencing errors-we designed an exact algorithm, called HapCol, that is exponential in the maximum number of corrections for each single-nucleotide polymorphism position and that minimizes the overall error-correction score. We performed an experimental analysis, comparing HapCol with the current state-of-the-art combinatorial methods both on real and simulated data. On a standard benchmark of real data, we show that HapCol is competitive with state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on realistically simulated datasets revealed that HapCol requires significantly less computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, allowing to phase datasets with higher coverage and without the traditional all-heterozygous assumption. AVAILABILITY AND IMPLEMENTATION Our source code is available under the terms of the GNU General Public License at http://hapcol.algolab.eu/ CONTACT bonizzoni@disco.unimib.it SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

[1]  Leo van Iersel,et al.  WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads , 2015, J. Comput. Biol..

[2]  K. Verstrepen,et al.  Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques , 2011, Nucleic acids research.

[3]  Russell Schwartz,et al.  Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem , 2002, Briefings Bioinform..

[4]  Paola Bonizzoni,et al.  On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction Problem , 2015, CPM.

[5]  Harvey J. Greenberg,et al.  Opportunities for Combinatorial Optimization in Computational Biology , 2004, INFORMS J. Comput..

[6]  Benedict Paten,et al.  Improved data analysis for the MinION nanopore sequencer , 2015, Nature Methods.

[7]  Leo van Iersel,et al.  The Complexity of the Single Individual SNP Haplotyping Problem , 2005, Algorithmica.

[8]  M. DePristo,et al.  A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.

[9]  Mauricio O. Carneiro,et al.  The advantages of SMRT sequencing , 2013, Genome Biology.

[10]  Donald Ervin Knuth,et al.  The Art of Computer Programming , 1968 .

[11]  Paola Bonizzoni,et al.  An Efficient Algorithm for Haplotype Inference on Pedigrees with Recombinations and Mutations , 2012, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[12]  Mauricio O. Carneiro,et al.  Pacific biosciences sequencing technology for genotyping and variation discovery in human data , 2012, BMC Genomics.

[13]  Volodymyr Kuleshov,et al.  Probabilistic single-individual haplotyping , 2014, Bioinform..

[14]  Russell Schwartz,et al.  SNPs Problems, Complexity, and Algorithms , 2001, ESA.

[15]  Jorge Duitama,et al.  ReFHap: a reliable and fast algorithm for single individual haplotyping , 2010, BCB '10.

[16]  Sorin Istrail,et al.  HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data , 2012, J. Comput. Biol..

[17]  Zhi-Zhong Chen,et al.  Exact algorithms for haplotype assembly from whole-genome sequence data , 2013, Bioinform..

[18]  Dmitry Pushkarev,et al.  Whole-genome haplotyping using long reads and statistical methods , 2014, Nature Biotechnology.

[19]  Vineet Bafna,et al.  HapCUT: an efficient and accurate algorithm for the haplotype assembly problem , 2008, ECCB.

[20]  Eleazar Eskin,et al.  Optimal algorithms for haplotype assembly from whole-genome sequence data , 2010, Bioinform..

[21]  Eleazar Eskin,et al.  Hap-seq: An Optimal Algorithm for Haplotype Phasing with Imputation Using Sequencing Data , 2012, RECOMB.

[22]  Andrew Kasarskis,et al.  Validation of ITD mutations in FLT3 as a therapeutic target in human acute myeloid leukaemia , 2012, Nature.

[23]  B. Browning,et al.  Haplotype phasing: existing methods and new developments , 2011, Nature Reviews Genetics.

[24]  Leo van Iersel,et al.  WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads , 2014, RECOMB.

[25]  Joshua S. Paul,et al.  Genotype and SNP calling from next-generation sequencing data , 2011, Nature Reviews Genetics.