A fast and practical approach to genotype phasing and imputation on a pedigree with erroneous and incomplete information

The MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances that have improved its efficiency, its applicability to real data sets has been limited because it does not take into account important phenomena such as mutations, genotyping errors, and missing data. In this work, we propose the MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION WITH BOUNDED ERRORS problem (MRHCE), which extends the original MRHC formulation by incorporating the two most common characteristics of real data: errors and missing genotypes (including untyped individuals). We describe a practical algorithm for MRHCE that is based on a reduction to the well-known Satisfiability problem (SAT) and exploits recent advances in the constraint programming literature. An experimental analysis demonstrates the biological soundness of the phasing model and the effectiveness of the algorithm, in both accuracy and performance, under several scenarios. The analysis on real data and the comparison with state-of-the-art programs reveal that our approach couples better scalability to large and complex pedigrees with the explicit inclusion of genotyping errors in the model.
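
To make the idea of bounding errors inside a SAT instance concrete, the following is a minimal sketch of how an "at most k errors" condition could be written as CNF clauses. The abstract does not specify the encoding used for MRHCE, so this is an assumption for illustration only: it uses the naive binomial encoding (every subset of k + 1 error variables must contain at least one false variable), and the names `at_most_k`, `count_models`, and `error_vars` are hypothetical. More compact cardinality encodings exist and would be preferable on real pedigrees.

```python
from itertools import combinations, product


def at_most_k(variables, k):
    """Return CNF clauses forcing at most k of `variables` to be true.

    Variables are positive integers; a negative integer denotes negation
    (DIMACS-style literals). Naive binomial encoding, assumed here only
    for illustration: each subset of k + 1 variables gets one clause
    saying that at least one of them is false.
    """
    return [[-v for v in subset] for subset in combinations(variables, k + 1)]


def satisfies(clauses, assignment):
    """Check whether `assignment` (variable -> bool) satisfies every clause."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def count_models(variables, clauses):
    """Count satisfying assignments by brute force (tiny instances only)."""
    return sum(
        satisfies(clauses, dict(zip(variables, values)))
        for values in product([False, True], repeat=len(variables))
    )


# Hypothetical example: four Boolean "this genotype is erroneous" variables,
# with at most one error allowed.
error_vars = [1, 2, 3, 4]
clauses = at_most_k(error_vars, 1)
```

With four error variables and a bound of one, the sketch produces one clause per pair of variables, and brute-force enumeration confirms that only assignments with zero or one error survive. In a full reduction these clauses would be combined with the clauses that describe Mendelian consistency and recombination events, which are not shown here.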
