An Efficient Algorithm for Haplotype Inference on Pedigrees with a Small Number of Recombinants

Combinatorial (or rule-based) methods for inferring haplotypes from genotypes on a pedigree have been studied extensively in the recent literature. These methods generally try to reconstruct the haplotypes of each individual so that the total number of recombinants in the pedigree is minimized. The problem is NP-hard, although the number of recombinants in a practical dataset is known to be usually very small. In this paper, we consider how to efficiently infer haplotypes on a large pedigree when the number of recombinants is bounded by a small constant, i.e., the so-called k-recombinant haplotype configuration (k-RHC) problem. We introduce a simple probabilistic model for k-RHC in which the prior haplotype probability of a founder and the haplotype transmission probability from a parent to a child are both assumed to follow the uniform distribution, and k random recombination events are assumed to have taken place uniformly and independently in the pedigree. We present an $O(mn\log^{k+1} n)$ time algorithm for k-RHC on tree pedigrees without mating loops, where m is the number of loci and n is the size of the input pedigree. We prove that when $90\log n < m < n^{3}$, the algorithm correctly finds a feasible haplotype configuration that obeys the Mendelian law of inheritance and requires no more than k recombinants with probability $1 - O\left(k^{2}\frac{\log^{2} n}{mn}+\frac{1}{n^{2}}\right)$. The algorithm is efficient when k is of moderate value and could thus be used in practice to infer haplotypes from genotypes on large tree pedigrees. We have implemented the algorithm as a C++ program named Tree-k-RHC. The implementation incorporates several ideas for dealing effectively with missing data and with data containing a large number of recombinants. Our experimental results on both simulated and real datasets show that Tree-k-RHC reconstructs haplotypes with high accuracy and is much faster than the best combinatorial method in the literature.
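To make the notion of a recombinant concrete, the following is a minimal sketch, not the Tree-k-RHC algorithm itself. It assumes biallelic loci coded as 0/1 and fully known haplotypes, and it counts the fewest recombination events needed for one parent to transmit a given haplotype to a child. The function name `min_recombinants` and the list encoding are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence

INF = float("inf")


def min_recombinants(hap_a: Sequence[int],
                     hap_b: Sequence[int],
                     child: Sequence[int]) -> Optional[int]:
    """Fewest switches between a parent's two haplotypes (hap_a, hap_b)
    that reproduce the haplotype passed to the child.

    Returns None when some locus of `child` matches neither parental
    haplotype, i.e. the transmission violates Mendelian inheritance.
    This is an illustrative single-transmission count only.
    """
    if not (len(hap_a) == len(hap_b) == len(child)):
        raise ValueError("all haplotypes must cover the same loci")

    # cost[s] = fewest switches so far if the current locus is copied
    # from haplotype s (0 for hap_a, 1 for hap_b).
    cost = [0, 0]
    for a, b, c in zip(hap_a, hap_b, child):
        from_a = min(cost[0], cost[1] + 1)  # stay on a, or switch b -> a
        from_b = min(cost[1], cost[0] + 1)  # stay on b, or switch a -> b
        cost = [from_a if a == c else INF,
                from_b if b == c else INF]

    best = min(cost)
    return None if best == INF else int(best)
```

Under these assumptions, a child haplotype `[0, 0, 1, 1]` inherited from a parent with haplotypes `[0, 0, 0, 0]` and `[1, 1, 1, 1]` requires one recombinant. The k-RHC problem asks for an assignment of haplotypes to every pedigree member such that the sum of such counts over all parent-child transmissions is at most k.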
