Combining haplotypers

Statistically resolving the underlying haplotype pair for a genotype measurement is an important intermediate step in gene mapping studies, and has received much attention recently. Consequently, a variety of methods for this problem have been developed. Different methods employ different statistical models, and thus implicitly encode different assumptions about the nature of the underlying haplotype structure. Depending on the population sample in question, their relative performance can vary greatly, and it is unclear which method to choose for a particular sample. Instead of choosing a single method, we explore combining the predictions returned by different methods in a principled way, thereby circumventing the problem of method selection. We propose several techniques for combining haplotype reconstructions and analyze their computational properties. In an experimental study on real-world haplotype data we show that such techniques can provide more accurate and robust reconstructions, and are useful for outlier detection. Typically, the combined prediction is at least as accurate as the best individual method, and can be more accurate.
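
To make the idea of combining reconstructions concrete, the following is a minimal sketch of one simple possibility: a per-genotype plurality vote over the haplotype pairs predicted by several methods. It is not taken from the paper and is not claimed to be one of the techniques the authors propose. The function names (`canonical`, `is_consistent`, `plurality_vote`) and the genotype encoding ('0' and '1' for homozygous sites, '2' for heterozygous sites) are assumptions made for illustration only.

```python
from collections import Counter


def canonical(pair):
    """Treat a haplotype pair as unordered: (a, b) and (b, a) are the same."""
    return tuple(sorted(pair))


def is_consistent(genotype, pair):
    """Check that a haplotype pair explains a genotype.

    Assumed encoding (illustrative, not from the paper): genotype sites are
    '0' or '1' when homozygous and '2' when heterozygous; haplotype alleles
    are '0' or '1'.
    """
    h1, h2 = pair
    if not (len(genotype) == len(h1) == len(h2)):
        return False
    for g, a, b in zip(genotype, h1, h2):
        if g == "2":
            if a == b:  # a heterozygous site needs two different alleles
                return False
        elif not (a == b == g):  # a homozygous site needs both alleles equal to g
            return False
    return True


def plurality_vote(genotype, predictions):
    """Return the most frequently predicted consistent pair and its vote share.

    `predictions` holds one haplotype pair per method. Ties are broken in
    favour of the pair that was seen first.
    """
    votes = Counter(
        canonical(p) for p in predictions if is_consistent(genotype, p)
    )
    if not votes:
        return None, 0.0
    pair, count = votes.most_common(1)[0]
    return pair, count / len(predictions)


# Three hypothetical methods reconstruct the same genotype.
genotype = "2102"
predictions = [
    ("0100", "1101"),  # method A
    ("1101", "0100"),  # method B: same pair as A, listed in the other order
    ("0101", "1100"),  # method C disagrees on the phase of the last site
]
best_pair, share = plurality_vote(genotype, predictions)
```

In this sketch a low vote share for a genotype signals that the methods disagree on it, which is one conceivable way to flag candidates for closer inspection; whether that matches the outlier detection studied in the paper is an assumption.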
