Multivariate Imputation of Genotype Data Using Short and Long Range Disequilibrium

Missing values are a common problem in genetic data. In this paper we explore several machine learning techniques for building models that impute missing genotypes from multiple genetic markers. We map these techniques to different patterns of transmission and, in particular, contrast the effects of short and long range disequilibrium between markers. Under the assumption of short range disequilibrium, only physically close genetic variants are informative for reconstructing missing genotypes; long range disequilibrium relaxes this assumption, so that physically distant variants also become informative for imputation. We evaluate the accuracy of a flexible feature selection model that fits both patterns of transmission on six real datasets of single nucleotide polymorphisms (SNPs). The results show increased accuracy compared with standard imputation models.
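
The contrast between the two assumptions can be illustrated with a small sketch. It is not the model evaluated in the paper: the mutual-information ranking, the majority-vote imputation, the 0/1/2 genotype coding, the toy data, and all function names (`mutual_information`, `select_markers`, `impute`) are assumptions made for illustration only. Setting `window` restricts candidate predictors to physically nearby markers (short range), while `window=None` lets feature selection pick any marker (long range).

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())

def select_markers(rows, target, k, window=None):
    """Return the k markers most informative about the target marker.

    window=None: any marker may be selected (long range assumption).
    window=w:    only markers within w positions of the target are
                 candidates (short range assumption).
    """
    complete = [r for r in rows if r[target] is not None]
    candidates = [j for j in range(len(rows[0]))
                  if j != target and (window is None or abs(j - target) <= window)]

    def score(j):
        pairs = [(r[j], r[target]) for r in complete if r[j] is not None]
        return mutual_information(*zip(*pairs)) if pairs else 0.0

    return sorted(candidates, key=score, reverse=True)[:k]

def impute(rows, sample, target, markers):
    """Majority vote over complete samples that match `sample` on `markers`."""
    complete = [r for r in rows if r[target] is not None]
    matches = [r[target] for r in complete
               if all(r[j] == sample[j] for j in markers)]
    pool = matches or [r[target] for r in complete]  # fall back to overall mode
    return Counter(pool).most_common(1)[0][0]

# Toy data (hypothetical): rows are individuals, columns are markers in
# physical order, genotypes coded 0/1/2, None marks a missing genotype.
# Marker 4 is distant from marker 0 but tracks it exactly; marker 1 is
# adjacent to marker 0 but carries no information about it.
rows = [
    [0, 0, 1, 2, 0],
    [1, 0, 0, 1, 1],
    [2, 1, 1, 0, 2],
    [0, 1, 0, 1, 0],
    [1, 1, 2, 2, 1],
    [2, 0, 2, 0, 2],
    [None, 0, 1, 1, 2],  # genotype at marker 0 is missing
]
sample = rows[-1]

short_range = select_markers(rows, target=0, k=1, window=1)
long_range = select_markers(rows, target=0, k=1)
print("short range predictors:", short_range)
print("long range predictors:", long_range)
print("long range imputation:", impute(rows, sample, 0, long_range))
```

In this toy setting the windowed selection can only use the uninformative neighbour, whereas the unrestricted selection finds the distant marker that determines the missing genotype. On real SNP data the two would differ less sharply, and the choice of score and classifier would need to be validated.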
