Imputing missing genotypic data of single-nucleotide polymorphisms using neural networks

With advances in high-throughput single-nucleotide polymorphism (SNP) genotyping, the amount of genotype data available for genetic studies is steadily increasing. This growth makes it possible to study multigene interactions and to develop higher-dimensional genetic models that more closely represent the polygenic nature of common disease risk. However, the combined impact of even small amounts of missing data on a multi-SNP analysis may be considerable. In this study, we present a neural network method for imputing missing SNP genotype data, and we compare its imputation accuracy with that of fastPHASE and an expectation–maximization algorithm implemented in HelixTree. In a simulated data set of 1000 SNPs and 1000 subjects, 1, 5 and 10% of genotypes were randomly masked. Four levels of linkage disequilibrium (LD) were examined to evaluate the impact of LD on imputation accuracy: R² < 0.2, R² < 0.5, R² < 0.8 and no LD threshold. All three methods imputed most missing genotypes accurately (accuracy >86%), and the neural network method correctly predicted 92.0–95.9% of the missing genotypes. In a real data set of 419 subjects and 126 SNPs from chromosome 2, the neural network method achieved the highest imputation accuracy (>83.1%) at missing rates from 1 to 5%. In a data set of 90 HapMap subjects with 1962 SNPs, fastPHASE had the highest accuracy (∼97%), while the other two methods had >95% accuracy. These results indicate that the neural network model is an accurate and convenient tool for SNP data recovery that requires minimal parameter tuning, and that it provides a valuable alternative to the usual complete-case analysis.
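The masking-and-scoring protocol described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function names, the 0/1/2 genotype coding, the use of `None` as the missing marker, and the per-SNP most-common-genotype imputer are all assumptions introduced here. The simple imputer is a stand-in for whichever method is under evaluation (the neural network, fastPHASE, or the expectation–maximization algorithm).

```python
import random
from collections import Counter

MISSING = None  # assumed marker for a masked genotype


def mask_genotypes(genotypes, rate, seed=0):
    """Return a copy of the subject-by-SNP matrix with a fraction `rate`
    of cells masked at random, plus the list of masked (row, col) positions."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(genotypes) for j in range(len(row))]
    masked_positions = rng.sample(cells, round(rate * len(cells)))
    masked = [list(row) for row in genotypes]
    for i, j in masked_positions:
        masked[i][j] = MISSING
    return masked, masked_positions


def impute_by_mode(masked):
    """Placeholder imputer (assumption): fill each missing cell with the
    most common observed genotype at that SNP; default to 0 if none observed."""
    n_snps = len(masked[0])
    modes = []
    for j in range(n_snps):
        observed = [row[j] for row in masked if row[j] is not MISSING]
        modes.append(Counter(observed).most_common(1)[0][0] if observed else 0)
    return [[modes[j] if g is MISSING else g for j, g in enumerate(row)]
            for row in masked]


def imputation_accuracy(truth, imputed, masked_positions):
    """Fraction of masked genotypes that were recovered exactly."""
    correct = sum(truth[i][j] == imputed[i][j] for i, j in masked_positions)
    return correct / len(masked_positions)


if __name__ == "__main__":
    # Toy data: 100 subjects x 50 SNPs, genotypes coded 0/1/2 (assumed coding).
    rng = random.Random(42)
    truth = [[rng.choice([0, 0, 1, 2]) for _ in range(50)] for _ in range(100)]
    for rate in (0.01, 0.05, 0.10):  # the missing rates used in the abstract
        masked, positions = mask_genotypes(truth, rate, seed=1)
        imputed = impute_by_mode(masked)
        print(rate, imputation_accuracy(truth, imputed, positions))
```

Accuracy is computed only over the masked cells, since those are the only genotypes the imputer had to predict. Swapping `impute_by_mode` for another imputer leaves the masking and scoring steps unchanged.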
