Genotype imputation via matrix completion

Most current genotype imputation methods are model-based and computationally intensive, taking days to impute one chromosome pair on 1,000 people. We describe an efficient genotype imputation method based on matrix completion. The method is implemented in MATLAB and tested on real data from HapMap 3, simulated pedigree data, and simulated low-coverage sequencing data derived from the 1000 Genomes Project. Compared with leading imputation programs, the matrix completion algorithm embodied in our program MENDEL-IMPUTE achieves comparable imputation accuracy while reducing run times significantly. Implementation in a lower-level language such as Fortran or C is likely to improve computational efficiency further.
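To make the idea of imputation by matrix completion concrete, the following minimal sketch treats genotypes as minor-allele counts (0, 1, or 2) in a people-by-SNP matrix and fills the missing entries from a low-rank fit. It is not the MENDEL-IMPUTE algorithm, which the abstract does not spell out. It assumes a rank-one model fitted by alternating least squares, and the function names, the ridge constant, and the toy data are all hypothetical.

```python
def fit_rank_one(X, n_iter=200, ridge=1e-6):
    """Fit X[i][j] ~ u[i] * v[j] on the observed (non-None) entries.

    Assumption: a rank-one model with a tiny ridge term that only guards
    against division by zero. This is an illustrative stand-in, not the
    method used by MENDEL-IMPUTE.
    """
    n, m = len(X), len(X[0])
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Least-squares update of each row factor with column factors fixed.
        for i in range(n):
            obs = [j for j in range(m) if X[i][j] is not None]
            num = sum(X[i][j] * v[j] for j in obs)
            den = sum(v[j] ** 2 for j in obs) + ridge
            u[i] = num / den
        # Least-squares update of each column factor with row factors fixed.
        for j in range(m):
            obs = [i for i in range(n) if X[i][j] is not None]
            num = sum(X[i][j] * u[i] for i in obs)
            den = sum(u[i] ** 2 for i in obs) + ridge
            v[j] = num / den
    return u, v


def impute_genotypes(X, **kwargs):
    """Keep observed genotypes; replace None by the fit rounded into {0, 1, 2}."""
    u, v = fit_rank_one(X, **kwargs)
    return [
        [
            X[i][j] if X[i][j] is not None
            else min(2, max(0, round(u[i] * v[j])))
            for j in range(len(v))
        ]
        for i in range(len(u))
    ]


# Hypothetical toy data: 4 people x 4 SNPs, exactly rank one, 4 entries missing.
truth = [
    [1, 0, 1, 1],
    [2, 0, 2, 2],
    [1, 0, 1, 1],
    [2, 0, 2, 2],
]
observed = [
    [1, 0, None, 1],
    [None, 0, 2, 2],
    [1, None, 1, 1],
    [2, 0, 2, None],
]

imputed = impute_genotypes(observed)
for row in imputed:
    print(row)
```

Real genotype matrices are only approximately low rank, and only locally, because of linkage disequilibrium. A practical method would therefore need a higher rank or a regularized fit, and some way of restricting the fit to nearby SNPs. The sketch is meant only to show why a low-rank assumption lets observed entries determine missing ones.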
