An ensemble-based approach to imputation of moderate-density genotypes for genomic selection with application to Angus cattle.

Summary: Imputation of moderate-density genotypes from low-density panels is of increasing interest in genomic selection because it can dramatically reduce genotyping costs. Several imputation software packages have been developed, but they vary in imputation accuracy, and imputed genotypes may be inconsistent among methods. An AdaBoost-like approach is proposed to combine imputation results from six independent software packages, namely Beagle (v3.3), IMPUTE (v2.0), fastPHASE (v1.4), AlphaImpute, findhap (v2) and Fimpute (v2), with each package serving as a basic classifier in an ensemble-based system. The ensemble-based method computes weights sequentially for all classifiers and combines the results from the component methods via weighted majority 'voting' to determine unknown genotypes. The data included 3078 registered Angus cattle, each genotyped with the Illumina BovineSNP50 BeadChip. SNP genotypes on three chromosomes (BTA1, BTA16 and BTA28) were used to compare imputation accuracy among methods, and the application involved the imputation of 50K genotypes covering 29 chromosomes from a set of 5K genotypes. Beagle and Fimpute had the greatest accuracy among the six imputation packages, whose accuracies ranged from 0.8677 to 0.9858. The proposed ensemble method was more accurate than any of these packages, but the sequence of independent classifiers in the voting scheme affected imputation accuracy. The ensemble systems yielding the best imputation accuracies were those that had Beagle as the first classifier, followed by one or two methods that utilized pedigree information. A salient feature of the proposed ensemble method is that it can resolve imputation inconsistencies among different imputation methods, hence leading to a more reliable system for imputing genotypes than any of the independent methods.
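The sequential weighting and weighted majority vote described in the summary can be illustrated with a minimal Python sketch. The summary does not give the exact update rule, so the code below assumes the standard AdaBoost.M1 scheme: each package's weighted error on a validation set of masked, known genotypes sets its voting weight, and correctly called genotypes are then down-weighted before the next package is scored. The function names, the 0/1/2 genotype coding, and the clamping of voting weights at zero are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from collections import defaultdict


def classifier_weights(predictions, truth):
    """Compute AdaBoost.M1-style voting weights (assumed update rule).

    predictions: one list of genotype calls per package, in voting order.
    truth: known genotypes (coded 0/1/2) at masked validation loci.
    """
    n = len(truth)
    sample_w = [1.0 / n] * n          # start with uniform sample weights
    voting_w = []
    for pred in predictions:
        # Weighted error of this package under the current sample weights.
        err = sum(w for w, p, t in zip(sample_w, pred, truth) if p != t)
        err = min(max(err, 1e-10), 1.0 - 1e-10)   # avoid log(0) and division by zero
        beta = err / (1.0 - err)
        # Assumption: a package no better than chance gets zero weight.
        voting_w.append(max(0.0, math.log(1.0 / beta)))
        # Down-weight genotypes this package called correctly, then renormalize,
        # so later packages are scored mainly on the harder genotypes.
        sample_w = [w * beta if p == t else w
                    for w, p, t in zip(sample_w, pred, truth)]
        total = sum(sample_w)
        sample_w = [w / total for w in sample_w]
    return voting_w


def weighted_vote(calls, voting_w):
    """Return the genotype with the largest summed voting weight."""
    score = defaultdict(float)
    for genotype, w in zip(calls, voting_w):
        score[genotype] += w
    return max(score, key=score.get)


# Toy validation data (made up for illustration only).
truth = [0, 1, 2, 1, 0, 2, 1, 0]
pkg_a = [1, 1, 2, 1, 0, 2, 1, 0]   # one miscall
pkg_b = [0, 0, 1, 1, 0, 2, 1, 0]   # two miscalls
weights = classifier_weights([pkg_a, pkg_b], truth)
```

Because the sample weights are updated after each package, the voting weight of a later package depends on which packages came before it. Under this assumed scheme, that is one way the order of classifiers could influence the final calls, in line with the order effect reported in the summary.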
