Training set optimization of genomic prediction by means of EthAcc

Genomic prediction is a useful tool for plant and animal breeding programs and is starting to be used to predict human diseases as well. One obstacle that slows the deployment of genomic selection is that the accuracy of the prediction is not known a priori. We propose EthAcc (Estimated THeoretical ACCuracy), a method for estimating this accuracy from a training set that has been genotyped and phenotyped. EthAcc relies on a causal quantitative trait loci model estimated by a genome-wide association study. Because the quality of this estimated causal model is crucial, we compared several methods to find the one yielding the best EthAcc; the multilocus mixed model performed best. We also compared EthAcc with accuracy estimators that can be derived from a mixed marker model, and EthAcc was the only approach that estimated the accuracy correctly. Moreover, in a structured population, EthAcc showed, in agreement with the accuracy actually achieved, that the largest training set is not always better than a smaller training set that is more closely related to the individuals being predicted. We then performed training set optimization with EthAcc and compared it to CDmean. EthAcc outperformed CDmean on real datasets from sugar beet, maize, and wheat. However, this performance was mainly due to starting the optimization algorithm from an optimal set that is inaccessible in practice. With a random start, limitations in EthAcc's precision and in the optimization algorithm prevent it from reaching a good training set. Despite this drawback, we demonstrated that a substantial gain in accuracy can be obtained by performing training set optimization.
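
The dependence of training set optimization on its starting set can be illustrated with a minimal sketch. The abstract does not specify the optimization algorithm, so a simple exchange search is assumed here, and the function name `optimize_training_set`, its `start` argument, and the toy criterion are hypothetical names introduced only for illustration. In a real application the criterion would be an accuracy estimator such as EthAcc or CDmean, not the distance-based stand-in used below.

```python
import random


def optimize_training_set(pool, size, criterion, n_iter=1000, seed=0, start=None):
    """Exchange search (an assumed algorithm, not the authors' implementation).

    At each step one selected individual is swapped with one unselected
    individual; the swap is kept only if the criterion strictly improves.
    """
    rng = random.Random(seed)
    pool = list(pool)
    if not 0 < size < len(pool):
        raise ValueError("size must be between 1 and len(pool) - 1")

    # Start from a user-supplied set, or from a random one.
    selected = list(start) if start is not None else rng.sample(pool, size)
    if len(selected) != size or len(set(selected)) != size:
        raise ValueError("start must contain exactly `size` distinct individuals")
    chosen = set(selected)
    unselected = [ind for ind in pool if ind not in chosen]

    best = criterion(selected)
    for _ in range(n_iter):
        i = rng.randrange(len(selected))
        j = rng.randrange(len(unselected))
        # Try the swap.
        selected[i], unselected[j] = unselected[j], selected[i]
        score = criterion(selected)
        if score > best:
            best = score
        else:
            # Revert the swap when it does not improve the criterion.
            selected[i], unselected[j] = unselected[j], selected[i]
    return selected, best


# Toy stand-in criterion (hypothetical): individuals sit on a one-dimensional
# "genetic coordinate", and training sets closer to the target score higher.
coordinates = {ind: float(ind) for ind in range(20)}
TARGET = 5.0


def toy_criterion(training_set):
    distances = [abs(coordinates[ind] - TARGET) for ind in training_set]
    return -sum(distances) / len(distances)


far_start = [15, 16, 17, 18, 19]
optimized, optimized_score = optimize_training_set(
    range(20), size=5, criterion=toy_criterion, n_iter=500, seed=1, start=far_start
)
```

Because a swap is accepted only when the criterion improves, the final score can never be lower than that of the starting set, but the search can stall in a local optimum. With a noisy criterion this makes the result sensitive to the starting set, which is consistent with the start-dependence reported above.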