Pitfalls and Remedies for Cross Validation with Multi-trait Genomic Prediction Methods

Incorporating measurements on correlated traits into genomic prediction models can increase prediction accuracy and selection gain. However, multi-trait genomic prediction models are complex and prone to overfitting, which may reduce prediction accuracy relative to single-trait genomic prediction. Cross-validation is considered the gold standard method for selecting and tuning models for genomic prediction in both plant and animal breeding. When used appropriately, cross-validation gives an accurate estimate of the prediction accuracy of a genomic prediction model and can effectively choose among disparate models based on their expected performance on real data. However, we show that a naive cross-validation strategy applied to the multi-trait prediction problem can be severely biased and can lead to sub-optimal choices between single- and multi-trait models. The bias arises when secondary traits are used to aid in the prediction of focal traits and these secondary traits are measured on the individuals to be tested. We use simulations to demonstrate the extent of the problem and propose three partial solutions: 1) a parametric solution from selection index theory, 2) a semi-parametric method for correcting the cross-validation estimates of prediction accuracy, and 3) a fully non-parametric method, which we call CV2*, that validates model predictions against focal trait measurements from genetically related individuals. The current excitement over high-throughput phenotyping suggests that more comprehensive phenotype measurements will be useful for accelerating breeding programs. Using an appropriate cross-validation strategy should more reliably determine whether and when combining information across multiple traits is useful.
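The bias described above can be illustrated with a toy simulation. The sketch below is not the paper's simulation design. It assumes a focal and a secondary trait with no genetic correlation but correlated non-genetic residuals, and uses the secondary-trait phenotype itself as a stand-in for a multi-trait prediction. The function name `simulate`, all parameter values, and the simple way a relative is generated are hypothetical choices made for illustration only.

```python
import numpy as np


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


def simulate(n=20000, h2=0.5, rg=0.0, re=0.6, rel=0.5, seed=0):
    """Toy comparison of three ways to score one prediction of a focal trait.

    Assumed (hypothetical) setup: both traits have heritability h2 and unit
    phenotypic variance; rg is the genetic correlation, re the residual
    correlation, and rel the genetic relationship between each test
    individual and one relative grown in an independent environment.
    """
    rng = np.random.default_rng(seed)
    G = h2 * np.array([[1.0, rg], [rg, 1.0]])          # genetic covariance
    E = (1.0 - h2) * np.array([[1.0, re], [re, 1.0]])  # residual covariance
    g = rng.multivariate_normal([0.0, 0.0], G, size=n)
    e = rng.multivariate_normal([0.0, 0.0], E, size=n)
    y = g + e  # column 0: focal trait, column 1: secondary trait

    # Focal-trait genetic value and phenotype of a relative whose residual
    # is independent of the test individual's residuals.
    g1_rel = rel * g[:, 0] + np.sqrt(1.0 - rel**2) * rng.normal(0.0, np.sqrt(h2), n)
    y1_rel = g1_rel + rng.normal(0.0, np.sqrt(1.0 - h2), n)

    # Stand-in prediction: the secondary trait measured on the test individual.
    pred = y[:, 1]

    return {
        "true": corr(pred, g[:, 0]),     # correlation with true genetic value
        "naive": corr(pred, y[:, 0]),    # validated on own focal phenotype
        "relative": corr(pred, y1_rel),  # validated on relative's phenotype
    }


res = simulate()
print(res)
```

Under these assumptions the prediction carries no information about the focal genetic value, yet validating it against the focal phenotype of the same individual yields a clearly positive correlation, because the shared residual enters both the prediction and the validation target. Validating against a relative's focal phenotype removes that shared residual, which is the idea behind the CV2* strategy named in the abstract.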
