Machine learning methods and predictive ability metrics for genome-wide prediction of complex traits

Abstract: Genome-wide prediction of complex traits has become increasingly important in animal and plant breeding, and is receiving growing attention in human genetics. The most common approaches are whole-genome regression models in which phenotypes are regressed on thousands of markers concurrently, applying different prior distributions to marker effects. While shrinkage or regularization in SNP regression models has improved predictive ability in genome-based evaluations, serious over-fitting problems may be encountered as the ratio of markers to available phenotypes continues to increase. Machine learning is an alternative approach for prediction and classification, capable of dealing with the dimensionality problem in a computationally flexible manner. In this article we provide an overview of non-parametric and machine learning methods used in genome-wide prediction, and discuss their similarities as well as their relationship to some well-known parametric approaches. Although the most suitable method is usually case dependent, we suggest the use of support vector machines and random forests for classification problems, whereas reproducing kernel Hilbert spaces regression and boosting may be better suited to regression problems, with the former having the more consistently higher predictive ability. Neural networks may suffer from over-fitting and may be too computationally demanding when the number of neurons is large. We further discuss the metrics used to evaluate predictive ability in model comparison under cross-validation from a genomic selection point of view. We suggest the use of predictive mean squared error as the main, but not the only, metric for model comparison. Visual tools may greatly assist in the choice of the most accurate model.
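As a rough illustration of evaluating predictive ability under cross-validation, the sketch below computes predictive mean squared error and the Pearson correlation from out-of-fold predictions. It is a minimal example under stated assumptions, not the procedure used in the article: the marker data are simulated, ridge regression stands in for a generic shrinkage model, and the function names (`ridge_fit`, `cross_validate`, `predictive_metrics`) and the penalty value are hypothetical choices made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression of phenotypes on markers with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Dual form: solve an n x n system, convenient when markers outnumber records.
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(len(y)), y - y_mean)
    beta = Xc.T @ alpha
    return beta, y_mean - x_mean @ beta

def cross_validate(X, y, lam, k=5, seed=0):
    """Return out-of-fold predictions; each record is predicted exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    y_hat = np.empty(len(y), dtype=float)
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        beta, intercept = ridge_fit(X[train], y[train], lam)
        y_hat[test] = X[test] @ beta + intercept
    return y_hat

def predictive_metrics(y, y_hat):
    """Predictive mean squared error and Pearson correlation."""
    return {
        "mse": float(np.mean((y - y_hat) ** 2)),
        "cor": float(np.corrcoef(y, y_hat)[0, 1]),
    }

# Simulated example (assumption): 300 records, 600 markers coded 0/1/2,
# 20 of which have a non-zero effect.
rng = np.random.default_rng(1)
n, p = 300, 600
X = rng.integers(0, 3, size=(n, p)).astype(float)
beta_true = np.zeros(p)
beta_true[:20] = rng.normal(0.0, 0.5, size=20)
y = X @ beta_true + rng.normal(0.0, 1.0, size=n)

y_hat = cross_validate(X, y, lam=300.0)
metrics = predictive_metrics(y, y_hat)

# Shifting every prediction by a constant leaves the correlation unchanged
# but inflates the mean squared error.
shifted = predictive_metrics(y, y_hat + 3.0)
print(metrics, shifted)
```

Comparing `metrics` with `shifted` shows one reason to report more than one metric: the correlation is blind to a systematic bias in the predictions, whereas the predictive mean squared error penalizes it.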
