Genomic-enabled prediction with classification algorithms

Pearson’s correlation coefficient (ρ) is the most commonly reported metric of prediction success in genomic selection (GS). In real breeding, however, ρ may be of limited use for assessing the quality of the regression in the tails of the distribution, where individuals are chosen for selection. This research used 14 maize and 16 wheat data sets with different trait–environment combinations. Six regression models were evaluated by means of a cross-validation scheme (50 random partitions each, with 90% of the individuals in the training set and 10% in the testing set). The accuracy of these algorithms for selecting individuals belonging to the best α = 10, 15, 20, 25, 30, 35 and 40% of the distribution was estimated using Cohen’s kappa coefficient (κ) and an ad hoc measure we call relative efficiency (RE), which indicates the expected genetic gain due to selection when individuals are selected exclusively on the basis of GS. We put special emphasis on the analysis at α = 15%, because this percentile is commonly used in plant breeding programmes (for example, at CIMMYT). We also used ρ as a criterion for overall success. The regression algorithms were Bayesian LASSO (BL), Ridge Regression (RR), Reproducing Kernel Hilbert Spaces regression (RKHS), Random Forest Regression (RFR), and Support Vector Regression (SVR) with linear (lin) and Gaussian (rbf) kernels. Their performance in selecting the best individuals was compared with that of three supervised classification algorithms: Random Forest Classification (RFC) and Support Vector Classification (SVC) with linear (lin) and Gaussian (rbf) kernels. Classification methods were evaluated under the same cross-validation scheme, but with the response vector of each original training set dichotomised at the given threshold.
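The evaluation protocol described above can be sketched as a minimal simulation. Here ridge regression stands in for RR on simulated marker data, and the RE formula is one plausible formalisation of “expected genetic gain from GS-based selection relative to ideal phenotypic selection”; the exact definition used in the study, and the hypothetical data, are assumptions.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import Ridge
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                    # marker matrix (individuals x SNPs), simulated
y = X[:, :5].sum(axis=1) + rng.normal(size=200)   # simulated phenotype with 5 causal markers

alpha_sel = 0.15                                  # selection percentile (alpha = 15%)
cv = ShuffleSplit(n_splits=50, test_size=0.10, random_state=1)

kappas, res = [], []
for train, test in cv.split(X):
    model = Ridge(alpha=1.0).fit(X[train], y[train])
    y_hat = model.predict(X[test])

    n_sel = max(1, int(round(alpha_sel * len(test))))
    # top-alpha individuals by prediction vs. by true phenotype
    sel_pred = np.zeros(len(test), dtype=int)
    sel_true = np.zeros(len(test), dtype=int)
    sel_pred[np.argsort(y_hat)[-n_sel:]] = 1
    sel_true[np.argsort(y[test])[-n_sel:]] = 1
    kappas.append(cohen_kappa_score(sel_true, sel_pred))

    # relative efficiency: gain achieved by GS-based selection over the
    # gain from ideal phenotypic selection (an assumed formalisation)
    mu = y[test].mean()
    gain_gs = y[test][sel_pred == 1].mean() - mu
    gain_ideal = y[test][sel_true == 1].mean() - mu
    res.append(gain_gs / gain_ideal)

print(np.mean(kappas), np.mean(res))
```

Averaging κ and RE over the 50 partitions yields the per-data-set summaries that the study compares across algorithms and percentiles.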
For α = 15%, SVC-lin presented the highest κ coefficients in 13 of the 14 maize data sets, with best values ranging from 0.131 to 0.722 (statistically significant in 9 data sets) and the best RE in the same 13 data sets, with values ranging from 0.393 to 0.948 (statistically significant in 12 data sets). RR produced the best mean for both κ and RE in one data set (0.148 and 0.381, respectively). Regarding the wheat data sets, SVC-lin presented the best κ in 12 of the 16 data sets, with outcomes ranging from 0.280 to 0.580 (statistically significant in 4 data sets) and the best RE in 9 data sets, ranging from 0.484 to 0.821 (statistically significant in 5 data sets). SVC-rbf (0.235), RR (0.265) and RKHS (0.422) gave the best κ in one data set each, while RKHS and BL tied for the last one (0.234). Finally, BL presented the best RE in two data sets (0.738 and 0.750), RFR (0.636) and SVC-rbf (0.617) in one each, and RKHS in the remaining three (0.502, 0.458 and 0.586). The difference between the performance of SVC-lin and that of the other models was less pronounced at higher percentiles of the distribution. The behaviour of regression and classification algorithms varied markedly when selection was done at different thresholds; that is, κ and RE for each algorithm depended strongly on the selection percentile. Based on these results, we propose classification methods as a promising alternative for GS in plant breeding.
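The classification approach hinges on dichotomising the training response at the selection threshold before fitting. The following sketch shows this step for SVC-lin using scikit-learn on hypothetical marker data; `class_weight="balanced"` is an assumption added here to cope with the 15/85 class imbalance, not necessarily the setting used by the authors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                    # simulated marker matrix
y = X[:, :5].sum(axis=1) + rng.normal(size=200)   # simulated phenotype

alpha_sel = 0.15
threshold = np.quantile(y, 1 - alpha_sel)         # dichotomise at the 85th percentile
y_class = (y >= threshold).astype(int)            # 1 = among the best alpha fraction

# linear-kernel SVC trained directly on the dichotomised response (SVC-lin)
clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X, y_class)
selected = clf.predict(X)                          # 1 = predicted as worth selecting
```

In the cross-validation scheme, the threshold would be computed on each training partition and the fitted classifier applied to the held-out 10%, with κ and RE computed against the true top-α membership in the testing set.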
