Prediction model optimization using full model selection with regression trees demonstrated with FTIR data from bovine milk.

Predictive modeling is the development of a model that is best able to predict an outcome based on given input variables. Model algorithms are the procedures used to define the functions that transform the data within models. Common algorithms include logistic regression (LR), linear discriminant analysis (LDA), classification and regression trees (CART), naïve Bayes (NB), and k-nearest neighbor (KNN). Data preprocessing options, such as feature extraction and reduction, and model algorithms are commonly selected empirically in epidemiological studies even though these decisions can significantly affect model performance. Accordingly, full model selection (FMS) methods were developed to provide a systematic approach to selecting predictive modeling methods; however, current limitations of FMS methods, such as their dependency on user-selected hyperparameters, have prevented their routine incorporation into analyses for model performance optimization. Here we present the use of regression trees as an innovative method to apply FMS. Regression tree FMS (rtFMS) requires the development of a model for every combination of predictive modeling method options under consideration. The iterated cross-validation performances of these models are then passed through a regression tree for selection of a final model. We demonstrate the benefits of rtFMS using a milk Fourier transform infrared spectroscopy dataset, wherein we build prediction models for two blood metabolic health parameters in dairy cows, nonesterified fatty acids (NEFA) and β-hydroxybutyric acid (BHBA). The goal for building NEFA and BHBA prediction models is to provide a milk-based screening tool for metabolic health in dairy cattle that can be incorporated automatically in milk analysis routines. These models could be used in conjunction with physical exams, cow-side tests, and other indications to initiate medical intervention.
In contrast to previously reported FMS methods, rtFMS is not a black box, is simple to implement and interpret, has no hyperparameters, and illustrates the relative importance of modeling options. Additionally, rtFMS allows for indirect comparisons among models developed using different datasets. Finally, rtFMS eliminates user bias due to personal preference for certain methods and removes the dependency on published comparisons of methods. Thus, rtFMS provides clear benefits over the empirical selection of data preprocessing options and model algorithms.
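
The rtFMS workflow described above (one model per combination of options, iterated cross-validation performances, then a regression tree over those performances) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the option grid, the `cv_score` placeholder (which returns made-up scores instead of fitting real models), and all function names are hypothetical. A small variance-reduction tree with a depth cap stands in for the regression tree; the cap exists only to keep the sketch short and is not a feature of rtFMS.

```python
import random
from itertools import product
from statistics import mean

# Hypothetical option grid; the levels are illustrative, not the paper's.
OPTIONS = {
    "preprocess": ["none", "derivative", "snv"],
    "reduction": ["none", "pca"],
    "algorithm": ["LR", "LDA", "CART", "NB", "KNN"],
}

def cv_score(combo, rng):
    """Placeholder for one cross-validation iteration (made-up performance)."""
    effect = (
        {"none": 0.00, "derivative": 0.05, "snv": 0.02}[combo["preprocess"]]
        + {"none": 0.00, "pca": 0.03}[combo["reduction"]]
        + {"LR": 0.04, "LDA": 0.03, "CART": 0.00, "NB": -0.02, "KNN": 0.01}[combo["algorithm"]]
    )
    return 0.70 + effect + rng.gauss(0.0, 0.01)

def build_table(n_iter=10, seed=0):
    """One row per (option combination, CV iteration)."""
    rng = random.Random(seed)
    rows = []
    for values in product(*OPTIONS.values()):
        combo = dict(zip(OPTIONS, values))
        rows.extend((combo, cv_score(combo, rng)) for _ in range(n_iter))
    return rows

def sse(rows):
    """Sum of squared deviations of the scores from their mean."""
    m = mean(score for _, score in rows)
    return sum((score - m) ** 2 for _, score in rows)

def best_split(rows):
    """Best 'option == level' versus rest split by total within-node SSE."""
    best = None
    for opt, levels in OPTIONS.items():
        for lvl in levels:
            yes = [r for r in rows if r[0][opt] == lvl]
            no = [r for r in rows if r[0][opt] != lvl]
            if not yes or not no:
                continue
            cost = sse(yes) + sse(no)
            if best is None or cost < best[0]:
                best = (cost, opt, lvl, yes, no)
    return best

def grow(rows, depth=0, max_depth=3):
    """Recursively grow a regression tree on the modeling options."""
    node = {"mean": mean(score for _, score in rows), "n": len(rows)}
    split = best_split(rows) if depth < max_depth else None
    if split is not None:
        _, opt, lvl, yes, no = split
        node.update(option=opt, level=lvl,
                    yes=grow(yes, depth + 1, max_depth),
                    no=grow(no, depth + 1, max_depth))
    return node

def best_leaf(node, path=()):
    """Return (mean score, option constraints) of the best-performing leaf."""
    if "option" not in node:
        return node["mean"], path
    cond = (node["option"], node["level"])
    yes = best_leaf(node["yes"], path + ((cond[0], "==", cond[1]),))
    no = best_leaf(node["no"], path + ((cond[0], "!=", cond[1]),))
    return max(yes, no, key=lambda leaf: leaf[0])

table = build_table()
tree = grow(table)
score, path = best_leaf(tree)
print("first split:", tree["option"], "==", tree["level"])
print("best leaf mean score:", round(score, 3))
print("option constraints:", path)
```

In this sketch the order of the splits gives a rough reading of which options matter most for performance, and the constraints on the best leaf narrow the candidates for a final model. With real data, `cv_score` would be replaced by actual model fitting and cross-validation.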
