Which regression method to use? Making informed decisions in “data-rich/knowledge-poor” scenarios – The Predictive Analytics Comparison framework (PAC)

Abstract In the big data and Manufacturing 4.0 era, there is growing interest in using advanced analytical platforms to develop predictive modeling approaches that take advantage of the wealth of data available. Typically, practitioners have their own favorite methods for addressing the modeling task, as a result of their technical background, past experience, or available software, among other possible reasons. However, the future importance of this task justifies and requires more informed decisions about the predictive solution to adopt. Therefore, a wider variety of methods should be considered and assessed before the final decision is made. Having gone through this process many times and in different application scenarios (chemical industry, biofuels, food and drink, shipping industry, etc.), the authors developed a software framework that speeds up the selection process while securing a rigorous and robust assessment: the Predictive Analytics Comparison framework (PAC). PAC is a systematic and robust framework for model screening and development that was developed in Matlab, but its implementation can be carried out on other software platforms. It comprises four essential blocks: i) Analytics Domain; ii) Data Domain; iii) Comparison Engine; iv) Results Report. PAC was developed for the case of a single response variable, but it can be extended to multiple responses by considering each one separately. Several case studies are presented in this article to illustrate PAC's efficiency and robustness for problem-specific method screening in the absence of prior knowledge. For instance, the analysis of a real-world dataset reveals that, even when addressing the same predictive problem with the same response variable, the best modeling approach may not be the one foreseen a priori, and may not even remain the same when different predictor sets are used.
With increasing frequency, situations like these raise considerable challenges for practitioners, underlining the importance of having a tool such as PAC to assist them in making more informed decisions and in benefiting from the data available in Manufacturing 4.0 environments.
