Design choice and machine learning model performances

An increasing number of publications present the joint application of Design of Experiments (DOE) and machine learning (ML) as a methodology to collect and analyze data on a specific industrial phenomenon. However, the literature shows that the choice of design for data collection and of model for data analysis is often driven by incidental factors rather than by statistical or algorithmic advantages. As a result, there is a lack of studies providing guidelines on which designs and ML models to use jointly for data collection and analysis. This paper is the first in the literature to discuss the choice of design in relation to ML model performance. It reports an extensive study that considers 12 experimental designs, 7 families of predictive models, 7 test functions that emulate physical processes, and 8 noise settings, both homoscedastic and heteroscedastic. The results can have an immediate impact on the work of practitioners by providing guidelines for practical applications of DOE and ML.
