Regression with small data sets: a case study using code surrogates in additive manufacturing

Recent years have seen increasing interest in mining massive data sets whose sizes are measured in terabytes. However, there are some problems where collecting even a single data point is very expensive, resulting in data sets with only tens or hundreds of samples. One such problem is building code surrogates, where a computer simulation is run at many different values of the input parameters and a regression model is built to relate the outputs of the simulation to the inputs. A good surrogate can be very useful in sensitivity analysis, uncertainty analysis, and experiment design, but running an expensive simulation at many sample points can be prohibitively costly. In this paper, we use a problem from the domain of additive manufacturing to show that, even with small data sets, we can build good-quality surrogates by appropriately selecting the input samples and the regression algorithm. Our work is broadly applicable to simulations in other domains, and the ideas proposed can be used in time-constrained machine learning tasks, such as hyper-parameter optimization.
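
The workflow described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a hypothetical two-parameter simulator (`run_simulation` with made-up parameter ranges), uses a Sobol' space-filling design for the input samples, and fits a Gaussian process regression surrogate, one of several regression choices suited to small data sets.

```python
# Minimal sketch of surrogate building: sample the inputs, run the (expensive)
# simulation at those points, fit a regression model mapping inputs to outputs.
# The simulator and its parameter ranges below are hypothetical placeholders.
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel


def run_simulation(x):
    """Hypothetical expensive simulation: x = (laser power, scan speed)."""
    power, speed = x
    return np.sin(power / 50.0) * np.exp(-speed / 2000.0)  # placeholder output


# Space-filling design over the input domain (here a scrambled Sobol' sequence),
# since each sample point is costly and only a few runs can be afforded.
sampler = qmc.Sobol(d=2, scramble=True, seed=0)
unit_samples = sampler.random(n=32)                      # only 32 simulation runs
bounds_lo, bounds_hi = [100.0, 500.0], [400.0, 3000.0]   # assumed parameter ranges
X = qmc.scale(unit_samples, bounds_lo, bounds_hi)
y = np.array([run_simulation(x) for x in X])

# Fit a Gaussian process regression surrogate to the small sample set.
gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
gp.fit(X, y)

# The surrogate now predicts the simulation output cheaply at new input values.
y_pred, y_std = gp.predict([[250.0, 1500.0]], return_std=True)
print(f"surrogate prediction: {y_pred[0]:.4f} +/- {y_std[0]:.4f}")
```

With a surrogate in hand, tasks such as sensitivity or uncertainty analysis can query the cheap regression model instead of the simulation itself; the quality of the surrogate depends on both the sampling design and the regression algorithm, as the paper investigates.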
