Benchmarking support vector regression against partial least squares regression and artificial neural network: Effect of sample size on model performance

It has become easy to obtain high-dimensional multivariate chemical data. However, obtaining a large number of samples or acquiring reference measurements may be expensive or time-consuming, so the number of samples available for multivariate calibration modelling may be limited. If the data contain nonlinear relationships, nonlinear methods are required for the calibration task. Combining a limited amount of high-dimensional data with highly flexible nonlinear methods may result in overfitted models, which in turn perform badly on new data. Therefore, for real-world applications, it is desirable to understand how sample size affects model prediction performance. For this purpose, we compared partial least squares regression, artificial neural network, and support vector regression applied to three real-world nonlinear datasets, two of which were high-dimensional. We (i) evaluated the effect of calibration sample size on test set performance, including the variation in test set performance due to sampling variation, and (ii) tested whether cross-validated performance was adequate for assessing predictive ability. We demonstrated the applicability of artificial neural network and support vector regression to real-world data of limited size and showed that support vector regression had advantages over artificial neural network: (i) fewer calibration samples were required to obtain a desired model performance, (ii) support vector regression was less sensitive to sampling variation for small sample sets, and (iii) cross-validation was an approximately unbiased option for evaluating the true support vector regression model performance, even for small sample sets.
