Model selection in multivariate adaptive regression splines (MARS) using information complexity as the fitness function

This paper introduces the information-theoretic measure of complexity (ICOMP) criterion for model selection in multivariate adaptive regression splines (MARS), to trade off efficiently between how well the model fits the data and the model's complexity. MARS is a popular nonparametric regression technique used to study the nonlinear relationship between a response variable and a set of predictors, with piecewise linear or cubic splines as basis functions. A critical step in determining the form of the nonparametric regression model in the MARS strategy is evaluating a portfolio of submodels to select the best one, with the appropriate number of knots over a subset of predictors. In usual regression modeling, when the model contains a large number of predictor variables and there is no precise information about the exact functional relationships among them, many model selection criteria still overfit. To find the simplest model that balances overfitting and underfitting, ICOMP is proposed as a powerful model selection criterion for MARS modeling. Here, model complexity is measured by the interdependency of the parameter estimates as well as by the number of free parameters in the model. We develop and study the performance of ICOMP alongside several of the most popular model selection criteria, such as Akaike’s information criterion (AIC), Schwarz’s Bayesian information criterion (BIC), and generalized cross-validation (GCV), in selecting the best subset models in MARS. We provide two Monte Carlo simulation examples and a real benchmark example to demonstrate the utility and versatility of the proposed approach for determining the best functional form of the predictive model.
Our numerical examples show that ICOMP provides a general model selection criterion that gives insight into the interdependencies and/or correlational structure between the parameter estimates in the selected model. This new approach can also be applied to many complex statistical modeling problems.
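
To make the comparison of criteria concrete, the following minimal Python sketch scores a few candidate piecewise linear models with AIC, BIC, GCV and an ICOMP-type criterion. It is an illustration under stated assumptions, not the procedure used in the paper: the abstract gives no formulas, so the ICOMP form here is the version commonly written as -2 log L + 2 C1(F^-1), with C1 the first-order maximal entropic complexity of the estimated inverse Fisher information matrix under a Gaussian linear model. The GCV penalty uses an effective parameter count r + c K with c = 3, one common convention for MARS. The function names, the simulated data and the candidate knot sets are all hypothetical, and no forward or backward MARS search is performed.

```python
import numpy as np


def spline_basis(x, knots):
    """Intercept, linear term, and one hinge max(0, x - t) per knot t."""
    cols = [np.ones_like(x), x] + [np.maximum(0.0, x - t) for t in knots]
    return np.column_stack(cols)


def c1_complexity(cov):
    """C1(S) = (s/2) log(tr(S)/s) - (1/2) log det(S); zero when S is a multiple of I."""
    s = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * s * np.log(np.trace(cov) / s) - 0.5 * logdet


def selection_criteria(y, B, n_knots, c=3.0):
    """AIC, BIC, GCV and an ICOMP-type score for a Gaussian least-squares fit."""
    n, r = B.shape
    beta, *_ = np.linalg.lstsq(B, y, rcond=None)
    rss = float(np.sum((y - B @ beta) ** 2))
    sigma2 = rss / n  # maximum likelihood estimate of the error variance
    neg2loglik = n * np.log(2.0 * np.pi * sigma2) + n
    p = r + 1  # regression coefficients plus the error variance

    # Assumed form of the estimated inverse Fisher information:
    # block-diagonal with sigma2 * (B'B)^-1 and 2 * sigma2^2 / n.
    f_inv = np.zeros((p, p))
    f_inv[:r, :r] = sigma2 * np.linalg.inv(B.T @ B)
    f_inv[r, r] = 2.0 * sigma2 ** 2 / n

    eff = r + c * n_knots  # assumed effective number of parameters for GCV
    return {
        "AIC": neg2loglik + 2.0 * p,
        "BIC": neg2loglik + p * np.log(n),
        "GCV": (rss / n) / (1.0 - eff / n) ** 2,
        "ICOMP": neg2loglik + 2.0 * c1_complexity(f_inv),
    }


# Hypothetical simulated data with a single true knot at x = 5.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, size=200))
y = 1.0 + 0.5 * x + 2.0 * np.maximum(0.0, x - 5.0) + rng.normal(0.0, 0.5, size=200)

candidates = {
    "no knots": [],
    "knot at 5": [5.0],
    "knots at 2.5, 5, 7.5": [2.5, 5.0, 7.5],
}
results = {
    name: selection_criteria(y, spline_basis(x, knots), len(knots))
    for name, knots in candidates.items()
}
for name, scores in results.items():
    print(name, {crit: round(float(val), 2) for crit, val in scores.items()})
```

For every criterion, smaller values indicate the preferred submodel. AIC and BIC penalize only the parameter count, whereas the C1 term grows as the variances of the parameter estimates become more unequal or the estimates become more strongly correlated, which is the interdependency the abstract refers to.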
