A non-linear data mining parameter selection algorithm for continuous variables

In this article, we propose a new data mining algorithm that both captures non-linearity in the data and identifies the best subset model. To produce an enhanced subset of the original variables, a selection method should add a supplementary layer of regression analysis that captures complex relationships in the data through mathematical transformation of the predictors and exploration of synergistic effects among combined variables. The method presented here produces an optimal subset of variables, rendering the overall process of model selection more efficient. By transforming the original inputs, the algorithm yields interpretable parameters together with a faithful fit to the data. The core objective of this paper is to introduce a new estimation technique for the classical least squares regression framework. This automatic variable transformation and model selection method offers an optimal, stable model that minimizes mean squared error and variability, combining all-possible-subset selection with variable transformations and interactions. Moreover, the method controls multicollinearity, leading to an optimal set of explanatory variables.
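The paper's exact algorithm is not given in this abstract, so the following is only a minimal sketch of the ingredients it names: an all-possible-subset search over a pool of transformed predictors and pairwise interactions, scored by mean squared error, with a variance-inflation-factor (VIF) screen to control multicollinearity. All names, the synthetic data, and the VIF cutoff of 10 are illustrative assumptions, not the authors' method.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y depends non-linearly on x0^2 and x0*x1.
n = 200
X = rng.uniform(0.5, 3.0, size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] ** 2 + 1.5 * X[:, 0] * X[:, 1] + rng.normal(0, 0.5, n)

# Candidate pool: raw predictors, simple transforms, pairwise interactions.
names, cols = [], []
for j in range(X.shape[1]):
    names += [f"x{j}", f"x{j}^2", f"log(x{j})"]
    cols += [X[:, j], X[:, j] ** 2, np.log(X[:, j])]
for a, b in itertools.combinations(range(X.shape[1]), 2):
    names.append(f"x{a}*x{b}")
    cols.append(X[:, a] * X[:, b])
pool = np.column_stack(cols)

def fit_mse(Z, y):
    """OLS with intercept; return the in-sample mean squared error."""
    A = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(np.mean(resid ** 2))

def max_vif(Z):
    """Largest variance inflation factor among the columns of Z."""
    if Z.shape[1] < 2:
        return 1.0
    vifs = []
    for j in range(Z.shape[1]):
        others = np.column_stack([np.ones(len(Z)), np.delete(Z, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, Z[:, j], rcond=None)
        resid = Z[:, j] - others @ beta
        r2 = 1.0 - np.sum(resid ** 2) / np.sum((Z[:, j] - Z[:, j].mean()) ** 2)
        vifs.append(1.0 / max(1.0 - r2, 1e-12))
    return max(vifs)

# Exhaustive search over all subsets of up to 3 candidate terms,
# discarding subsets whose worst VIF exceeds a conventional cutoff of 10.
best = None
for k in range(1, 4):
    for idx in itertools.combinations(range(pool.shape[1]), k):
        Z = pool[:, idx]
        if max_vif(Z) > 10:
            continue
        mse = fit_mse(Z, y)
        if best is None or mse < best[0]:
            best = (mse, [names[i] for i in idx])

print("best subset:", best[1], "MSE: %.4f" % best[0])
```

In practice the in-sample MSE used here would be replaced by a penalized or cross-validated criterion (e.g., Mallows' Cp), since adding terms always lowers the in-sample error; the VIF screen is what keeps highly collinear transform pairs such as x0 and x0^2 out of the same subset.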
