Learning the hypotheses space from data through a U-curve algorithm: a statistically consistent complexity regularizer for Model Selection

This paper proposes a data-driven, systematic, consistent, and non-exhaustive approach to Model Selection that extends the classical agnostic PAC learning model. In this approach, a learning problem is modeled not only by a hypothesis space H, but also by a Learning Space L(H): a poset of subspaces of H that covers H and satisfies a property regarding the VC dimension of related subspaces, which makes it a suitable algebraic search space for Model Selection algorithms. Our main contributions are a general data-driven learning algorithm that performs implicitly regularized Model Selection on L(H), and a framework under which one can, in theory, better estimate a target hypothesis from a sample of a given size by properly modeling L(H) and employing high computational power. A notable consequence of this approach is a set of conditions under which a non-exhaustive search of L(H) returns an optimal solution. These results point to a practical property of Machine Learning: a lack of experimental data may be mitigated by high computational capacity. As computational power becomes ever more widely available, this property may help explain why Machine Learning has become so important, even where data is expensive and hard to obtain.
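To illustrate how a non-exhaustive search can still return an optimal solution, the sketch below walks a single chain of nested candidate models and stops at the first increase in estimated error. It is a minimal illustration of the U-curve idea, not the algorithm proposed in the paper: the function name `u_curve_chain_search`, the model labels, and the toy error values are all hypothetical, and the estimated error is simply assumed to be U-shaped along the chain (non-increasing, then non-decreasing).

```python
from typing import Callable, Sequence, Tuple


def u_curve_chain_search(
    chain: Sequence[str], cost: Callable[[str], float]
) -> Tuple[str, float, int]:
    """Search a chain of nested models, ordered from simplest to most complex.

    Assumes (hypothetically) that `cost` is U-shaped along the chain, so the
    first strict increase means every later model is at least as bad and the
    remaining models need not be evaluated.
    Returns (best model, its cost, number of models evaluated).
    """
    best_model = chain[0]
    best_cost = cost(best_model)
    evaluated = 1
    for model in chain[1:]:
        c = cost(model)
        evaluated += 1
        if c > best_cost:
            # Cost started to rise: under the U-shape assumption we can stop.
            break
        if c < best_cost:
            best_model, best_cost = model, c
    return best_model, best_cost, evaluated


# Toy estimated errors for six nested models (made-up numbers, U-shaped).
chain = ["M1", "M2", "M3", "M4", "M5", "M6"]
toy_cost = {"M1": 0.40, "M2": 0.31, "M3": 0.25, "M4": 0.27, "M5": 0.33, "M6": 0.45}

result = u_curve_chain_search(chain, toy_cost.__getitem__)
print(result)
```

In this toy run the search evaluates four of the six models and stops as soon as the error rises after M3. On a full poset such as L(H), the same stopping rule would have to be applied along every chain that is explored, and its optimality depends entirely on the U-shape assumption holding.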
