Fast convergence of extended Rademacher Complexity bounds

In this work we propose new generalization bounds for binary classifiers, based on the global Rademacher Complexity (RC), which achieve fast convergence rates by combining Talagrand's state-of-the-art results on empirical processes with the exploitation of unlabeled patterns. In this framework we improve both the constants and the convergence rates of existing RC-based bounds. All the proposed bounds rely on empirical quantities, so they can be easily computed in practice, and are given in both implicit and explicit form: the former are tighter, while the latter give more insight into how Talagrand's results and the unlabeled patterns affect the learning process. Finally, we assess the quality of the bounds against the theoretical limit, showing that there is room for further improvement in the common scenario of binary classification.
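Although the refined bounds are the paper's contribution, the baseline they improve on is the classical global RC bound: for classifiers h in a class H with outputs in {-1, +1} and the 0-1 loss, with probability at least 1 - delta, L(h) <= L_hat(h) + R_hat_n(H)/2 + 3*sqrt(log(2/delta)/(2n)), where R_hat_n(H) is the empirical Rademacher complexity. The sketch below is a minimal Monte Carlo estimate of R_hat_n(H) for a hypothesis space summarized by a finite set of candidate classifiers evaluated on the training sample; the function names and the finite-class simplification are ours for illustration, not the paper's construction.

```python
import numpy as np

def empirical_rademacher(predictions, n_draws=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity
    R_hat_n(H) = E_sigma[ sup_{h in H} (1/n) * sum_i sigma_i * h(x_i) ].

    predictions: (m, n) array; row j holds the {-1, +1} outputs of the
    j-th candidate hypothesis on the n training patterns (a finite
    proxy for the hypothesis space H).
    """
    rng = np.random.default_rng(seed)
    m, n = predictions.shape
    # Draw Rademacher sign vectors sigma uniformly from {-1, +1}^n.
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # Supremum over the class of the empirical correlation,
    # then average over the sigma draws.
    sups = (sigmas @ predictions.T / n).max(axis=1)
    return float(sups.mean())

def rc_bound(empirical_error, rc, n, delta=0.05):
    """Classical global RC bound for {-1, +1} classifiers under the
    0-1 loss: with probability >= 1 - delta,
        L(h) <= L_hat(h) + R_hat_n(H)/2 + 3*sqrt(log(2/delta) / (2n)).
    The factor 1/2 maps the complexity of H to that of the loss class.
    """
    return empirical_error + rc / 2.0 + 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * n))
```

For example, with m = 100 candidate classifiers on n = 1000 patterns, predictions is a 100 x 1000 array of signs, and rc_bound(0.1, empirical_rademacher(predictions), 1000) returns an upper bound on the true error that holds uniformly over the candidates with 95% confidence.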
