Robust Boosting via Convex Optimization: Theory and Applications

In this work we consider statistical learning problems. A learning machine aims to extract information from a set of training examples such that it can predict the associated label of unseen examples. We consider the case where the resulting classification or regression rule is a combination of simple rules, also called base hypotheses. So-called boosting algorithms iteratively find a weighted linear combination of base hypotheses that predicts well on unseen data. We address the following issues:

The statistical learning theory framework for analyzing boosting methods. We study learning-theoretic guarantees on the prediction performance on unseen examples. Large margin classification techniques have recently emerged as a practical result of the theory of generalization, in particular boosting and support vector machines. A large margin implies good generalization performance. Hence, we analyze how large the margins produced by boosting are and derive an improved algorithm that generates the maximum margin solution.

How can boosting methods be related to mathematical optimization techniques? To analyze the properties of the resulting classification or regression rule, it is important to understand whether and under which conditions boosting converges. We show that boosting can be used to solve large-scale constrained optimization problems whose solutions are well characterized. To this end, we relate boosting methods to techniques from mathematical optimization and derive convergence guarantees for a rather general family of boosting algorithms.

How can boosting be made robust to noise? One problem of current boosting techniques is that they are sensitive to noise in the training sample. To make boosting robust, we transfer the soft margin idea from support vector learning to boosting and develop theoretically motivated, regularized algorithms that exhibit high noise robustness.

How can boosting be adapted to regression problems? Boosting methods were originally designed for classification problems. To extend the boosting idea to regression, we use the preceding convergence results and the relation to semi-infinite programming to design boosting-like (leveraging) algorithms for regression problems, and we show that these algorithms have desirable theoretical and practical properties.

Are boosting techniques useful in practice? The theoretical results are accompanied by simulations that either illustrate properties of the proposed algorithms or demonstrate that they work well in practice. We report on successful applications to a non-intrusive power monitoring system, chaotic time series analysis, and a drug discovery process.
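
To make the boosting procedure described above concrete, the following is a minimal sketch of an AdaBoost-style algorithm with decision stumps as base hypotheses. It is an illustration under common assumptions, not the specific algorithms developed in this work; the stump learner, variable names, and stopping rule are choices made for this sketch only.

```python
# Minimal AdaBoost sketch: iteratively builds a weighted linear combination
# of base hypotheses (here: single-feature decision stumps).
import numpy as np

def train_stump(X, y, w):
    """Return the threshold stump with lowest weighted training error."""
    n, d = X.shape
    best = (np.inf, 0, 0.0, 1)                  # (error, feature, threshold, sign)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w[pred != y])      # weighted error of this stump
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T=50):
    """y must contain labels in {-1, +1}. Returns a list of (alpha, stump)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # example weights, start uniform
    ensemble = []
    for _ in range(T):
        err, j, thr, sign = train_stump(X, y, w)
        err = max(err, 1e-12)                   # guard against a perfect stump
        if err >= 0.5:                          # weak-learning assumption violated
            break
        alpha = 0.5 * np.log((1 - err) / err)   # hypothesis coefficient
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)          # up-weight misclassified examples
        w /= w.sum()
        ensemble.append((alpha, (j, thr, sign)))
    return ensemble

def predict(ensemble, X):
    """Sign of the weighted vote of all base hypotheses."""
    f = np.zeros(len(X))
    for alpha, (j, thr, sign) in ensemble:
        f += alpha * sign * np.where(X[:, j] > thr, 1, -1)
    return np.sign(f)
```

Each round re-weights the training sample so that subsequent base hypotheses concentrate on the examples the current combination gets wrong; a call such as ensemble = adaboost(X, y, T=100) followed by predict(ensemble, X_test) yields the combined classifier, whose normalized margin on example i is y_i f(x_i) divided by the sum of the alpha_t.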

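The margin-maximization and soft-margin ideas summarized in the abstract can be written as linear programs over the hypothesis coefficients. The formulation below is one standard way to do so, in the spirit of LP-based boosting; the symbols rho (margin), xi_i (slacks), and nu (regularization parameter) are introduced here for illustration and need not match the exact objectives used in this work.

```latex
% Hard-margin LP: maximize the smallest margin rho of the combined hypothesis
%   f(x) = \sum_t \alpha_t h_t(x), with \alpha on the simplex.
\max_{\rho,\,\alpha}\; \rho
\quad \text{s.t.}\quad
y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \ge \rho \;\; (i=1,\dots,N),
\qquad \sum_{t=1}^{T} \alpha_t = 1,\;\; \alpha_t \ge 0.

% Soft-margin variant: slacks \xi_i let individual examples violate the margin,
% traded off against the margin by the parameter \nu (cf. nu-SVMs):
\max_{\rho,\,\alpha,\,\xi}\; \rho - \frac{1}{\nu N}\sum_{i=1}^{N} \xi_i
\quad \text{s.t.}\quad
y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \ge \rho - \xi_i,\;\; \xi_i \ge 0,
\qquad \sum_{t=1}^{T} \alpha_t = 1,\;\; \alpha_t \ge 0.
```

When the set of base hypotheses is infinite, the primal has infinitely many variables (equivalently, its dual has infinitely many constraints), so the problem becomes a semi-infinite program; this is the connection between boosting and column-generation methods from mathematical optimization mentioned above.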