Max-Margin Stacking and Sparse Regularization for Linear Classifier Combination and Selection

The main principle of stacked generalization (stacking) is to use a second-level generalizer to combine the outputs of the base classifiers in an ensemble. In this paper, we investigate three linear combination types under the stacking framework: weighted sum (WS), class-dependent weighted sum (CWS), and linear stacked generalization (LSG). For learning the combination weights, we propose regularized empirical risk minimization with the hinge loss. In addition, we propose group-sparse regularization to facilitate classifier selection. We performed experiments on eight real-world datasets using two ensemble setups with differing diversity. The results demonstrate the power of regularized learning with the hinge loss. With sparse regularization, we can reduce the number of classifiers selected from the diverse ensemble without sacrificing accuracy; with the non-diverse ensembles, sparse regularization even improves accuracy on average.
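
For concreteness, the three linear combiners can be sketched as follows, where $p_{mk}(x)$ denotes the score of base classifier $m \in \{1,\dots,M\}$ for class $k \in \{1,\dots,K\}$ and $f_k(x)$ is the combined score; the notation here is illustrative and not necessarily the paper's own:

WS:  $f_k(x) = \sum_{m=1}^{M} w_m \, p_{mk}(x)$  (one weight per base classifier)
CWS: $f_k(x) = \sum_{m=1}^{M} w_{mk} \, p_{mk}(x)$  (one weight per classifier per class)
LSG: $f_k(x) = \sum_{m=1}^{M} \sum_{j=1}^{K} w_{kmj} \, p_{mj}(x)$  (a full linear map over all base scores)

Under the same assumptions, the max-margin learning step can be read as regularized empirical risk minimization with a multiclass hinge loss, where group sparsity is imposed over the weight group $w_m$ belonging to each base classifier so that entire classifiers can be dropped:

$\min_{w} \;\; \lambda \sum_{m=1}^{M} \lVert w_m \rVert_2 \;+\; \frac{1}{N} \sum_{i=1}^{N} \Big[\, 1 + \max_{k \neq y_i} f_k(x_i) - f_{y_i}(x_i) \,\Big]_+$

Replacing the group penalty $\sum_m \lVert w_m \rVert_2$ with a plain squared $\ell_2$ norm recovers max-margin stacking without classifier selection.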
