Multi-class SVMs: From Tighter Data-Dependent Generalization Bounds to Novel Algorithms

This paper studies the generalization performance of multi-class classification algorithms, for which we obtain—for the first time—a data-dependent generalization error bound with a logarithmic dependence on the class size, substantially improving the state-of-the-art linear dependence in the existing data-dependent generalization analysis. The theoretical analysis motivates us to introduce a new multi-class classification machine based on lp-norm regularization, where the parameter p controls the complexity of the corresponding bounds. We derive an efficient optimization algorithm based on Fenchel duality theory. Benchmarks on several real-world datasets show that the proposed algorithm can achieve significant accuracy gains over the state of the art.

[1]  M. Talagrand,et al.  Probability in Banach Spaces: Isoperimetry and Processes , 1991 .

[2]  H. Robbins A Remark on Stirling’s Formula , 1955 .

[3]  Maya R. Gupta,et al.  Training highly multiclass classifiers , 2014, J. Mach. Learn. Res..

[4]  Tong Zhang,et al.  Statistical Analysis of Some Multi-Category Large Margin Classification Methods , 2004, J. Mach. Learn. Res..

[5]  M. Kloft,et al.  l p -Norm Multiple Kernel Learning , 2011 .

[6]  Peter L. Bartlett,et al.  Rademacher and Gaussian Complexities: Risk Bounds and Structural Results , 2003, J. Mach. Learn. Res..

[7]  Ohad Shamir,et al.  Multiclass-Multilabel Classification with More Classes than Examples , 2010, AISTATS.

[8]  Ambuj Tewari,et al.  Regularization Techniques for Learning with Matrices , 2009, J. Mach. Learn. Res..

[9]  D. Slepian The one-sided barrier problem for Gaussian noise , 1962 .

[10]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Jason Weston,et al.  Label Embedding Trees for Large Multi-Class Tasks , 2010, NIPS.

[12]  Ambuj Tewari,et al.  Applications of strong convexity--strong smoothness duality to learning with matrices , 2009, ArXiv.

[13]  Trevor Darrell,et al.  Part-Based R-CNNs for Fine-Grained Category Detection , 2014, ECCV.

[14]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[15]  Ameet Talwalkar,et al.  Foundations of Machine Learning , 2012, Adaptive computation and machine learning.

[16]  Jason D. M. Rennie,et al.  Improving Multiclass Text Classification with the Support Vector Machine , 2001 .

[17]  Charles A. Micchelli,et al.  Learning the Kernel Function via Regularization , 2005, J. Mach. Learn. Res..

[18]  Mehryar Mohri,et al.  Multi-Class Classification with Maximum Margin Multiple Kernel , 2013, ICML.

[19]  John Langford,et al.  Conditional Probability Tree Estimation Analysis and Algorithms , 2009, UAI.

[20]  Mehryar Mohri,et al.  Multi-Class Deep Boosting , 2014, NIPS.

[21]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[22]  Davide Anguita,et al.  The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers , 2011, NIPS.

[23]  V. Koltchinskii,et al.  Empirical margin distributions and bounding the generalization error of combined classifiers , 2002, math/0405343.

[24]  Tobias Glasmachers,et al.  Universal Consistency of Multi-Class Support Vector Classification , 2010, NIPS.

[25]  Ken Lang,et al.  NewsWeeder: Learning to Filter Netnews , 1995, ICML.

[26]  Chih-Jen Lin,et al.  A sequential dual method for large scale multi-class linear svms , 2008, KDD.

[27]  Pietro Perona,et al.  Caltech-UCSD Birds 200 , 2010 .

[28]  G. Griffin,et al.  Caltech-256 Object Category Dataset , 2007 .

[29]  V. Koltchinskii,et al.  Rademacher Processes and Bounding the Risk of Function Learning , 2004, math/0405338.

[30]  Alexander Zien,et al.  lp-Norm Multiple Kernel Learning , 2011, J. Mach. Learn. Res..

[31]  Andreas Winkelbauer,et al.  Moments and Absolute Moments of the Normal Distribution , 2012, ArXiv.

[32]  Ambuj Tewari,et al.  On the Consistency of Multiclass Classification Methods , 2007, J. Mach. Learn. Res..

[33]  Colin McDiarmid,et al.  Surveys in Combinatorics, 1989: On the method of bounded differences , 1989 .

[34]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[35]  Tong Zhang,et al.  Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification , 2004, NIPS.

[36]  R. A. Vitale Some comparisons for Gaussian processes , 2000, math/0011093.

[37]  Thomas Hofmann,et al.  Learning with Taxonomies: Classifying Documents and Words , 2003 .

[38]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Ashish Kapoor,et al.  Active learning for large multi-class problems , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Arnaud Doucet,et al.  A Framework for Kernel-Based Multi-Category Classification , 2007, J. Artif. Intell. Res..

[41]  Yann Guermeur,et al.  Combining Discriminant Models with New Multi-Class SVMs , 2002, Pattern Analysis & Applications.

[42]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..