Logarithmic Time One-Against-Some

We create a new online reduction of multiclass classification to binary classification for which training and prediction time scale logarithmically with the number of classes. Compared to previous approaches, we obtain substantially better statistical performance for two reasons: first, we prove a tighter and more complete boosting theorem, and second, we translate the results more directly into an algorithm. We show that several simple techniques give rise to an algorithm that can compete with one-against-all in both space and predictive power while offering exponential improvements in speed when the number of classes is large.
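
To make the "logarithmic time" claim concrete, the sketch below illustrates the general shape of such a reduction: route an example down a balanced tree in O(log k) steps to a small candidate set of classes, then score only those candidates with per-class binary predictors (one-against-some) instead of all k classes (one-against-all). Everything here is an illustrative assumption for exposition; the tree construction, routing rule, and scorers are placeholders and not the paper's actual reduction or its boosting analysis.

```python
# Illustrative sketch only: a toy "route to a few candidates, then score" predictor.
# The partition into leaves, the linear routers, and the per-class scorers are
# hypothetical choices, not the authors' construction.
import numpy as np


class ToyLogTimePredictor:
    def __init__(self, n_classes, n_features, rng=None):
        self.rng = rng or np.random.default_rng(0)
        # Leaves of a balanced tree over classes: each leaf holds a small,
        # fixed candidate set (assumption: blocks of roughly 4 classes).
        self.leaves = [list(block) for block in
                       np.array_split(np.arange(n_classes), max(1, n_classes // 4))]
        self.depth = int(np.ceil(np.log2(len(self.leaves)))) if len(self.leaves) > 1 else 0
        # One (random, untrained) linear router per tree level and one binary
        # scorer per class, standing in for learned predictors.
        self.routers = self.rng.normal(size=(self.depth, n_features))
        self.scorers = self.rng.normal(size=(n_classes, n_features))

    def _route(self, x):
        # O(log #leaves) routing: each level halves the range of reachable leaves.
        lo, hi = 0, len(self.leaves)
        for level in range(self.depth):
            if hi - lo <= 1:
                break
            mid = (lo + hi) // 2
            if x @ self.routers[level] >= 0.0:
                lo = mid
            else:
                hi = mid
        return self.leaves[lo]

    def predict(self, x):
        # Score only the constant-sized candidate set instead of all classes,
        # so total prediction cost is logarithmic in the number of classes.
        candidates = self._route(x)
        scores = self.scorers[candidates] @ x
        return candidates[int(np.argmax(scores))]


# Usage: route a random example through a 1000-class toy predictor.
model = ToyLogTimePredictor(n_classes=1000, n_features=64)
print(model.predict(np.random.default_rng(1).normal(size=64)))
```

In this toy version the routers and scorers are random; in a real system both would be trained online, and the quality of the routing is exactly what the paper's boosting-style analysis is about.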
