A Hierarchical Spectral Method for Extreme Classification

Extreme classification problems are multiclass and multilabel classification problems where the number of outputs is so large that straightforward strategies are neither statistically nor computationally viable. One strategy for dealing with the computational burden is a tree decomposition of the output space. While this typically leads to training and inference that scale sublinearly with the number of outputs, it also results in reduced statistical performance. In this work, we identify two shortcomings of tree decomposition methods and describe two heuristic mitigations. We compose these with an eigenvalue technique for constructing the tree. The end result is a computationally efficient algorithm that provides good statistical performance on several extreme classification data sets.
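
To make the tree-construction idea concrete, below is a minimal sketch of recursive spectral partitioning of a label space. The specific recipe here (splitting labels by the sign of the second singular vector of the label co-occurrence matrix, with a `max_leaf` cutoff) is a standard spectral-clustering heuristic chosen for illustration; the paper's actual eigenvalue technique may decompose a different matrix and balance splits differently.

```python
# Sketch: spectral label partitioning for a tree over a large output space.
# Assumes a multilabel training set given as a sparse indicator matrix
# Y (n_examples x n_labels). All names here are illustrative.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def spectral_split(Y):
    """Split label indices 0..k-1 into two groups via a spectral heuristic."""
    # Label co-occurrence: C[i, j] counts examples tagged with both i and j.
    C = (Y.T @ Y).asfptype()
    # svds returns singular values in ascending order, so with k=2 the
    # rows of vt are [second-largest, largest]. The top singular vector of a
    # nonnegative matrix is (near-)uniform in sign; the second one separates
    # the two dominant label clusters, so we split on its sign.
    _, _, vt = svds(C, k=2)
    scores = vt[0]
    left = np.where(scores >= 0)[0]
    right = np.where(scores < 0)[0]
    if len(left) == 0 or len(right) == 0:
        # Degenerate split: fall back to halving by score order.
        order = np.argsort(scores)
        mid = len(scores) // 2
        left, right = order[:mid], order[mid:]
    return left, right

def build_tree(Y, label_ids, max_leaf=8):
    """Recursively partition labels until each leaf is small enough."""
    if len(label_ids) <= max_leaf:
        return label_ids  # leaf: a small label set, e.g. handled one-vs-all
    left, right = spectral_split(Y[:, label_ids])
    return (build_tree(Y, label_ids[left], max_leaf),
            build_tree(Y, label_ids[right], max_leaf))

# Example: 1000 examples, 50 labels, roughly 3 labels per example.
rng = np.random.default_rng(0)
Y = csr_matrix(rng.random((1000, 50)) < 0.06, dtype=np.float64)
tree = build_tree(Y, np.arange(50))
```

With a roughly balanced tree of depth O(log k), routing an example to a leaf touches only logarithmically many internal nodes, which is the source of the sublinear training and inference cost the abstract refers to.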
