Probabilistic Label Trees for Extreme Multi-label Classification

Extreme multi-label classification (XMLC) is the task of tagging instances with a small subset of relevant labels chosen from an extremely large pool of possible labels. Problems of this scale can be handled efficiently by organizing the labels in a tree, as in the hierarchical softmax used for multi-class problems. In this paper, we thoroughly investigate probabilistic label trees (PLTs), which can be treated as a generalization of hierarchical softmax to multi-label problems. We first introduce the PLT model and discuss its training and inference procedures along with their computational costs. Next, we prove the consistency of PLTs for a wide spectrum of performance metrics. To this end, we upper-bound their regret by a function of the surrogate-loss regrets of the node classifiers. Furthermore, we consider the problem of training PLTs in a fully online setting, without any prior knowledge of the training instances, their features, or their labels. In this case, both the node classifiers and the tree structure are trained online. We prove a specific equivalence between the fully online algorithm and an algorithm with a tree structure given in advance. Finally, we discuss several implementations of PLTs and introduce a new one, napkinXC, which we empirically evaluate and compare with state-of-the-art algorithms.
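As a rough illustration of the inference procedure described above, the sketch below performs top-k PLT prediction with a uniform-cost search. Each tree node v holds a binary probabilistic classifier estimating P(z_v = 1 | z_parent(v) = 1, x), and the estimated marginal probability of label j is the product of these node estimates along the path from the root to leaf j; since every factor is at most 1, path probabilities never increase with depth, so the first k leaves popped from a max-priority queue are exactly the top-k labels. This is a minimal sketch under these assumptions: the Node class and the callable classifiers are illustrative stand-ins, not the napkinXC API.

import heapq

class Node:
    def __init__(self, classifier, children=None, label=None):
        self.classifier = classifier    # callable x -> P(z_v = 1 | z_parent = 1, x)
        self.children = children or []  # empty list for leaves
        self.label = label              # label id, set only for leaves

def plt_top_k(root, x, k):
    """Return the k labels with the highest estimated marginal probabilities."""
    # Max-heap via negated probabilities; id() breaks ties so Nodes
    # are never compared directly.
    heap = [(-root.classifier(x), id(root), root)]
    top = []
    while heap and len(top) < k:
        neg_p, _, node = heapq.heappop(heap)
        if not node.children:
            # Leaf popped: its path probability dominates everything left.
            top.append((node.label, -neg_p))
        else:
            for child in node.children:
                # Chain rule: extend the path probability by one node estimate.
                p = -neg_p * child.classifier(x)
                heapq.heappush(heap, (-p, id(child), child))
    return top

A toy usage example with constant-probability stubs in place of trained node classifiers:

leaf = lambda j, p: Node(classifier=lambda x: p, label=j)
root = Node(
    classifier=lambda x: 0.9,
    children=[
        Node(classifier=lambda x: 0.8, children=[leaf(0, 0.7), leaf(1, 0.2)]),
        Node(classifier=lambda x: 0.3, children=[leaf(2, 0.9), leaf(3, 0.1)]),
    ],
)
print(plt_top_k(root, x=None, k=2))  # approximately [(0, 0.504), (2, 0.243)]

Because the search only descends into subtrees that can still reach the top of the queue, the number of node classifiers evaluated is typically far smaller than the total number of labels, which is the source of the logarithmic-time inference that makes tree-based methods attractive at this scale.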
