Enabling Efficiency-Precision Trade-offs for Label Trees in Extreme Classification

Extreme multi-label classification (XMC) aims to learn a model that can tag data points with a small subset of relevant labels from an extremely large label set. Real-world e-commerce applications such as personalized recommendation and product advertising can be formulated as XMC problems, where the objective is to predict, for each user, a small subset of items from a catalog of several million products. For such applications, a common approach is to organize the labels into a tree, enabling training and inference times that are logarithmic in the number of labels [23]. While training a model once a label tree is available is well studied, designing the structure of the tree is a difficult task that is not yet well understood and can dramatically impact both model latency and statistical performance. Existing approaches to tree construction optimize exclusively either for statistical performance or for latency. We propose an efficient, information-theory-inspired algorithm that constructs intermediate operating points trading off between the benefits of both, which was not previously possible. We corroborate our theoretical analysis with numerical results, showing that on the Wiki-500K benchmark dataset [4] our method can reduce a proxy for expected latency by up to 28% while maintaining the same accuracy as Parabel [23]. On several datasets derived from e-commerce customer logs, our modified label tree improves this expected-latency metric by up to 20% while maintaining the same accuracy. Finally, we discuss challenges in realizing these latency improvements in deployed models.
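
To make the latency side of this trade-off concrete, consider the information-theoretic principle the abstract alludes to: in a label tree, expected traversal cost is the frequency-weighted depth of the leaves, which is exactly the quantity Huffman coding [13] minimizes. The sketch below is a minimal illustration of that principle, not the algorithm proposed in the paper; the function names and the toy label frequencies are hypothetical.

import heapq
from collections import namedtuple

# Field order matters: heapq compares by frequency first and breaks ties
# with the unique insertion order, so `label` is never compared.
Node = namedtuple("Node", ["freq", "order", "label", "children"])

def huffman_label_tree(label_freqs):
    """Greedily merge the two least frequent subtrees (Huffman's rule [13])."""
    heap = [Node(f, i, lab, ()) for i, (lab, f) in enumerate(label_freqs.items())]
    heapq.heapify(heap)
    order = len(heap)
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, Node(a.freq + b.freq, order, None, (a, b)))
        order += 1
    return heap[0]

def expected_depth(node, depth=0):
    """Frequency-weighted leaf depth: a proxy for expected inference latency."""
    if not node.children:  # leaf: contributes its frequency times its depth
        return node.freq * depth
    return sum(expected_depth(child, depth + 1) for child in node.children)

# Hypothetical label frequencies, for illustration only.
freqs = {"shoes": 0.50, "laptops": 0.25, "cables": 0.15, "rugs": 0.10}
print(expected_depth(huffman_label_tree(freqs)))  # 1.75 vs. 2.0 for a balanced tree

A tree skewed this aggressively toward frequent labels optimizes the latency proxy alone, while a balanced tree is the usual choice for statistical performance; the intermediate operating points described above sit between these two extremes.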

[1] Rohit Babbar et al. Bonsai: Diverse and Shallow Trees for Extreme Multi-label Classification. arXiv, 2019.

[2] Ping Li et al. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). NIPS, 2014.

[3] Jonathon Shlens et al. Deep Networks With Large Output Spaces. ICLR, 2014.

[4] Manik Varma et al. Multi-label Learning with Millions of Labels: Recommending Advertiser Bid Phrases for Web Pages. WWW, 2013.

[5] Pradeep Ravikumar et al. PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification. ICML, 2016.

[6] Yiming Yang et al. Deep Learning for Extreme Multi-label Text Classification. SIGIR, 2017.

[7] Junfeng Hu et al. Optimize Hierarchical Softmax with Word Similarity Knowledge. Polytech. Open Libr. Int. Bull. Inf. Technol. Sci., 2017.

[8] Stephen P. Boyd et al. Convex Optimization. Cambridge University Press, 2004.

[9] Jure Leskovec et al. Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text. RecSys, 2013.

[10] Georgios Paliouras et al. LSHTC: A Benchmark for Large-Scale Text Classification. arXiv, 2015.

[11] Chao Xu et al. On the Computational Complexity of the Probabilistic Label Tree Algorithms. arXiv, 2019.

[12] Zihan Zhang et al. AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification. NeurIPS, 2019.

[13] David A. Huffman. A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, 1952.

[14] Yoshua Bengio et al. Hierarchical Probabilistic Neural Network Language Model. AISTATS, 2005.

[15] Jeffrey Dean et al. Efficient Estimation of Word Representations in Vector Space. ICLR, 2013.

[16] Inderjit S. Dhillon et al. PECOS: Prediction for Enormous and Correlated Output Spaces. arXiv, 2020.

[17] Pradeep Ravikumar et al. PPDsparse: A Parallel Primal-Dual Sparse Method for Extreme Classification. KDD, 2017.

[18] Tomas Mikolov et al. Bag of Tricks for Efficient Text Classification. EACL, 2016.

[19] Yiming Yang et al. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from Transformers. 2019.

[20] Eyke Hüllermeier et al. Extreme F-measure Maximization using Sparse Probability Estimates. ICML, 2016.

[21] Anshumali Shrivastava et al. Extreme Classification in Log Memory. arXiv, 2018.

[22] Geoffrey E. Hinton et al. A Scalable Hierarchical Distributed Language Model. NIPS, 2008.

[23] Alberto Bertoni et al. Size Constrained Distance Clustering: Separation Properties and Some Complexity Results. Fundamenta Informaticae, 2012.

[24] Bernhard Schölkopf et al. DiSMEC: Distributed Sparse Machines for Extreme Multi-label Classification. WSDM, 2016.

[25] Robert M. Fano. Transmission of Information: A Statistical Theory of Communications. MIT Press, 1961.

[26] Manik Varma et al. Extreme Multi-label Learning with Label Features for Warm-start Tagging, Ranking & Recommendation. WSDM, 2018.

[27] Stanislav Krajci et al. Performance Analysis of Fano Coding. IEEE International Symposium on Information Theory (ISIT), 2015.

[28] Jeffrey Dean et al. Distributed Representations of Words and Phrases and their Compositionality. NIPS, 2013.

[29] Manik Varma et al. FastXML: A Fast, Accurate and Stable Tree-classifier for Extreme Multi-label Learning. KDD, 2014.