Fast Multi-Resolution Transformer Fine-tuning for Extreme Multi-label Text Classification

Extreme multi-label text classification (XMC) seeks to find relevant labels from an extreme large label collection for a given text input. Many real-world applications can be formulated as XMC problems, such as recommendation systems, document tagging and semantic search. Recently, transformer based XMC methods, such as X-Transformer and LightXML, have shown significant improvement over other XMC methods. Despite leveraging pre-trained transformer models for text representation, the fine-tuning procedure of transformer models on large label space still has lengthy computational time even with powerful GPUs. In this paper, we propose a novel recursive approach, XR-Transformer to accelerate the procedure through recursively fine-tuning transformer models on a series of multi-resolution objectives related to the original XMC objective function. Empirical results show that XR-Transformer takes significantly less training time compared to other transformer-based XMC models while yielding better state-of-the-art results. In particular, on the public Amazon-3M dataset with 3 million labels, XR-Transformer is not only 20x faster than XTransformer but also improves the Precision@1 from 51% to 54%. Our code is publicly available at https://github.com/amzn/pecos.

[1]  Yiming Yang,et al.  Deep Learning for Extreme Multi-label Text Classification , 2017, SIGIR.

[2]  Narendra Ahuja,et al.  Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Bernhard Schölkopf,et al.  Data scarcity, robustness and extreme multi-label classification , 2019, Machine Learning.

[4]  Bernhard Schölkopf,et al.  DiSMEC: Distributed Sparse Machines for Extreme Multi-label Classification , 2016, WSDM.

[5]  Charles Elkan,et al.  The Foundations of Cost-Sensitive Learning , 2001, IJCAI.

[6]  Zihan Zhang,et al.  AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification , 2019, NeurIPS.

[7]  John Langford,et al.  Logarithmic Time One-Against-Some , 2016, ICML.

[8]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[9]  Ankit Singh Rawat,et al.  Multilabel reductions: what is my loss optimising? , 2019, NeurIPS.

[10]  John Langford,et al.  Logarithmic Time Online Multiclass prediction , 2015, NIPS.

[11]  Manik Varma,et al.  Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications , 2016, KDD.

[12]  Manik Varma,et al.  FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning , 2014, KDD.

[13]  Inderjit S. Dhillon,et al.  PECOS: Prediction for Enormous and Correlated Output Spaces , 2020, ArXiv.

[14]  Sanjiv Kumar,et al.  Accelerating Large-Scale Inference with Anisotropic Vector Quantization , 2019, ICML.

[15]  Pradeep Ravikumar,et al.  PD-Sparse : A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification , 2016, ICML.

[16]  Marek Wydmuch,et al.  Probabilistic Label Trees for Extreme Multi-label Classification , 2020, ArXiv.

[17]  Anshumali Shrivastava,et al.  Extreme Classification in Log Memory using Count-Min Sketch: A Case Study of Amazon Search with 50M Products , 2019, NeurIPS.

[18]  Manik Varma,et al.  ECLARE: Extreme Classification with Label Graph Correlations , 2021, WWW.

[19]  Chun-Liang Li,et al.  Condensed Filter Tree for Cost-Sensitive Multi-Label Classification , 2014, ICML.

[20]  Brian D. Davison,et al.  Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Clusters for Extreme Multi-label Text Classification , 2020, ICML.

[21]  Manik Varma,et al.  Extreme Regression for Dynamic Search Advertising , 2020, WSDM.

[22]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[23]  Yukihiro Tagami,et al.  AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-label Classification , 2017, KDD.

[24]  Marek Wydmuch,et al.  Online probabilistic label trees , 2020, ArXiv.

[25]  Manik Varma,et al.  DECAF: Deep Extreme Classification with Label Features , 2021, WSDM.

[26]  Ming-Wei Chang,et al.  Latent Retrieval for Weakly Supervised Open Domain Question Answering , 2019, ACL.

[27]  Yiming Yang,et al.  XLNet: Generalized Autoregressive Pretraining for Language Understanding , 2019, NeurIPS.

[28]  Moustapha Cissé,et al.  Efficient softmax approximation for GPUs , 2016, ICML.

[29]  Hsuan-Tien Lin,et al.  Cost-sensitive label embedding for multi-label classification , 2017, Machine Learning.

[30]  Pradeep Ravikumar,et al.  PPDsparse: A Parallel Primal-Dual Sparse Method for Extreme Classification , 2017, KDD.

[31]  Ting Jiang,et al.  LightXML: Transformer with Dynamic Negative Sampling for High-Performance Extreme Multi-label Text Classification , 2021, AAAI.

[32]  Wei-Cheng Chang,et al.  Pre-training Tasks for Embedding-based Large-scale Retrieval , 2020, ICLR.

[33]  I. Dhillon,et al.  Taming Pretrained Transformers for Extreme Multi-label Text Classification , 2019, KDD.

[34]  Rohit Babbar,et al.  Bonsai - Diverse and Shallow Trees for Extreme Multi-label Classification , 2019, ArXiv.

[35]  Jian Jiao,et al.  GalaXC: Graph Neural Networks with Labelwise Attention for Extreme Classification , 2021, WWW.

[36]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[37]  Yury A. Malkov,et al.  Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Venkatesh Balasubramanian,et al.  Slice: Scalable Linear Extreme Classifiers Trained on 100 Million Labels for Related Searches , 2019, WSDM.

[39]  Timo Aila,et al.  A Style-Based Generator Architecture for Generative Adversarial Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Manik Varma,et al.  DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents , 2021, WSDM.

[41]  Omer Levy,et al.  RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.

[42]  Jordi Gonzàlez,et al.  A coarse-to-fine approach for fast deformable object detection , 2011, CVPR 2011.

[43]  Andrew J. Davison,et al.  Self-Supervised Generalisation with Meta Auxiliary Learning , 2019, NeurIPS.

[44]  Jaakko Lehtinen,et al.  Progressive Growing of GANs for Improved Quality, Stability, and Variation , 2017, ICLR.

[45]  Hsuan-Tien Lin,et al.  Advances in Cost-sensitive Multiclass and Multilabel Classification , 2019, KDD.

[46]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[47]  Inderjit S. Dhillon,et al.  Extreme Multi-label Learning for Semantic Matching in Product Search , 2021, KDD.

[48]  Róbert Busa-Fekete,et al.  A no-regret generalization of hierarchical softmax to extreme multi-label classification , 2018, NeurIPS.

[49]  Ali Mousavi,et al.  Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces , 2019, NeurIPS.