OTSeq2Set: An Optimal Transport Enhanced Sequence-to-Set Model for Extreme Multi-label Text Classification

Extreme multi-label text classification (XMTC) is the task of finding the most relevant subset of labels from an extremely large-scale label collection. Recently, some deep learning models have achieved state-of-the-art results on XMTC tasks. These models commonly use a fully connected layer as the final layer to predict scores for all labels. However, such models cannot predict a complete, variable-length label subset for each document, because they select positive labels either by a fixed score threshold or by taking the top-k labels in descending order of scores. A less popular family of deep learning models, sequence-to-sequence (Seq2Seq) models, instead predicts a variable-length sequence of positive labels. However, the labels in XMTC tasks are essentially an unordered set rather than an ordered sequence, so imposing a default label order constrains Seq2Seq models during training. To address this limitation of Seq2Seq, we propose an autoregressive sequence-to-set model for XMTC tasks named OTSeq2Set. Our model generates predictions in a student-forcing scheme and is trained with a loss function based on bipartite matching, which makes training permutation-invariant. In addition, we use the optimal transport distance to encourage the model to focus on the nearest labels in the semantic label space. Experiments show that OTSeq2Set outperforms competitive baselines on four benchmark datasets. In particular, on the Wikipedia dataset with 31k labels, it outperforms the state-of-the-art Seq2Seq method by 16.34% in micro-F1 score. The code is available at https://github.com/caojie54/OTSeq2Set.
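To make the two training signals concrete, the sketch below shows, in plain PyTorch, (i) a permutation-invariant set loss that matches decoder slots to gold labels with the Hungarian algorithm (via SciPy's linear_sum_assignment) and (ii) an entropy-regularized (Sinkhorn) approximation of the optimal transport distance between predicted and gold label embeddings. This is a minimal illustration, not the released implementation: the slot and embedding shapes, the cosine cost, and the uniform marginals are assumptions of this sketch.

```python
# Minimal sketch (not the authors' implementation) of a bipartite-matching
# set loss and a Sinkhorn OT distance, as described in the abstract.
import math

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def bipartite_matching_loss(logits, targets):
    """Permutation-invariant loss over generated label slots.

    logits:  (num_slots, num_labels) raw scores, one row per decoding slot
    targets: (num_gold,) gold label ids, treated as an unordered set
    """
    log_probs = F.log_softmax(logits, dim=-1)          # (S, L)
    # cost[i, j] = negative log-likelihood of gold label j under slot i
    cost = -log_probs[:, targets]                      # (S, G)
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    row, col = torch.as_tensor(row), torch.as_tensor(col)
    return cost[row, col].mean()


def sinkhorn_ot_distance(pred_emb, gold_emb, eps=0.1, n_iters=50):
    """Entropy-regularized OT distance between two sets of label embeddings.

    pred_emb: (S, d) embeddings of predicted labels
    gold_emb: (G, d) embeddings of gold labels
    Cosine cost and uniform marginals are assumptions of this sketch.
    """
    cost = 1 - F.cosine_similarity(pred_emb.unsqueeze(1),
                                   gold_emb.unsqueeze(0), dim=-1)  # (S, G)
    S, G = cost.shape
    log_mu = torch.full((S,), -math.log(S))  # uniform mass over predictions
    log_nu = torch.full((G,), -math.log(G))  # uniform mass over gold labels
    f, g = torch.zeros(S), torch.zeros(G)
    for _ in range(n_iters):                 # log-domain Sinkhorn updates
        f = -eps * torch.logsumexp((g[None, :] - cost) / eps
                                   + log_nu[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - cost) / eps
                                   + log_mu[:, None], dim=0)
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps
                     + log_mu[:, None] + log_nu[None, :])
    return (plan * cost).sum()


# Toy usage: 10 decoding slots over a 31k-label vocabulary, 3 gold labels.
# The label embedding table is hypothetical; argmax is non-differentiable
# and is used here only to illustrate shapes.
logits = torch.randn(10, 31000)
targets = torch.tensor([5, 42, 137])
emb = torch.randn(31000, 300)
loss = (bipartite_matching_loss(logits, targets)
        + sinkhorn_ot_distance(emb[logits.argmax(-1)], emb[targets]))
```

The Hungarian matching makes the loss independent of label order, while the OT term gives partial credit for predictions that land near, but not exactly on, a gold label in embedding space.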

[1] Qi Zhang, et al. One2Set: Generating Diverse Keyphrases as a Set, 2021, ACL.

[2] Yueting Zhuang, et al. A Sequence-to-Set Network for Nested Named Entity Recognition, 2021, IJCAI.

[3] Ting Jiang, et al. LightXML: Transformer with Dynamic Negative Sampling for High-Performance Extreme Multi-label Text Classification, 2021, AAAI.

[4] Liqun Chen, et al. Improving Text Generation with Student-Forcing Optimal Transport, 2020, EMNLP.

[5] Aidong Zhang, et al. Correlation Networks for Extreme Multi-label Text Classification, 2020, KDD.

[6] Shuming Ma, et al. A Deep Reinforced Sequence-to-Set Model for Multi-Label Classification, 2019, ACL.

[7] Wei-Cheng Chang, et al. Taming Pretrained Transformers for Extreme Multi-label Text Classification, 2019, KDD.

[8] Felix Wu, et al. Pay Less Attention with Lightweight and Dynamic Convolutions, 2019, ICLR.

[9] Zhe Gan, et al. Improving Sequence-to-Sequence Learning via Optimal Transport, 2019, ICLR.

[10] Hiroshi Mamitsuka, et al. AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification, 2018, NeurIPS.

[11] Wei Wu, et al. SGM: Sequence Generation Model for Multi-label Classification, 2018, COLING.

[12] Hongyuan Zha, et al. A Fast Proximal Point Method for Computing Exact Wasserstein Distance, 2018, UAI.

[13] Yiming Yang, et al. Deep Learning for Extreme Multi-label Text Classification, 2017, SIGIR.

[14] S. Chopra, et al. Sequence Level Training with Recurrent Neural Networks, 2015, ICLR.

[15] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[16] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[17] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[18] Jure Leskovec, et al. Hidden factors and hidden topics: understanding rating dimensions with review text, 2013, RecSys.

[19] Razvan Pascanu, et al. On the difficulty of training recurrent neural networks, 2012, ICML.

[20] A. Zubiaga. Enhancing Navigation on Wikipedia with Social Tags, 2012, arXiv.

[21] Johannes Fürnkranz, et al. Efficient Pairwise Multilabel Classification for Large-Scale Problems in the Legal Domain, 2008, ECML/PKDD.

[22] Harold W. Kuhn. The Hungarian method for the assignment problem, 1955, 50 Years of Integer Programming.

[23] Johannes Fürnkranz, et al. Maximizing Subset Accuracy with Recurrent Neural Networks in Multi-label Classification, 2017, NIPS.

[24] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.