sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification

Multiclass multilabel classification is the task of attributing multiple labels to examples via predictions. Current models reduce the multilabel setting to either multiple binary classifications or a multiclass classification, which allows the use of existing loss functions (sigmoid, cross-entropy, logistic, etc.). These reductions do not accommodate the prediction of a varying number of labels per example, and the underlying losses are distant estimates of the performance metrics. We propose a loss function, sigmoidF1, which is an approximation of the F1 score that (1) is smooth and tractable for stochastic gradient descent, (2) naturally approximates a multilabel metric, and (3) estimates label propensities and label counts. We show that any confusion matrix metric can be formulated with a smooth surrogate. We evaluate the proposed loss function on text and image datasets, using a variety of metrics to account for the complexity of multilabel classification evaluation. sigmoidF1 outperforms other loss functions on one text and two image datasets across several metrics. These results show the effectiveness of using inference-time metrics as loss functions for non-trivial classification problems like multilabel classification.
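To illustrate what a smooth surrogate of a confusion matrix metric can look like, here is a minimal NumPy sketch. It is not taken from the abstract: the function names, the slope and offset parameters `beta` and `eta`, and the choice to sum the soft counts over the whole batch are assumptions made for illustration. The idea is to replace hard thresholded predictions with sigmoid outputs, so that the true positive, false positive, and false negative counts become differentiable sums.

```python
import numpy as np


def sigmoid(u, beta=1.0, eta=0.0):
    """Sigmoid with a tunable slope (beta) and offset (eta); both are illustrative assumptions."""
    return 1.0 / (1.0 + np.exp(-beta * (u + eta)))


def smooth_f1_loss(logits, targets, beta=1.0, eta=0.0, eps=1e-12):
    """Return 1 - soft F1, where confusion matrix counts are built from sigmoid outputs.

    logits:  array of shape (batch, labels) with raw model scores.
    targets: array of the same shape with 0/1 ground-truth labels.
    """
    s = sigmoid(logits, beta, eta)
    # Soft confusion matrix entries, summed over every example and label.
    tp = np.sum(s * targets)
    fp = np.sum(s * (1.0 - targets))
    fn = np.sum((1.0 - s) * targets)
    soft_f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - soft_f1


# Hypothetical toy batch: two examples, three labels, varying label counts.
targets = np.array([[1.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0]])
confident_correct = (2.0 * targets - 1.0) * 10.0  # large scores with the right sign
confident_wrong = -confident_correct              # large scores with the wrong sign

loss_good = smooth_f1_loss(confident_correct, targets)
loss_bad = smooth_f1_loss(confident_wrong, targets)
```

Under these assumptions the loss approaches 0 when the scores agree confidently with the labels and approaches 1 when they disagree. Because every operation is smooth in the logits, the same expression can be written in an automatic differentiation framework and minimized with stochastic gradient descent.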
