On label dependence and loss minimization in multi-label classification

Most multi-label classification (MLC) methods proposed in recent years have sought to exploit, in one way or another, dependencies between the class labels. Compared to simple binary relevance learning as a baseline, any gain in performance is normally explained by the fact that this baseline ignores such dependencies. Without questioning the correctness of such studies, one has to admit that a blanket explanation of this kind hides many subtle details; indeed, the underlying mechanisms and true reasons for the improvements reported in experimental studies are rarely laid bare. Rather than proposing yet another MLC algorithm, the aim of this paper is to elaborate more closely on the idea of exploiting label dependence, thereby contributing to a better understanding of MLC. Adopting a statistical perspective, we claim that two types of label dependence should be distinguished, namely conditional and marginal dependence. Subsequently, we present three scenarios in which exploiting one of these types of dependence may boost the predictive performance of a classifier. In this regard, a close connection with loss minimization is established, showing that the benefit of exploiting label dependence also depends on the type of loss to be minimized. Concrete theoretical results are presented for two representative loss functions, namely the Hamming loss and the subset 0/1 loss. In addition, we give an overview of state-of-the-art decomposition algorithms for MLC and try to reveal the reasons for their effectiveness. Our conclusions are supported by carefully designed experiments on synthetic and benchmark data.
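
To make the distinction between the two loss functions concrete, here is a minimal illustrative sketch (not code from the paper): Hamming loss averages errors over individual labels, whereas subset 0/1 loss penalizes any mismatch in the complete label vector, so the two reward different prediction strategies.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual labels predicted incorrectly."""
    assert len(y_true) == len(y_pred)
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def subset_zero_one_loss(y_true, y_pred):
    """1 if the predicted label vector differs from the truth anywhere, else 0."""
    return int(y_true != y_pred)

# One wrong label out of four: small Hamming loss, but maximal subset 0/1 loss.
truth = [1, 0, 1, 0]
pred = [1, 0, 0, 0]
print(hamming_loss(truth, pred))          # 0.25
print(subset_zero_one_loss(truth, pred))  # 1
```

This asymmetry is exactly why the risk minimizers differ: Hamming loss is minimized label-by-label via marginal probabilities, while subset 0/1 loss calls for the joint mode of the label distribution.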
