Multilabel reductions: what is my loss optimising?

Multilabel classification is a challenging problem arising in applications ranging from information retrieval to image tagging. A popular approach to this problem is to employ a reduction to a suitable series of binary or multiclass problems (e.g., computing a softmax-based cross-entropy over the relevant labels). While such methods have seen empirical success, less is understood about how well they approximate two fundamental performance measures: precision@$k$ and recall@$k$. In this paper, we study five commonly used reductions, including the one-versus-all reduction, a reduction to multiclass classification, and normalised versions of the same, wherein the contribution of each instance is normalised by the number of relevant labels. Our main result is a formal justification of each reduction: we explicate its underlying risk, and show that each is consistent with respect to either precision or recall. Further, we show that in general no reduction can be optimal for both measures. We empirically validate our results, demonstrating scenarios where normalised reductions yield recall gains over unnormalised counterparts.
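
To make the reductions concrete, below is a minimal NumPy sketch of a per-instance one-versus-all loss and a softmax cross-entropy summed over the relevant labels, with an optional normalisation by the number of relevant labels. The function names and exact forms here are illustrative simplifications written for this summary, not code from the paper.

```python
import numpy as np

def ova_loss(scores, relevant):
    """Illustrative one-versus-all reduction: an independent binary
    (sigmoid) cross-entropy per label, summed over all labels."""
    probs = 1.0 / (1.0 + np.exp(-scores))      # per-label sigmoid
    y = np.zeros_like(scores)
    y[relevant] = 1.0                          # 1 for relevant labels, 0 otherwise
    eps = 1e-12                                # guard against log(0)
    return -np.sum(y * np.log(probs + eps) + (1 - y) * np.log(1 - probs + eps))

def multiclass_loss(scores, relevant, normalise=False):
    """Illustrative reduction to multiclass classification: a softmax
    cross-entropy term for each relevant label; the normalised variant
    divides the instance's contribution by the number of relevant labels."""
    shifted = scores - np.max(scores)          # shift for numerical stability
    log_softmax = shifted - np.log(np.sum(np.exp(shifted)))
    loss = -np.sum(log_softmax[relevant])
    return loss / len(relevant) if normalise else loss

# Toy instance with 5 candidate labels, of which {0, 3} are relevant.
scores = np.array([2.0, -1.0, 0.5, 1.5, -0.3])
relevant = [0, 3]
print(ova_loss(scores, relevant))
print(multiclass_loss(scores, relevant))                   # unnormalised
print(multiclass_loss(scores, relevant, normalise=True))   # normalised
```

The only difference between the last two calls is the division by the number of relevant labels, which is exactly the per-instance normalisation the abstract refers to when contrasting normalised and unnormalised reductions.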
