Analysis and Optimization of Loss Functions for Multiclass, Top-k, and Multilabel Classification

Top-k error is currently a popular performance measure on large-scale image classification benchmarks such as ImageNet and Places. Despite its wide acceptance, our understanding of this metric is limited, as most previous research has focused on its special case, the top-1 error. In this work, we explore two directions that shed more light on the top-k error. First, we provide an in-depth analysis of established and recently proposed single-label multiclass methods, along with a detailed account of efficient optimization algorithms for them. Our results indicate that the softmax loss and the smooth multiclass SVM are surprisingly competitive in top-k error uniformly across all k, which can be explained by our analysis of multiclass top-k calibration. Further improvements for a specific k are possible with a number of proposed top-k loss functions. Second, we use the top-k methods to explore the transition from multiclass to multilabel learning. In particular, we find that it is possible to obtain effective multilabel classifiers on Pascal VOC using a single label per image for training, while the gap between multiclass and multilabel methods on MS COCO is more significant. Finally, our contribution of efficient algorithms for training with the considered top-k and multilabel loss functions is of independent interest.
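For readers unfamiliar with the metric, the top-k error discussed above can be illustrated with a minimal sketch. It assumes the usual definition: a prediction counts as correct when the ground-truth class is among the k highest-scoring classes. The function name `top_k_error`, the toy scores, and the tie-handling rule (ties are resolved in favor of the true class) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def top_k_error(scores, labels, k=1):
    """Fraction of examples whose true class is not among the k top-scoring classes.

    scores: (n_examples, n_classes) array of class scores.
    labels: (n_examples,) array of ground-truth class indices.
    Assumption: ties are resolved in favor of the true class.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Score assigned to the ground-truth class of each example.
    true_scores = scores[np.arange(len(labels)), labels]
    # Number of classes that strictly outscore the ground-truth class.
    n_higher = (scores > true_scores[:, None]).sum(axis=1)
    # An example is an error when at least k classes rank above the true one.
    return float(np.mean(n_higher >= k))


# Hypothetical toy data: 4 examples, 3 classes.
scores = np.array([
    [0.10, 0.60, 0.30],
    [0.50, 0.20, 0.30],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
])
labels = np.array([2, 0, 0, 1])

for k in (1, 2, 3):
    print(f"top-{k} error: {top_k_error(scores, labels, k):.2f}")
```

With k = 1 this reduces to the standard multiclass (top-1) error, and the error can only decrease or stay the same as k grows.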
