Loss Functions for Top-k Error: Analysis and Insights

To push performance on realistic computer vision tasks, modern benchmark datasets have grown significantly in the number of classes in recent years. More classes bring more ambiguity between class labels, raising the question of whether top-1 error is the right performance measure. In this paper, we provide an extensive comparison and evaluation of established multiclass methods, examining their top-k performance from both a practical and a theoretical perspective. Moreover, we introduce novel top-k loss functions as modifications of the softmax and multiclass SVM losses and provide efficient optimization schemes for them. In the experiments, we compare all of the proposed and established methods for top-k error optimization on various datasets. An interesting insight of this paper is that the softmax loss yields competitive top-k performance for all k simultaneously. For a specific top-k error, our new top-k losses typically lead to further improvements while being faster to train than the softmax.
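For concreteness, the top-k error referred to above can be read as the fraction of examples whose ground-truth class is not among the k highest-scoring predictions. The following is a minimal sketch under that reading, not the evaluation code used in the paper; the function name `top_k_error`, the toy scores, and the arbitrary handling of ties are assumptions made for illustration.

```python
import numpy as np


def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k top-scoring classes.

    Hypothetical helper: `scores` has shape (n_examples, n_classes) and
    `labels` has shape (n_examples,). Ties between scores are broken
    arbitrarily, which is an assumption of this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Indices of the k largest scores per row (order within the top k is irrelevant).
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # An example is a "hit" if its true label appears among those k indices.
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()


# Toy example with 3 examples and 3 classes (made-up numbers).
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [2, 0, 1]

print(top_k_error(scores, labels, k=1))  # two of three examples are misclassified
print(top_k_error(scores, labels, k=2))  # every true label is within the top 2
```

In this toy case the top-1 error is 2/3 while the top-2 error is 0, which illustrates why top-1 error may be too strict a measure when class labels are ambiguous.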
