Least Squares Revisited: Scalable Approaches for Multi-class Prediction

This work provides simple algorithms for multiclass (and multi-label) prediction in settings where both the number of examples n and the data dimension d are relatively large. These robust, parameter-free algorithms are essentially iterative least-squares updates and are versatile both in theory and in practice. On the theoretical front, we present several variants with convergence guarantees. Owing to their effective use of second-order structure, these algorithms substantially outperform first-order methods in many practical scenarios. On the empirical side, we show how to scale our approach to high-dimensional datasets, achieving dramatic computational speedups over popular optimization packages such as Liblinear and Vowpal Wabbit on standard datasets (MNIST and CIFAR-10) while attaining state-of-the-art accuracies.
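To make the flavor of such updates concrete, here is a minimal sketch of an iterative least-squares baseline for multiclass prediction. It is an illustration under assumptions of our own (NumPy, one-hot targets, the ridge parameter `reg`, and the name `multiclass_least_squares` are all ours), not the algorithm from the paper itself.

```python
import numpy as np

def multiclass_least_squares(X, y, num_classes, reg=1e-3, iters=10):
    """Iterative least-squares sketch for multiclass prediction.

    Regress one-hot labels on features, then refine the weights with
    repeated least-squares corrections against the current residual,
    reusing a single factorization of the regularized Gram matrix.
    """
    n, d = X.shape
    Y = np.eye(num_classes)[y]          # one-hot label matrix, shape (n, k)
    # Form and invert the regularized Gram matrix once. Reusing this
    # second-order information is what each cheap iteration exploits.
    A = X.T @ X / n + reg * np.eye(d)
    A_inv = np.linalg.inv(A)            # fine for moderate d; prefer Cholesky at scale
    W = np.zeros((d, num_classes))
    for _ in range(iters):
        R = Y - X @ W                   # residual under current weights
        # The first pass yields the ridge solution A^{-1} X^T Y / n;
        # later passes push W toward the unregularized least-squares fit.
        W += A_inv @ (X.T @ R) / n
    return W

# Usage: score each class and predict the argmax.
# W = multiclass_least_squares(X_train, y_train, num_classes=10)
# preds = (X_test @ W).argmax(axis=1)
```

The point of this shape of update is that the curvature matrix is factored once and amortized across iterations, so each pass costs a multiplication rather than a fresh d x d solve; that is one way second-order structure can be reused cheaply, in the spirit of the speedups the abstract describes.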
