PPDsparse: A Parallel Primal-Dual Sparse Method for Extreme Classification

Extreme classification refers to multi-class or multi-label prediction with a very large number of classes, and is increasingly relevant to many real-world applications such as text and image tagging. In this setting, standard classification methods, whose complexity is linear in the number of classes, become intractable, while enforcing structural constraints among classes (such as low-rank or tree structure) to reduce complexity often sacrifices accuracy for efficiency. The recent PD-Sparse method addresses this via an algorithm that is sub-linear in the number of variables, by exploiting the primal-dual sparsity inherent in a specific loss function, namely the max-margin loss. In this work, we extend PD-Sparse to be efficiently parallelized in large-scale distributed settings. By introducing separable loss functions, we can scale out the training with network communication and space efficiency comparable to one-versus-all approaches, while maintaining an overall complexity sub-linear in the number of classes. On several large-scale benchmarks, our proposed method achieves accuracy competitive with the state of the art while, on a cluster of 100 cores, reducing training time from days (for existing parallel or sparse methods) to tens of minutes.
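To make the separable, one-versus-all decomposition concrete, the sketch below trains each class's max-margin sub-problem independently by dual coordinate ascent, so the class loop can be distributed across workers. This is a minimal illustration, not the authors' implementation: the squared per-class structure, the `joblib`-based parallelism, and the helper names (`train_one_class`, `train_ova`) are assumptions for exposition, and the primal-dual active-set search that makes PD-Sparse/PPDSparse sub-linear in the number of classes is not reproduced here.

```python
# Minimal sketch (assumption, not the paper's code): one-vs-all training with a
# separable max-margin (L1-hinge) loss, where each class's binary sub-problem is
# solved independently by dual coordinate ascent. Because the loss separates over
# classes, the outer loop can be parallelized across cores or machines.
import numpy as np
from joblib import Parallel, delayed

def train_one_class(X, y_bin, C=1.0, epochs=10):
    """Dual coordinate ascent for an L2-regularized L1-hinge binary SVM
    (in the spirit of Hsieh et al., 2008). X is a scipy CSR matrix."""
    n, d = X.shape
    alpha = np.zeros(n)          # dual variables; expected to stay sparse
    w = np.zeros(d)              # primal weights, w = sum_i alpha_i y_i x_i
    sq_norms = np.asarray(X.multiply(X).sum(axis=1)).ravel()
    for _ in range(epochs):
        for i in np.random.permutation(n):
            xi = X.getrow(i)
            grad = y_bin[i] * xi.dot(w)[0] - 1.0
            # Single-variable Newton step on alpha_i, projected onto [0, C]
            new_alpha = min(max(alpha[i] - grad / max(sq_norms[i], 1e-12), 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                w += delta * y_bin[i] * xi.toarray().ravel()
                alpha[i] = new_alpha
    return w

def train_ova(X, Y, n_jobs=4):
    """Train all classes independently; the class loop is embarrassingly parallel."""
    classes = np.unique(Y)
    weights = Parallel(n_jobs=n_jobs)(
        delayed(train_one_class)(X, np.where(Y == c, 1.0, -1.0)) for c in classes
    )
    return classes, np.vstack(weights)
```

Each worker only needs its own class's weight vector and the (shared, read-only) training data, which is what keeps the communication and memory footprint comparable to one-versus-all; the sub-linear behavior in the actual method comes from additionally restricting updates to the small active set of classes with non-zero duals.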
