Communication-Efficient Distributed Block Minimization for Nonlinear Kernel Machines

Nonlinear kernel machines often yield superior predictive performance on a variety of tasks; however, training them at scale poses severe computational challenges. In this paper, we show how to speed up kernel machine training on distributed systems. In particular, we develop a parallel block minimization framework and demonstrate its scalability for nonlinear kernel SVM and kernel logistic regression. The framework divides the problem into smaller subproblems by forming a block-diagonal approximation of the Hessian matrix. These subproblems are then solved approximately in parallel, and a communication-efficient line search procedure, which exploits the structure of kernel machines, ensures sufficient reduction of the objective function value. We prove a global linear convergence rate for the proposed method with a wide class of subproblem solvers, and our analysis covers strongly convex as well as some non-strongly convex functions. We apply our algorithm to solve large-scale kernel SVM problems on distributed systems and show a significant improvement over existing parallel solvers. As an example, on the covtype dataset with half a million samples, our algorithm obtains an approximate solution with 96% accuracy in 20 seconds using 32 machines, while all the other parallel kernel SVM solvers require more than 2000 seconds to achieve a solution with 95% accuracy. Moreover, our algorithm is the first distributed kernel SVM solver that can scale to massive datasets. On the KDDB dataset (20 million samples and 30 million features), our parallel solver computes the kernel SVM solution within half an hour using 32 machines with 640 cores in total, while existing solvers cannot scale to this dataset.
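As a rough illustration of the block-minimization idea described above (not the authors' implementation), the following single-machine Python sketch partitions the dual variables of a kernel SVM into blocks, solves each block subproblem approximately using only the corresponding diagonal block of the kernel matrix Q (the block-diagonal Hessian approximation), and then combines the block updates through a simple step-size search on the dual objective. All function names (parallel_block_minimization, solve_block, rbf_kernel) are hypothetical, the block loop stands in for the parallel workers, and the line search here evaluates the full objective directly rather than using the paper's communication-efficient procedure.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Gaussian (RBF) kernel between rows of X and Y.
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * d2)

def dual_objective(alpha, Q):
    # Kernel SVM dual (minimization form): 0.5 * a'Qa - sum(a).
    return 0.5 * alpha @ (Q @ alpha) - alpha.sum()

def solve_block(alpha, grad, Q_BB, block, C, inner_iters=5):
    """Approximately minimize the subproblem restricted to one block with a few
    coordinate-descent passes, holding all other blocks fixed (block-diagonal
    approximation: only Q_BB is needed)."""
    d = np.zeros(len(block))
    g = grad[block].copy()  # gradient of f(alpha + d) restricted to this block
    for _ in range(inner_iters):
        for j in range(len(block)):
            if Q_BB[j, j] <= 0:
                continue
            step = -g[j] / Q_BB[j, j]  # unconstrained coordinate minimizer
            new_dj = np.clip(alpha[block[j]] + d[j] + step, 0.0, C) - alpha[block[j]]
            delta = new_dj - d[j]
            if delta != 0.0:
                g += delta * Q_BB[:, j]  # keep the block-local gradient current
                d[j] = new_dj
    return d

def parallel_block_minimization(Q, C=1.0, n_blocks=4, outer_iters=20):
    n = Q.shape[0]
    alpha = np.zeros(n)
    blocks = np.array_split(np.random.permutation(n), n_blocks)
    for _ in range(outer_iters):
        grad = Q @ alpha - 1.0
        # Each block subproblem is independent and could run on its own worker.
        d = np.zeros(n)
        for block in blocks:
            d[block] = solve_block(alpha, grad, Q[np.ix_(block, block)], block, C)
        # Step-size search on the combined direction for sufficient decrease.
        best_beta, best_obj = 0.0, dual_objective(alpha, Q)
        for beta in (1.0, 0.5, 0.25, 0.125):
            cand = np.clip(alpha + beta * d, 0.0, C)
            obj = dual_objective(cand, Q)
            if obj < best_obj:
                best_beta, best_obj = beta, obj
        alpha = np.clip(alpha + best_beta * d, 0.0, C)
    return alpha

if __name__ == "__main__":
    # Toy synthetic problem just to exercise the sketch.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))
    Q = (y[:, None] * y[None, :]) * rbf_kernel(X, X)
    alpha = parallel_block_minimization(Q)
    print("dual objective:", dual_objective(alpha, Q))
```

In a real distributed setting each worker would hold only its block of training points and of Q, and the combined direction and step size would be agreed upon with a small amount of communication; the sketch simulates this sequentially for clarity.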
