Algorithms and hardness results for parallel large margin learning

We consider the problem of learning an unknown large-margin halfspace in the context of parallel computation, giving both positive and negative results. As our main positive result, we give a parallel algorithm for learning a large-margin halfspace, based on an algorithm of Nesterov's that performs gradient descent with a momentum term. We show that this algorithm can learn an unknown γ-margin halfspace over n dimensions using n · poly(1/γ) processors and running in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have an inverse quadratic running-time dependence on the margin parameter γ. Our negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We prove that, in the original PAC framework in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized. More precisely, we show that if the booster is allowed to call the weak learner multiple times in parallel within a single boosting stage, this ability does not reduce the overall number of successive boosting stages needed for learning, even by a single stage. Our proof is information-theoretic and does not rely on unproven assumptions.
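As a rough, hypothetical illustration of the "gradient descent with a momentum term" idea behind the positive result (not the paper's actual algorithm or analysis), the sketch below runs Nesterov's accelerated gradient method on a smooth logistic surrogate for the halfspace-learning problem. The function name nesterov_halfspace, the surrogate loss, the step size, and the iteration count are all assumptions made here for concreteness; the coordinate-wise gradient computation is the part a parallel implementation would spread across n · poly(1/γ) processors, but it is written serially with NumPy here.

```python
# Hypothetical sketch (assumptions noted above): Nesterov's accelerated
# gradient method applied to a smooth logistic surrogate for learning a
# halfspace from labeled examples (X, y) with labels in {-1, +1}.
import numpy as np

def nesterov_halfspace(X, y, step=0.1, iters=500):
    """Return a unit-norm weight vector fit by accelerated gradient descent."""
    n_examples, n_dims = X.shape
    w = np.zeros(n_dims)       # current iterate
    w_prev = np.zeros(n_dims)  # previous iterate, used by the momentum term
    for t in range(1, iters + 1):
        momentum = (t - 1) / (t + 2)        # a standard Nesterov momentum schedule
        v = w + momentum * (w - w_prev)     # "look-ahead" point
        margins = np.clip(y * (X @ v), -30.0, 30.0)  # clip for numerical stability
        # Gradient of the average logistic loss log(1 + exp(-margin)) at v.
        # Each of the n coordinates is independent, which is where a parallel
        # implementation would assign one block of processors per coordinate.
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w_prev, w = w, v - step * grad
    return w / np.linalg.norm(w)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=20)
    w_true /= np.linalg.norm(w_true)
    X = rng.normal(size=(1000, 20))
    y = np.sign(X @ w_true)          # linearly separable synthetic data
    w_hat = nesterov_halfspace(X, y)
    print("training accuracy:", np.mean(np.sign(X @ w_hat) == y))
```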
