Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins

We analyze the properties of gradient descent on convex surrogates of the zero-one loss for agnostic learning of linear halfspaces. Let $\mathsf{OPT}$ denote the best classification error achieved by any halfspace. By appealing to the notion of soft margins, we show that gradient descent finds halfspaces with classification error $\tilde O(\mathsf{OPT}^{1/2}) + \varepsilon$ in $\mathrm{poly}(d,1/\varepsilon)$ time and sample complexity for a broad class of distributions that includes log-concave isotropic distributions as a subclass. Along the way, we answer a question recently posed by Ji et al. (2020) on how the tail behavior of a loss function can affect the sample complexity and runtime guarantees of gradient descent.
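To make the approach described above concrete, the sketch below runs full-batch gradient descent on one convex surrogate of the zero-one loss (the logistic loss, chosen purely for illustration) and measures the misclassification rate of the resulting halfspace on synthetic data. The surrogate, step size, iteration count, noise model, and helper names (`surrogate_gd`, `zero_one_error`) are assumptions made for this sketch, not the specific loss, parameters, or analysis used in the paper.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def zero_one_error(w, X, y):
    """Fraction of points misclassified by the halfspace x -> sign(<w, x>)."""
    return float(np.mean(np.sign(X @ w) != y))

def surrogate_gd(X, y, step_size=0.1, n_iters=1000):
    """Full-batch gradient descent on a convex surrogate of the zero-one loss.

    Here the surrogate is the logistic loss
        L(w) = mean_i log(1 + exp(-y_i <w, x_i>)),
    which upper-bounds (a constant multiple of) the zero-one loss of the
    halfspace x -> sign(<w, x>).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        margins = y * (X @ w)                        # y_i <w, x_i>
        # d/dm log(1 + exp(-m)) = -sigmoid(-m), so the gradient in w is
        # the average of -sigmoid(-m_i) * y_i * x_i over the sample.
        grad = -(X * (y * _sigmoid(-margins))[:, None]).mean(axis=0)
        w -= step_size * grad
    return w

# Toy usage: isotropic Gaussian marginals with labels from a ground-truth
# halfspace, a small fraction of which are flipped to mimic agnostic noise.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n, noise_rate = 20, 5000, 0.05
    w_star = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = np.sign(X @ w_star)
    y[rng.random(n) < noise_rate] *= -1

    w_hat = surrogate_gd(X, y)
    print("empirical zero-one error:", zero_one_error(w_hat, X, y))
```

On toy data of this kind, the error of the learned halfspace roughly tracks the injected noise rate, which mirrors the flavor of guarantee (error scaling with $\mathsf{OPT}$) that the paper makes rigorous under its distributional assumptions.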

[1] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
[2] B. E. Boser et al. A training algorithm for optimal margin classifiers. COLT, 1992.
[3] R. H. Sloan. Corrigendum to types of noise in data for concept learning. COLT, 1992.
[4] A. M. Frieze et al. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica, 1996.
[5] R. A. Servedio. On PAC learning using Winnow, Perceptron, and a Perceptron-like algorithm. COLT, 1999.
[6] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. 2003.
[7] R. Schapire et al. Toward efficient agnostic learning. COLT, 1992.
[8] D. Angluin et al. Learning from noisy examples. Machine Learning, 1988.
[9] R. A. Servedio et al. Agnostically learning halfspaces. FOCS, 2005.
[10] M. I. Jordan et al. Convexity, classification, and risk bounds. 2006.
[11] P. Raghavendra et al. Hardness of learning halfspaces with noise. FOCS, 2006.
[12] S. Vempala et al. The geometry of logconcave functions and sampling algorithms. 2007.
[13] P. Massart et al. Risk bounds for statistical learning. arXiv:math/0702683, 2007.
[14] N. Srebro et al. Fast rates for regularized objectives. NIPS, 2008.
[15] A. Tewari et al. Smoothness, low noise and fast rates. NIPS, 2010.
[16] J. Langford et al. Contextual bandit algorithms with supervised learning guarantees. AISTATS, 2010.
[17] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, 2010.
[18] N. Srebro et al. Minimizing the misclassification error rate using a surrogate convex loss. ICML, 2012.
[19] S. Ben-David et al. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
[20] M.-F. Balcan et al. Efficient learning of linear separators under bounded noise. COLT, 2015.
[21] M.-F. Balcan et al. Learning and 1-bit compressed sensing under asymmetric noise. COLT, 2016.
[22] A. Daniely. Complexity theoretic limitations on learning halfspaces. STOC, 2015.
[23] M.-F. Balcan et al. The power of localization for efficiently learning linear separators with noise. Journal of the ACM, 2013.
[24] M.-F. Balcan et al. Sample and computationally efficient learning algorithms under s-concave distributions. NIPS, 2017.
[25] K. Sridharan et al. Uniform convergence of gradients for non-convex learning and optimization. NeurIPS, 2018.
[26] A. R. Klivans et al. Time/accuracy tradeoffs for learning a ReLU with respect to Gaussian marginals. NeurIPS, 2019.
[27] X. Zhang et al. Learning one-hidden-layer ReLU networks via gradient descent. AISTATS, 2018.
[28] C. Tzamos et al. Distribution-independent PAC learning of halfspaces with Massart noise. NeurIPS, 2019.
[29] Y. Cao et al. Algorithm-dependent generalization bounds for overparameterized deep residual networks. NeurIPS, 2019.
[30] M. Telgarsky et al. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. ICLR, 2019.
[31] A. R. Klivans et al. Approximation schemes for ReLU regression. COLT, 2020.
[32] Q. Gu et al. Agnostic learning of a single neuron with gradient descent. NeurIPS, 2020.
[33] Q. Gu et al. Generalization error bounds of gradient descent for learning over-parameterized deep ReLU networks. AAAI, 2019.
[34] D. M. Kane et al. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. NeurIPS, 2020.
[35] M.-F. Balcan et al. Noise in classification. In Beyond the Worst-Case Analysis of Algorithms, 2020.
[36] A. R. Klivans et al. Statistical-query lower bounds via functional gradients. NeurIPS, 2020.
[37] C. Tzamos et al. Non-convex SGD learns halfspaces with adversarial label noise. NeurIPS, 2020.
[38] C. Tzamos et al. Learning halfspaces with Tsybakov noise. arXiv preprint, 2020.
[39] M. Telgarsky et al. Gradient descent follows the regularization path for general losses. COLT, 2020.
[40] C. Tzamos et al. Learning halfspaces with Massart noise under structured distributions. COLT, 2020.
[41] O. Shamir et al. Gradient methods never overfit on separable data. Journal of Machine Learning Research, 2020.