Learning Halfspaces with Massart Noise Under Structured Distributions

We study the problem of learning halfspaces with Massart noise in the distribution-specific PAC model. We give the first computationally efficient algorithm for this problem with respect to a broad family of distributions, including log-concave distributions. This resolves an open question posed in a number of prior works. Our approach is extremely simple: We identify a smooth *non-convex* surrogate loss with the property that any approximate stationary point of this loss defines a halfspace that is close to the target halfspace. Given this structural result, we can use SGD to solve the underlying learning problem.
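The approach described above can be illustrated with a small sketch. The following is *not* the paper's exact construction: the specific sigmoidal surrogate, the step-size schedule, and the Gaussian marginal are assumptions made for illustration. It generates labels from a target halfspace under Massart noise (each label flipped with probability at most `eta`), then runs projected SGD on a smooth non-convex sigmoidal loss of the margin, constrained to the unit sphere.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 5, 20000, 0.2

# Target halfspace and Gaussian (log-concave) examples.
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)

# Massart noise: example-dependent flip probability eta(x) <= eta.
flip = rng.random(n) < eta * rng.random(n)
y[flip] *= -1

def grad(w, x, yi, sigma=1.0):
    """Gradient of the smooth non-convex surrogate sigmoid(-margin/sigma)."""
    m = yi * (x @ w)
    s = 1.0 / (1.0 + np.exp(m / sigma))   # sigmoid of -m/sigma
    return -(s * (1.0 - s) / sigma) * yi * x

# Projected SGD: one pass over the data, re-normalizing onto the sphere.
w = rng.normal(size=d)
w /= np.linalg.norm(w)
for t in range(n):
    w -= 0.5 / np.sqrt(t + 1) * grad(w, X[t], y[t])
    w /= np.linalg.norm(w)

angle = np.arccos(np.clip(w @ w_star, -1.0, 1.0))
print(f"angle to target: {angle:.3f} rad")
```

The key structural point mirrored here is that only (approximate) stationarity of the surrogate on the sphere is needed, so plain SGD suffices; no convexity of the loss is assumed.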
