A Theory of PAC Learnability of Partial Concept Classes

We extend the classical theory of PAC learning in a way that allows us to model a rich variety of practical learning tasks in which the data satisfy special properties that ease the learning process, for example, tasks where the distance of the data from the decision boundary is bounded away from zero, or tasks where the data lie on a lower-dimensional surface. The basic idea is simple: we consider partial concepts, i.e., functions that may be undefined on certain parts of the space. When learning a partial concept, we assume that the source distribution is supported only on points where the partial concept is defined. In this way, one can naturally express assumptions on the data, such as lying on a lower-dimensional surface or satisfying margin conditions. In contrast, it is not at all clear that such assumptions can be expressed by the traditional PAC theory using learnable total concept classes, and in fact we exhibit easy-to-learn partial concept classes which provably cannot be captured by the traditional PAC theory. This also resolves, in a strong negative sense, a question posed by Attias, Kontorovich, and Mansour (2019). We characterize PAC learnability of partial concept classes and reveal an algorithmic landscape that is fundamentally different from the classical one. For example, in the classical PAC model, learning boils down to Empirical Risk Minimization (ERM); this basic principle follows from uniform convergence and the Fundamental Theorem of PAC Learning (Vapnik and Chervonenkis, 1971, 1974a; Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989). In stark contrast, we show that the ERM principle fails spectacularly in explaining the learnability of partial concept classes. In fact, we demonstrate classes that are remarkably easy to learn, yet any algorithm that learns them must use a hypothesis space with unbounded VC dimension. We also find that the sample compression conjecture of Littlestone and Warmuth fails in this setting. Our impossibility results hinge on recent breakthroughs in communication complexity and graph theory by Göös (2015), Ben-David, Hatami, and Tal (2017), and Balodis, Ben-David, Göös, Jain, and Kothari (2021). Thus, this theory features problems that cannot be represented, nor solved, in the traditional way. We view this as evidence that it may provide insights into the nature of learnability in realistic scenarios which the classical theory fails to explain. We conclude with suggestions for future research and open problems in several contexts, including combinatorics, geometry, and learning theory.
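To make the notion of a partial concept concrete, here is a minimal sketch (in Python with NumPy) of the margin example mentioned above: a γ-margin halfspace is modeled as a partial concept that outputs the sign of ⟨w, x⟩ on points at distance at least γ from the decision boundary and is undefined inside the margin, and the data distribution is assumed to be supported only where the concept is defined. The names `MarginHalfspace`, `STAR`, and the rejection-sampling routine are illustrative choices introduced here, not notation from the paper.

```python
import numpy as np

STAR = None  # stands in for the "undefined" label of a partial concept


class MarginHalfspace:
    """A partial concept: sign(<w, x>) when |<w, x>| >= gamma, undefined otherwise."""

    def __init__(self, w, gamma):
        self.w = w / np.linalg.norm(w)  # unit normal of the halfspace
        self.gamma = gamma

    def __call__(self, x):
        margin = float(np.dot(self.w, x))
        if abs(margin) < self.gamma:
            return STAR  # undefined inside the margin region
        return 1 if margin > 0 else 0


def sample_supported_data(concept, n, dim, rng):
    """Draw points only where the partial concept is defined, mimicking the
    assumption that the source distribution avoids undefined points."""
    xs, ys = [], []
    while len(xs) < n:
        x = rng.standard_normal(dim)
        y = concept(x)
        if y is not STAR:  # rejection sampling keeps only labeled points
            xs.append(x)
            ys.append(y)
    return np.array(xs), np.array(ys)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, gamma = 5, 0.5
    target = MarginHalfspace(rng.standard_normal(dim), gamma)

    X_train, y_train = sample_supported_data(target, 200, dim, rng)

    # A plain perceptron as the learner; because every training point has
    # margin at least gamma, a few passes over the sample suffice here.
    w_hat = np.zeros(dim)
    for _ in range(10):
        for x, y in zip(X_train, y_train):
            if np.sign(np.dot(w_hat, x)) != (1 if y == 1 else -1):
                w_hat += (2 * y - 1) * x

    X_test, y_test = sample_supported_data(target, 1000, dim, rng)
    preds = (X_test @ w_hat > 0).astype(int)
    print("test error on the supported region:", np.mean(preds != y_test))
```

If one additionally restricts the data to a bounded ball, the class of all γ-margin halfspaces, viewed as partial concepts in this way, has VC dimension on the order of 1/γ² regardless of the ambient dimension, which is what makes margin assumptions expressible and learnable within the framework described above.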

[1] Gintare Karolina Dziugaite et al. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, 2017, UAI.

[2] Manfred K. Warmuth et al. Relating Data Compression and Learnability, 2003.

[3] Matus Telgarsky et al. Spectrally-normalized margin bounds for neural networks, 2017, NIPS.

[4] Ibrahim M. Alabdulmohsin et al. What Do Neural Networks Learn When Trained With Random Labels?, 2020, NeurIPS.

[5] Peter E. Hart et al. Nearest neighbor pattern classification, 1967, IEEE Trans. Inf. Theory.

[6] Samy Bengio et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.

[7] Siddhartha Jain et al. Unambiguous DNFs and Alon-Saks-Seymour, 2021.

[8] Vladimir Vovk et al. Aggregating strategies, 1990, COLT '90.

[9] E. Rowland. Theory of Games and Economic Behavior, 1946, Nature.

[10] Mika Göös. Lower Bounds for Clique vs. Independent Set, 2015, FOCS.

[11] Norbert Sauer. On the Density of Families of Sets, 1972, J. Comb. Theory A.

[12] Kaspars Balodis. Several Separations Based on a Partial Boolean Function, 2021, arXiv.

[13] Ralf Herbrich et al. Algorithmic Luckiness, 2001, J. Mach. Learn. Res.

[14] Ramon van Handel. The universal Glivenko–Cantelli property, 2010, arXiv:1009.4434.

[15] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain, 1958, Psychological Review.

[16] Vitaly Feldman et al. When is memorization of irrelevant training data necessary for high-accuracy learning?, 2020, STOC.

[17] Lee-Ad Gottlieb et al. Near-Optimal Sample Compression for Nearest Neighbors, 2014, IEEE Transactions on Information Theory.

[18] Thomas G. Dietterich. Adaptive computation and machine learning, 1998.

[19] Yishay Mansour et al. Improved generalization bounds for robust learning, 2018, ALT.

[20] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities, 1971.

[21] O. Bousquet et al. Predicting Neural Network Accuracy from Weights, 2020, arXiv.

[22] Shalev Ben-David. Low-Sensitivity Functions from Unambiguous Certificates, 2017, ITCS.

[23] Balas K. Natarajan et al. On learning sets and functions, 2004, Machine Learning.

[24] Nicholas J. A. Harvey et al. Near-optimal Sample Complexity Bounds for Robust Learning of Gaussian Mixtures via Compression Schemes, 2017, J. ACM.

[25] J. Zico Kolter et al. Uniform convergence may be unable to explain generalization in deep learning, 2019, NeurIPS.

[26] Ulrike von Luxburg et al. Distance-Based Classification with Lipschitz Functions, 2004, J. Mach. Learn. Res.

[27] Manfred K. Warmuth et al. The Weighted Majority Algorithm, 1994, Inf. Comput.

[28] M. Talagrand. Sharper Bounds for Gaussian and Empirical Processes, 1994.

[29] Andrew C. Singer et al. Universal linear prediction by model order weighting, 1999, IEEE Trans. Signal Process.

[30] John Shawe-Taylor et al. Structural Risk Minimization Over Data-Dependent Hierarchies, 1998, IEEE Trans. Inf. Theory.

[31] Nathan Srebro et al. Exploring Generalization in Deep Learning, 2017, NIPS.

[32] Cynthia Dwork et al. Calibrating Noise to Sensitivity in Private Data Analysis, 2006, TCC.

[33] Ioannis Mitliagkas et al. In Search of Robust Measures of Generalization, 2020, NeurIPS.

[34] Mika Göös et al. Unambiguous DNFs from Hex, 2021, Electron. Colloquium Comput. Complex.

[35] Shay Moran et al. Private Learning Implies Online Learning: An Efficient Reduction, 2019, NeurIPS.

[36] Gintare Karolina Dziugaite et al. On the role of data in PAC-Bayes bounds, 2021, AISTATS.

[37] David Haussler et al. A general lower bound on the number of examples needed for learning, 1989.

[38] David Haussler et al. Learnability and the Vapnik-Chervonenkis dimension, 1989, JACM.

[39] Boaz Barak et al. Deep double descent: where bigger models and more data hurt, 2019, ICLR.

[40] John Shawe-Taylor et al. PAC-Bayesian Compression Bounds on the Prediction Error of Learning Algorithms for Classification, 2005, Machine Learning.

[41] Vladimir Vapnik. Statistical learning theory, 1998.

[42] David Haussler et al. How to use expert advice, 1993, STOC.

[43] Manfred K. Warmuth et al. Averaging Expert Predictions, 1999, EuroCOLT.

[44] Gábor Lugosi et al. Prediction, learning, and games, 2006.

[45] Nicolas Bousquet et al. Clique versus independent set, 2013, Eur. J. Comb.

[46] Aryeh Kontorovich et al. Sample Compression for Real-Valued Learners, 2018, ALT.

[47] Vitaly Feldman et al. Does learning require memorization? A short tale about a long tail, 2019, STOC.

[48] Ran El-Yaniv et al. A compression technique for analyzing disagreement-based active learning, 2014, J. Mach. Learn. Res.

[49] Aryeh Kontorovich et al. Exact Lower Bounds for the Agnostic Probably-Approximately-Correct (PAC) Machine Learning Model, 2016, The Annals of Statistics.

[50] S. Szarek. Metric Entropy of Homogeneous Spaces, 1997, arXiv:math/9701213.

[51] Úlfar Erlingsson et al. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response, 2014, CCS.

[52] Mikhail Belkin et al. Reconciling modern machine-learning practice and the classical bias–variance trade-off, 2018, Proceedings of the National Academy of Sciences.

[53] Shay Moran et al. Sample compression schemes for VC classes, 2015, Information Theory and Applications Workshop (ITA).

[54] Soumendu Sundar Mukherjee et al. Weak convergence and empirical processes, 2019.

[55] Aaron Roth et al. The Algorithmic Foundations of Differential Privacy, 2014, Found. Trends Theor. Comput. Sci.

[56] Aryeh Kontorovich et al. Nearest-Neighbor Sample Compression: Efficiency, Consistency, Infinite Dimensions, 2017, NIPS.

[57] Badih Ghazi et al. Sample-efficient proper PAC learning with approximate differential privacy, 2021, STOC.

[58] Shai Ben-David et al. Understanding Machine Learning: From Theory to Algorithms, 2014.

[59] Vladimir Vovk et al. Universal Forecasting Algorithms, 1992, Inf. Comput.

[60] Jonathan Ullman et al. Efficient Private Algorithms for Learning Large-Margin Halfspaces, 2020, ALT.

[61] Philip M. Long et al. Characterizations of Learnability for Classes of {0, ..., n}-Valued Functions, 1995, J. Comput. Syst. Sci.

[62] Manfred K. Warmuth. Compressing to VC Dimension Many Points, 2003, COLT.

[63] Shay Moran et al. Supervised learning through the lens of compression, 2016, NIPS.

[64] Ruth Urner et al. Probabilistic Lipschitzness: A niceness assumption for deterministic labels, 2013.

[65] Roi Livni et al. An Equivalence Between Private Classification and Online Prediction, 2020, FOCS.

[66] Steve Hanneke. The Optimal Sample Complexity of PAC Learning, 2015, J. Mach. Learn. Res.

[67] N. Littlestone. Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm, 1987, FOCS.

[68] Sanjoy Dasgupta et al. Rates of Convergence for Nearest Neighbor Classification, 2014, NIPS.

[69] Claes Johnson et al. Mathematics and Computation, 2023, Springer Proceedings in Mathematics & Statistics.

[70] Massimiliano Pontil et al. Empirical Bernstein Bounds and Sample-Variance Penalization, 2009, COLT.

[71] Shai Ben-David et al. Agnostic Online Learning, 2009, COLT.

[72] Noga Alon et al. Adversarial laws of large numbers and optimal regret in online classification, 2021, STOC.

[73] Leslie G. Valiant et al. A general lower bound on the number of examples needed for learning, 1988, COLT '88.

[74] Noga Alon et al. Private PAC learning implies finite Littlestone dimension, 2018, STOC.