Probabilistic Analysis of Learning in Artificial Neural Networks: The PAC Model and its Variants

There are a number of mathematical approaches to the study of learning and generalization in artificial neural networks. Here we survey the 'probably approximately correct' (PAC) model of learning and some of its variants. These models provide a probabilistic framework for the discussion of generalization and learning. This survey concentrates on the sample-complexity questions in these models; that is, the emphasis is on how many examples should be used for training. Computational complexity considerations are briefly discussed for the basic PAC model. Throughout, the importance of the Vapnik-Chervonenkis dimension is highlighted. Particular attention is devoted to describing how the probabilistic models apply in the context of neural network learning, both for networks with binary-valued output and for networks with real-valued output.
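
As a concrete illustration of the kind of sample-complexity result emphasized here (the classical distribution-free bound of Blumer, Ehrenfeucht, Haussler and Warmuth, stated in asymptotic form rather than with the survey's own constants): if a class of binary-valued functions has Vapnik-Chervonenkis dimension d, then any algorithm returning a hypothesis consistent with the training sample is probably approximately correct, with accuracy parameter \epsilon and confidence parameter \delta, provided the sample size m satisfies
\[
  m \;=\; O\!\left(\frac{1}{\epsilon}\left(d\,\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)\right),
\]
and, conversely, on the order of $(d + \log(1/\delta))/\epsilon$ examples are necessary for any learning algorithm.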
