What Size Net Gives Valid Generalization?

We address the question of when a network can be expected to generalize from m random training examples chosen from some arbitrary probability distribution, assuming that future test examples are drawn from the same distribution. Among our results are the following bounds on appropriate sample vs. network size. Assume 0 < ∊ ≤ 1/8. We show that if m ≥ O((W/∊) log(N/∊)) random examples can be loaded on a feedforward network of linear threshold functions with N nodes and W weights, so that at least a fraction 1 − ∊/2 of the examples are correctly classified, then one has confidence approaching certainty that the network will correctly classify a fraction 1 − ∊ of future test examples drawn from the same distribution. Conversely, for fully-connected feedforward nets with one hidden layer, any learning algorithm using fewer than Ω(W/∊) random training examples will, for some distributions of examples consistent with an appropriate weight choice, fail at least some fixed fraction of the time to find a weight choice that will correctly classify more than a 1 − ∊ fraction of the future test examples.

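To make the upper bound concrete, here is a minimal sketch (in Python) that evaluates a sufficient training-set size m ≈ c · (W/∊) log(N/∊) for a given network. The constant c and the function name sufficient_sample_size are our assumptions for illustration; the big-O in the bound hides an unspecified constant factor, so the absolute numbers are only order-of-magnitude estimates.

```python
import math

def sufficient_sample_size(W, N, eps, c=1.0):
    """Estimate a sufficient number of training examples from the
    bound m >= O((W/eps) * log(N/eps)).

    W   -- number of weights in the network
    N   -- number of nodes in the network
    eps -- target test error rate; the bound assumes 0 < eps <= 1/8
    c   -- hypothetical constant hidden by the big-O (an assumption,
           not a value given in the paper)
    """
    if not (0 < eps <= 1 / 8):
        raise ValueError("the bound assumes 0 < eps <= 1/8")
    return math.ceil(c * (W / eps) * math.log(N / eps))

# Example: a one-hidden-layer net with 100 nodes and 1,000 weights,
# targeting 95% accuracy on future test examples (eps = 0.05):
if __name__ == "__main__":
    m = sufficient_sample_size(W=1_000, N=100, eps=0.05)
    print(m)  # (1000/0.05) * ln(100/0.05) ~ 152,000 examples with c = 1
```

The point of the sketch is the scaling, not the constant: the sample requirement grows roughly linearly in the number of weights W and only logarithmically in the number of nodes N, which is what makes the upper and lower bounds in the abstract nearly tight in W/∊.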