A Mathematical Theory of Generalization: Part I

The problem of how best to generalize from a given learning set of input-output examples is central to the fields of neural nets, statistics, approximation theory, and artificial intelligence. This series of papers investigates this problem from within an abstract and model-independent framework and then tests some of the resulting concepts in real-world situations. In this abstract framework a generalizer is completely specified by a certain countably infinite set of functions, so the mathematics of generalization becomes an investigation into candidate sets of criteria governing the behavior of that infinite set of functions. In the first paper of this series, the foundations of this mathematics are spelled out and some relatively simple generalization criteria are investigated. Elsewhere the real-world generalizing of systems constructed with these generalization criteria in mind has been favorably compared to neural nets for several real generalization problems, including Sejnowski's problem of reading aloud. This leads to the conclusion that (current) neural nets in fact constitute a poor means of generalizing. In the second of this pair of papers, other sets of criteria, more sophisticated than those embodied in this first paper, are investigated. Generalizers meeting these more sophisticated criteria can readily be approximated on computers. Some of these approximations employ network structures built via an evolutionary process. A preliminary and favorable investigation into the generalization behavior of these approximations finishes the second paper of this series.

Outline of these papers

In section 1 of this paper the topic of generalization is discussed from a very broad perspective. It is argued that it is their ability to generalize that constitutes the primary reason for current interest in neural nets (even though such neural nets in fact generalize poorly on average, as is demonstrated in [13]). This section goes on to discuss the benefits that would come from having a particularly good generalizing algorithm. Section 1 then ends with a detailed outline of the rest of these papers, presented in terms of the preceding discussion of generalization. Here and throughout these papers, generalization is assumed to be taking place without any knowledge of what the variables involved "really mean." An abstract, model-independent formalism is the most rigorous way to deal with this kind of generalizing.

Section 2 of this paper begins with a mathematically precise definition of generalizers. It then goes on to explore some of the more basic properties that can be required of generalizers and elucidates some of the more straightforward mathematical consequences of such requirements. Some of these consequences (e.g., no linear mathematical model should be used to generalize when the underlying system being modelled is known to be geometric in nature) are not intuitively obvious. The paradigm here and throughout these papers is to make general, broad requirements for generalizing behavior, and then see what (if any) mathematical solutions there are for such requirements. This contrasts with the usual (extremely ad hoc) way of dealing with generalizers, which is to take a concrete generalizer and investigate its behavior for assorted test problems.
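As an informal illustration of the definition sketched above, the following is a minimal Python sketch of a generalizer viewed as a countably infinite family of functions, one per training-set size, each mapping a training set plus a "question" input to a guessed output. The nearest-neighbour rule used to instantiate the family is an assumption made purely for illustration and is not a construction taken from the paper.

    # A minimal sketch, assuming the informal description above: a generalizer
    # is a family {g_1, g_2, ...} of functions, where g_m maps a training set
    # of m input-output pairs and a "question" input to a guessed output.
    # The nearest-neighbour rule below is only an illustrative instance.
    from typing import Callable, List, Tuple

    Example = Tuple[float, float]                       # (input, output) pair
    Guesser = Callable[[List[Example], float], float]   # (training set, question) -> guess

    def nearest_neighbour(training_set: List[Example], question: float) -> float:
        """Guess the output of the training example whose input is closest to the question."""
        _, output = min(training_set, key=lambda pair: abs(pair[0] - question))
        return output

    def generalizer(m: int) -> Guesser:
        """Return g_m, the member of the family handling training sets of size m."""
        return nearest_neighbour   # here every g_m happens to be the same rule

    if __name__ == "__main__":
        data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
        g = generalizer(len(data))
        print(g(data, 1.4))   # prints 3.0, the output paired with the nearest input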
In these papers, the behavior defines the architecture, not the other way around. The other way of trying to build an engine which exhibits "good" generalization is, ultimately, dependent to a degree on sheer luck.

Sections 1 and 2 of the second paper present and explore some sets of restrictions on generalization behavior which are more sophisticated than those found in section 2 of the first paper. The first of these new restrictions, explored at length in section 1 of the second paper, is the restriction of "self-guessing" in its various formulations. Intuitively, self-guessing refers to the requirement that if taught with a subset of the full training set, the generalizer should correctly guess the rest of the training set (a minimal sketch of this requirement is given after this outline). One of the more interesting results concerning self-guessing is that it is impossible to construct generalization criteria which, along with self-guessing, specify unique generalization of a learning set. (Any particular set of criteria will always be either under-restrictive or over-restrictive.)

Section 2 of the second paper then discusses the restriction of information compactification, which can be viewed as a mathematically precise statement of Occam's razor. Particular attention is drawn to the fact (and its consequences) that at present there is no known way of making an a priori most reasonable definition of information measure in a model-independent way.

Finally, section 3 of the second paper is a partial investigation of the real-world efficacy of the tools elaborated in the first paper and in the first two sections of the second paper. References [1-3] consist of other such investigations and show that these techniques far outperform backpropagation [4, 5] and in particular easily beat NETtalk [6]. The tests and investigations presented in section 3 of this second paper are intended to be an extension and elaboration of the tests presented in [1-3].

"This then is the measure of a man that from the particulars he can discern the pattern in the greater whole." [U. Merre, from Studies on the Nature of Man]
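As promised above, here is a minimal Python sketch of the simplest, leave-one-out reading of the self-guessing requirement: taught with the training set minus one example, the generalizer should reproduce the held-out output. The paper explores several more careful formulations; this check, and the toy constant guesser used to exercise it, are assumptions made only for illustration.

    # A minimal sketch of a leave-one-out check of self-guessing: for every
    # example held out of the training set, the generalizer (any function from
    # (training set, question) to a guess) should recover the held-out output.
    from typing import Callable, List, Tuple

    Example = Tuple[float, float]
    Guesser = Callable[[List[Example], float], float]

    def self_guesses(guess: Guesser, training_set: List[Example], tol: float = 1e-9) -> bool:
        """Return True if every held-out example is recovered from the remaining examples."""
        for i, (question, target) in enumerate(training_set):
            remainder = training_set[:i] + training_set[i + 1:]
            if abs(guess(remainder, question) - target) > tol:
                return False
        return True

    if __name__ == "__main__":
        # A guesser that repeats the output of the first remaining example
        # self-guesses exactly when all outputs in the training set coincide.
        constant = lambda ts, q: ts[0][1]
        print(self_guesses(constant, [(0.0, 2.0), (1.0, 2.0)]))   # True
        print(self_guesses(constant, [(0.0, 2.0), (1.0, 3.0)]))   # False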

[1] Richard O. Duda et al., Pattern classification and scene analysis, 1974, A Wiley-Interscience publication.

[2] D. Raine, General relativity, 1980, Nature.

[3] Tom M. Mitchell et al., Generalization as Search, 2002.

[4] E. Jaynes, On the rationale of maximum-entropy methods, 1982, Proceedings of the IEEE.

[5] J. J. Hopfield et al., Neural networks and physical systems with emergent collective computational abilities, 1982, Proceedings of the National Academy of Sciences of the United States of America.

[6] C. D. Gelatt et al., Optimization by Simulated Annealing, 1983, Science.

[7] Leslie G. Valiant et al., A theory of the learnable, 1984, STOC '84.

[8] Sompolinsky et al., Spin-glass models of neural networks, 1985, Physical Review A: General Physics.

[9] Geoffrey E. Hinton et al., Learning representations by back-propagating errors, 1986, Nature.

[10] Robert M. Farber et al., How Neural Nets Work, 1987, NIPS.

[11] N. Littlestone, Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm, 1987, 28th Annual Symposium on Foundations of Computer Science (SFCS 1987).

[12] B. McNaughton et al., Hippocampal synaptic enhancement and information storage within a distributed memory system, 1987, Trends in Neurosciences.

[13] Satosi Watanabe, Inductive ambiguity and the limits of artificial intelligence, 1987, Computational Intelligence.

[14] Aviv Bergman et al., Breeding Intelligent Automata, 1987.

[15] Dembo et al., General potential surfaces and neural networks, 1988, Physical Review A: General Physics.

[16] Terrence J. Sejnowski et al., NETtalk: a parallel network that learns to read aloud, 1988.

[17] Gutfreund, Neural networks with hierarchically correlated patterns, 1988, Physical Review A: General Physics.

[18] J. Doyne Farmer et al., Exploiting Chaos to Predict the Future and Reduce Noise, 1989.

[19] Georg Schnitger et al., Relating Boltzmann machines to conventional models of computation, 1987, Neural Networks.

[20] D. Wolpert, Generalization, surface-fitting, and network structures, 1990.

[21] Albrecht Rau et al., Statistical mechanics of neural networks, 1992.

[22] E. Capaldi et al., The organization of behavior, 1992, Journal of Applied Behavior Analysis.