A Model of Inductive Bias Learning

A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small enough to ensure reliable generalization from reasonably sized training sets. Typically such bias is supplied by hand through the skill and insight of experts. In this paper a model for automatically learning bias is investigated. The central assumption of the model is that the learner is embedded within an environment of related learning tasks. Within such an environment the learner can sample from multiple tasks, and hence it can search for a hypothesis space that contains good solutions to many of the problems in the environment. Under certain restrictions on the set of all hypothesis spaces available to the learner, we show that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment. Explicit bounds are also derived demonstrating that learning multiple tasks within an environment of related tasks can potentially give much better generalization than learning a single task.
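To make the setup concrete, the following is a minimal sketch of the bias-learning objective the abstract describes; the notation is illustrative and not necessarily the paper's exact symbols. Suppose the learner is given a family of candidate hypothesis spaces and samples n training tasks from the environment, each with m labelled examples. It then selects the hypothesis space whose best hypotheses achieve the smallest average empirical error across the tasks:

\[
\mathcal{H}^{*} \;=\; \arg\min_{\mathcal{H} \in \mathbb{H}} \; \frac{1}{n} \sum_{i=1}^{n} \; \inf_{h \in \mathcal{H}} \; \frac{1}{m} \sum_{j=1}^{m} \ell\bigl(h(x_{ij}), y_{ij}\bigr),
\]

where \(\mathbb{H}\) denotes the family of hypothesis spaces available to the learner, \((x_{ij}, y_{ij})\) are the m examples of the i-th task, and \(\ell\) is a loss function. Read this way, the abstract's generalization claim is that, provided n and m are large enough relative to a suitable capacity measure of \(\mathbb{H}\), the selected \(\mathcal{H}^{*}\) will, with high probability, also contain good solutions to novel tasks drawn from the same environment.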
