Task Clustering and Gating for Bayesian Multitask Learning

Modeling a collection of similar regression or classification tasks can be improved by letting the tasks 'learn from each other'. In machine learning, this subject is approached through 'multitask learning', where parallel tasks are modeled as multiple outputs of the same network. In multilevel analysis it is generally implemented through the mixed-effects linear model, which distinguishes between 'fixed effects', which are the same for all tasks, and 'random effects', which may vary between tasks. In the present article we adopt a Bayesian approach in which some of the model parameters are shared (the same for all tasks) and others are more loosely connected through a joint prior distribution that can be learned from the data. In this way we seek to combine the best parts of both the statistical multilevel approach and the neural network machinery. The standard assumption expressed in both approaches is that each task can learn equally well from any other task. In this article we extend the model by allowing for more differentiation in the similarities between tasks. One such extension is to make the prior mean depend on higher-level task characteristics. A more unsupervised clustering of tasks is obtained by replacing the single Gaussian prior with a mixture of Gaussians. This can be further generalized to a mixture-of-experts architecture in which the gates depend on the task characteristics. All three extensions are demonstrated on an artificial data set and on two real-world problems, one involving school exam results and the other single-copy newspaper sales.
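
As a rough sketch of the hierarchy described above (the notation below is introduced here only for illustration and does not appear in the abstract): let A_i denote the task-specific parameters of task i and f_i its higher-level task characteristics. The baseline model ties the tasks together through a single Gaussian prior; the first extension makes the prior mean a function of the task characteristics, the second replaces the single Gaussian by a mixture of Gaussians that induces a clustering of tasks, and the third lets the mixture weights act as gates driven by the task characteristics:

\[
A_i \sim \mathcal{N}(m, \Sigma), \qquad
A_i \sim \mathcal{N}(M f_i, \Sigma), \qquad
A_i \sim \sum_{\alpha} q_\alpha \, \mathcal{N}(m_\alpha, \Sigma_\alpha), \qquad
A_i \sim \sum_{\alpha} q_\alpha(f_i) \, \mathcal{N}(m_\alpha, \Sigma_\alpha),
\]

with gates of softmax form, e.g. \( q_\alpha(f_i) \propto \exp(u_\alpha^{\top} f_i) \) normalized so that \( \sum_\alpha q_\alpha(f_i) = 1 \). The hyperparameters (m, M, the mixture means \( m_\alpha \), and the gate weights \( u_\alpha \)) can all be learned from the data, for instance in an empirical Bayes fashion.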
