Hierarchical Mixtures of Experts and the EM Algorithm

We present a tree-structured architecture for supervised learning. The statistical model underlying the architecture is a hierarchical mixture model in which both the mixture coefficients and the mixture components are generalized linear models (GLIMs). Learning is treated as a maximum likelihood problem; in particular, we present an expectation-maximization (EM) algorithm for adjusting the parameters of the architecture. We also develop an online learning algorithm in which the parameters are updated incrementally. Comparative simulation results are presented in the robot dynamics domain.
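As a concrete illustration of the EM fitting described above, the following is a minimal sketch of EM for a one-level (non-hierarchical) mixture of experts with linear-Gaussian experts and a softmax gating network; the hierarchical architecture nests this structure recursively. Everything here is an assumption made for illustration, not the paper's reference implementation: the function names, the toy data, the fixed noise variance, and the gradient-based gate update (the paper fits the gating GLIM by iteratively reweighted least squares).

```python
# Minimal EM sketch for a one-level mixture of experts.
# Experts: linear-Gaussian GLIMs. Gate: softmax (multinomial logit) GLIM.
# Illustrative assumptions throughout; not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def em_mixture_of_experts(X, y, n_experts=2, n_iters=50, sigma2=0.1):
    n, d = X.shape
    W = rng.normal(size=(n_experts, d))   # expert (regression) weights
    V = rng.normal(size=(n_experts, d))   # gating weights
    for _ in range(n_iters):
        # E-step: posterior responsibility h[t, i] of expert i for case t,
        # computed in log space for numerical stability.
        g = softmax(X @ V.T)              # gating priors g_i(x)
        mu = X @ W.T                      # expert means
        log_h = np.log(g + 1e-12) - 0.5 * (y[:, None] - mu) ** 2 / sigma2
        log_h -= log_h.max(axis=1, keepdims=True)
        h = np.exp(log_h)
        h /= h.sum(axis=1, keepdims=True)
        # M-step (experts): one weighted least-squares problem per expert,
        # with the responsibilities as case weights.
        for i in range(n_experts):
            Xw = X * h[:, i:i + 1]        # rows of X scaled by h[:, i]
            W[i] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(d), Xw.T @ y)
        # M-step (gate): a few gradient ascent steps on the cross-entropy
        # between responsibilities and gating outputs (the paper uses IRLS;
        # the noise variance sigma2 is held fixed here for simplicity).
        for _ in range(10):
            g = softmax(X @ V.T)
            V += 0.1 * (h - g).T @ X / n
    return W, V

# Toy piecewise-linear data: two regimes split on the sign of x[0].
X = np.column_stack([rng.normal(size=200), np.ones(200)])  # input + bias
y = np.where(X[:, 0] > 0, 2.0 * X[:, 0] + 1.0, -X[:, 0] - 1.0)
W, V = em_mixture_of_experts(X, y)
print("expert weights:\n", W)
```

The point of the decomposition is visible in the M-step: the coupled likelihood splits into independent weighted least-squares fits for the experts and a multinomial-logit fit for the gate, each a standard GLIM problem.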
