Hierarchical Mixtures of Experts and the EM Algorithm

We present a tree-structured architecture for supervised learning. The statistical model underlying the architecture is a hierarchical mixture model in which both the mixture coefficients and the mixture components are generalized linear models (GLIMs). Learning is treated as a maximum likelihood problem; in particular, we present an Expectation-Maximization (EM) algorithm for adjusting the parameters of the architecture. We also develop an on-line learning algorithm in which the parameters are updated incrementally. Comparative simulation results are presented in the robot dynamics domain.
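To make the E- and M-steps concrete, the following is a minimal illustrative sketch of a one-level mixture of linear-Gaussian experts with a softmax gating network, written with NumPy. The function name, the random initialization, and the gradient-ascent update for the gate (standing in for the IRLS/GLIM fits used in the paper) are all assumptions made for this sketch, not the paper's exact procedure; the hierarchical architecture nests the same E- and M-steps at each level of the tree.

```python
import numpy as np

def fit_mixture_of_experts(X, y, n_experts=3, n_iter=50, seed=0):
    """EM for a one-level mixture of linear-Gaussian experts with a
    softmax gating network.  Illustrative sketch only: the paper fits
    both experts and gate as GLIMs via IRLS; here the gate is updated
    by a few gradient-ascent steps instead."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])                 # inputs with bias column
    W = rng.normal(scale=0.1, size=(n_experts, d + 1))   # expert regression weights
    V = rng.normal(scale=0.1, size=(n_experts, d + 1))   # gating weights
    sigma2 = np.full(n_experts, np.var(y) + 1e-6)        # expert noise variances

    for _ in range(n_iter):
        # E-step: posterior responsibility of each expert for each data point
        logits = Xb @ V.T
        logits -= logits.max(axis=1, keepdims=True)
        g = np.exp(logits); g /= g.sum(axis=1, keepdims=True)      # gate probabilities
        mu = Xb @ W.T                                              # expert means
        loglik = -0.5 * ((y[:, None] - mu) ** 2 / sigma2
                         + np.log(2 * np.pi * sigma2))
        h = g * np.exp(loglik - loglik.max(axis=1, keepdims=True))
        h /= h.sum(axis=1, keepdims=True)                          # responsibilities

        # M-step for the experts: responsibility-weighted least squares
        for j in range(n_experts):
            Hw = h[:, j][:, None] * Xb
            W[j] = np.linalg.solve(Xb.T @ Hw + 1e-6 * np.eye(d + 1), Hw.T @ y)
            resid = y - Xb @ W[j]
            sigma2[j] = (h[:, j] * resid ** 2).sum() / (h[:, j].sum() + 1e-12)

        # M-step for the gate: gradient ascent on the weighted cross-entropy
        # (a stand-in for the IRLS inner loop of the paper)
        for _ in range(10):
            logits = Xb @ V.T
            logits -= logits.max(axis=1, keepdims=True)
            g = np.exp(logits); g /= g.sum(axis=1, keepdims=True)
            V += 0.1 / n * (h - g).T @ Xb

    return W, V, sigma2
```

Calling `fit_mixture_of_experts(X, y)` on a regression data set returns the fitted expert weights, gating weights, and per-expert variances; in the full hierarchical model, each expert above would itself be a gated mixture, and the E-step would multiply gate probabilities down the tree before normalizing.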
