Using the EM algorithm to train neural networks: misconceptions and a new algorithm for multiclass classification

The expectation-maximization (EM) algorithm has attracted considerable interest in recent years as the basis for various algorithms in neural network application areas such as pattern recognition. However, there exist some misconceptions concerning its application to neural networks. In this paper, we clarify these misconceptions and consider how the EM algorithm can be adopted to train multilayer perceptron (MLP) and mixture-of-experts (ME) networks for multiclass classification. We identify some situations where the application of the EM algorithm to train MLP networks may be of limited value and discuss some ways of handling the difficulties. For ME networks, it has been reported in the literature that networks trained by the EM algorithm, with the iteratively reweighted least squares (IRLS) algorithm in the inner loop of the M-step, often perform poorly in multiclass classification. However, we found that the convergence of the IRLS algorithm is stable and that the log likelihood increases monotonically when a learning rate smaller than one is adopted. We also propose the use of an expectation-conditional maximization (ECM) algorithm to train ME networks; its performance is demonstrated to be superior to that of the IRLS algorithm on some simulated and real data sets.
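
To make the damping idea concrete, the sketch below shows one damped IRLS (Newton) update for a softmax gating network, as it might appear in the inner loop of the M-step when an ME network is fitted by EM. It is a minimal illustration rather than the authors' implementation: the NumPy code, the block-diagonal Hessian approximation, and names such as `damped_irls_step` and `lr` are assumptions introduced here. The point it reflects is that a learning rate smaller than one damps the Newton step.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class dimension.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def damped_irls_step(X, R, W, lr=0.5, ridge=1e-6):
    """One damped IRLS (Newton) update for a softmax gating network.

    X  : (n, d) inputs (include a bias column if desired)
    R  : (n, K) EM responsibilities playing the role of soft targets
    W  : (d, K) current gating weights
    lr : learning rate; values below one damp the Newton step
    """
    n, d = X.shape
    K = R.shape[1]
    P = softmax(X @ W)                 # current gating probabilities, (n, K)
    G = X.T @ (P - R)                  # gradient of the negative log likelihood, (d, K)
    # Block-diagonal Hessian approximation (a simplification made here):
    # solve one ridge-stabilized weighted least-squares system per class.
    W_new = W.copy()
    for k in range(K):
        s = P[:, k] * (1.0 - P[:, k])  # IRLS weights for class k
        H = X.T @ (X * s[:, None]) + ridge * np.eye(d)
        W_new[:, k] = W[:, k] - lr * np.linalg.solve(H, G[:, k])
    return W_new
```

With `lr=1.0` this reduces to a standard IRLS step; choosing `lr` below one corresponds to the damping that the paper reports makes the inner-loop convergence stable and the log likelihood monotonically increasing.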
