A Monte Carlo EM Approach for Partially Observable Diffusion Processes: Theory and Applications to Neural Networks

We present a Monte Carlo approach for training partially observable diffusion processes and apply it to diffusion networks, a stochastic version of continuous recurrent neural networks. The approach is aimed at learning probability distributions over continuous paths rather than only their expected values. The activation statistics used by the learning rule presented here are inner products in the Hilbert space of square-integrable functions. These inner products can be computed using Hebbian operations and do not require backpropagation of error signals; standard kernel methods could potentially be applied to compute them as well. We propose that the main reason recurrent neural networks have not worked well in engineering applications (e.g., speech recognition) is that they implicitly rely on a very simplistic likelihood model. The diffusion network approach proposed here is much richer in this respect and may open new avenues for applications of recurrent neural networks. We present analysis and simulations to support this view. Very encouraging results were obtained on a visual speech recognition task, in which neural networks outperformed hidden Markov models.
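As a rough illustration of the objects mentioned above, the following Python sketch simulates sample paths of a small stochastic recurrent network and estimates time-integrated inner products between unit activations. It is not the learning rule of the paper. The drift form `-x + W tanh(x)`, the constant noise level `sigma`, the Euler-Maruyama discretization, and the function names `simulate_paths` and `activation_inner_products` are all assumptions made for this example.

```python
import numpy as np

def simulate_paths(W, x0, sigma=0.1, dt=0.01, n_steps=200, n_paths=50, seed=0):
    """Euler-Maruyama sample paths of dX = (-X + W tanh(X)) dt + sigma dB.

    The drift is an assumed, generic continuous-RNN form chosen for
    illustration; it is not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n = len(x0)
    X = np.empty((n_paths, n_steps + 1, n))
    X[:, 0, :] = x0  # every path starts from the same initial state
    for t in range(n_steps):
        x = X[:, t, :]
        drift = -x + np.tanh(x) @ W.T  # drift_i = -x_i + sum_j W_ij tanh(x_j)
        noise = sigma * np.sqrt(dt) * rng.standard_normal((n_paths, n))
        X[:, t + 1, :] = x + drift * dt + noise
    return X

def activation_inner_products(X, dt):
    """Riemann-sum estimate of <z_i, z_j> = integral of z_i(t) z_j(t) dt,
    with z = tanh(x), averaged over the sampled paths.

    Each entry only multiplies the activities of two units over time,
    which is the local, Hebbian-style operation this sketch illustrates.
    """
    Z = np.tanh(X[:, :-1, :])
    return np.einsum("pti,ptj->ij", Z, Z) * dt / X.shape[0]

if __name__ == "__main__":
    W = np.array([[0.0, 1.0], [-1.0, 0.0]])  # hypothetical 2-unit weight matrix
    X = simulate_paths(W, np.array([0.5, -0.5]))
    print(activation_inner_products(X, dt=0.01))
```

Averaging over many simulated paths is what makes the estimate a Monte Carlo one; with `sigma=0` all paths coincide and the result reduces to a deterministic integral.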
