An EM Approach to Learning Sequential Behavior

We consider problems of sequence processing and propose a solution based on a discrete-state model. We introduce a recurrent architecture with a modular structure that allocates subnetworks to discrete states. Different subnetworks model the dynamics (state transitions) and the output of the model, conditional on the previous state and an external input. The model has a statistical interpretation and can be trained by the EM or GEM algorithms, treating state trajectories as missing data. This makes it possible to decouple temporal credit assignment from the actual parameter estimation. The model presents similarities to hidden Markov models, but it maps input sequences to output sequences, using the same processing style as recurrent networks. For this reason we call it an Input/Output HMM (IOHMM). Another remarkable difference is that IOHMMs are trained using a supervised learning paradigm (while potentially taking advantage of the EM algorithm), whereas standard HMMs are trained by an unsupervised EM algorithm (or a supervised criterion with gradient ascent). We also study the problem of learning long-term dependencies with Markovian systems, making comparisons to recurrent networks trained by gradient descent. The analysis reported in this paper shows that Markovian models with fully connected transition graphs generally suffer from a diffusion of temporal credit on long-term dependencies. However, while recurrent networks exhibit a conflict between long-term information storage and trainability, in Markovian models these two requirements are either both satisfied or both unsatisfied. Finally, we demonstrate that EM supervised learning is well suited for solving grammatical inference problems. Experimental results are presented for the seven Tomita grammars, showing that these adaptive models can attain excellent generalization.
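To make the processing style concrete, here is a minimal NumPy sketch of the input-conditional forward recursion that such a model uses to compute the likelihood of an output sequence given an input sequence. This is an illustration, not the paper's implementation: it assumes discrete outputs and treats `transition` and `emission` as black boxes that return the subnetworks' input-conditional probability tables; all names and conventions below are ours.

```python
import numpy as np

def iohmm_log_likelihood(pi0, transition, emission, inputs, outputs):
    """Scaled forward recursion for an IOHMM (illustrative sketch).

    pi0        : (n,) distribution over the initial state x_0.
    transition : u -> (n, n) matrix A(u), A[i, j] = P(x_t = i | x_{t-1} = j, u_t = u),
                 as produced by the state-transition subnetworks.
    emission   : u -> (n, k) matrix B(u), B[i, y] = P(y_t = y | x_t = i, u_t = u),
                 as produced by the output subnetworks.
    inputs     : input symbols u_1 .. u_T.
    outputs    : output symbols y_1 .. y_T.

    Returns log P(y_1..T | u_1..T) and the scaled forward variables; the E-step
    would combine these with a backward pass to obtain the expected state
    trajectories, i.e. the "missing data" filled in by EM.
    """
    T, n = len(inputs), len(pi0)
    alpha = np.zeros((T, n))
    loglik = 0.0
    prev = pi0
    for t in range(T):
        A = transition(inputs[t])               # input-conditional transition matrix
        b = emission(inputs[t])[:, outputs[t]]  # input-conditional output probabilities
        a = (A @ prev) * b                      # unnormalized forward variable
        c = a.sum()                             # scaling factor, P(y_t | y_<t, u_1..t)
        alpha[t] = a / c
        loglik += np.log(c)
        prev = alpha[t]
    return loglik, alpha
```

A full GEM iteration would pair this forward pass with the corresponding backward recursion to compute expected state occupancies, then refit each subnetwork against those expectations, so that temporal credit assignment (the E-step) is decoupled from parameter estimation (the M-step), as the abstract describes.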
