Online phoneme recognition using multi-layer perceptron networks combined with recurrent non-linear autoregressive neural networks with exogenous inputs

Abstract Off-line pattern recognition in speech signals is already a complex task, and it becomes harder when the result is required online, in real time. The present work proposes online identification of Portuguese phonemes using a non-linear autoregressive model with exogenous inputs, commonly called NARX. The process first conditions the input speech signal and extracts its frequency characteristics. It then pre-classifies the extracted features into one of the ten phoneme groups of the Portuguese language, using a multilayer perceptron (MLP) network trained with supervised learning. Next, the MLP output vector, together with the vector of input frequency features, feeds a NARX neural network through a tapped delay line of four time steps and through recurrent feedback links that carry the outputs of all hidden layers of the network. This combination improves the accuracy of online identification of spoken Portuguese phonemes during natural conversation. When the input signal is well conditioned and continuous over time, the proposed process can provide the correct classification in real time with an acceptable accuracy rate.
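The two-stage data flow described in the abstract can be sketched as follows. This is only an illustrative forward pass under stated assumptions, not the authors' implementation: the feature size, hidden-layer sizes, and number of phoneme classes are hypothetical, the weights are random rather than trained, and the NARX stage is simplified to a single hidden layer whose activations are fed back. Only the ten phoneme groups and the delay of four steps come from the text.

```python
import numpy as np
from collections import deque

# Values taken from the abstract
N_GROUPS = 10      # ten phoneme groups produced by the MLP pre-classifier
N_DELAYS = 4       # tapped delay line of four time steps

# Hypothetical sizes, chosen only for illustration
N_FEATURES = 16    # length of the frequency feature vector per frame
N_HIDDEN = 24      # hidden units in the NARX stage
N_PHONEMES = 30    # number of phoneme classes at the output

rng = np.random.default_rng(0)


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


class MLPPreClassifier:
    """One-hidden-layer MLP mapping frame features to phoneme-group scores."""

    def __init__(self, n_in, n_hidden, n_out):
        # Random, untrained weights (assumption: training is out of scope here)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = np.tanh(self.W1 @ x + self.b1)
        return softmax(self.W2 @ h + self.b2)


class NARXRecognizer:
    """Simplified NARX stage: delayed exogenous inputs plus delayed
    hidden-layer feedback feed one hidden layer and a softmax output."""

    def __init__(self, n_exo, n_hidden, n_out, n_delays):
        # Delay lines start filled with zeros (no history yet)
        self.exo_taps = deque([np.zeros(n_exo) for _ in range(n_delays)],
                              maxlen=n_delays)
        self.fb_taps = deque([np.zeros(n_hidden) for _ in range(n_delays)],
                             maxlen=n_delays)
        n_in = n_delays * (n_exo + n_hidden)
        self.W_h = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b_h = np.zeros(n_hidden)
        self.W_o = rng.normal(0.0, 0.1, (n_out, n_hidden))
        self.b_o = np.zeros(n_out)

    def step(self, u):
        """Consume one exogenous input vector and return class probabilities."""
        self.exo_taps.append(u)                  # oldest tap is dropped
        z = np.concatenate([*self.exo_taps, *self.fb_taps])
        h = np.tanh(self.W_h @ z + self.b_h)
        self.fb_taps.append(h)                   # feed hidden output back
        return softmax(self.W_o @ h + self.b_o)


mlp = MLPPreClassifier(N_FEATURES, 12, N_GROUPS)
narx = NARXRecognizer(N_FEATURES + N_GROUPS, N_HIDDEN, N_PHONEMES, N_DELAYS)

# Stand-in for conditioned speech frames (random data, not real features)
frames = rng.normal(size=(6, N_FEATURES))
for x in frames:
    groups = mlp.forward(x)                          # stage 1: group scores
    probs = narx.step(np.concatenate([x, groups]))   # stage 2: phoneme scores
    print(int(np.argmax(probs)))
```

Because each call to `step` needs only the current frame and the two fixed-length delay lines, the per-frame cost is constant, which is the property that makes this kind of structure usable frame by frame in an online setting.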
