Sichuan dialect speech recognition with deep LSTM network

Because languages vary widely, speech recognition research must build a separate recognition system for each language. Dialects are especially difficult: they contain many special words and colloquial features, and dialect speech data is very scarce. This paper builds a speech recognition system for the Sichuan dialect by combining a hidden Markov model (HMM) with a deep long short-term memory (LSTM) network. We created a Sichuan dialect dataset and implemented an HMM-LSTM speech recognition system for it. Unlike a deep neural network (DNN), which captures context only from a fixed-size window of its input, the LSTM network is not restricted to a fixed amount of context. Moreover, to recognize polyphonic characters and words with special pronunciations in the Sichuan dialect accurately, we collected all the characters in the dataset together with their common phoneme sequences to form a lexicon. The resulting system achieves an 11.34% character error rate on the Sichuan dialect evaluation dataset. To our knowledge, this is the best performance reported on this corpus to date.
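The character error rate quoted above is conventionally defined as the character-level edit distance (substitutions, deletions, and insertions) between the recognized text and the reference transcript, divided by the number of reference characters. The following is a minimal sketch under that standard definition; the scoring tool actually used for the reported figure is not stated here, and the function names `edit_distance` and `character_error_rate` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference characters
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses):
    """Total edit distance divided by total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must align one-to-one")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_errors = sum(
        edit_distance(r, h) for r, h in zip(references, hypotheses)
    )
    return total_errors / total_chars
```

Because Chinese transcripts are scored per character, no word segmentation is needed before calling `character_error_rate`; each string is compared character by character.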
