Bengali speech recognition: A double layered LSTM-RNN approach

Speech recognition may be an intuitive process for humans, but making a computer recognize speech automatically remains a daunting task. Although recent progress in speech recognition has been very promising for other languages, Bengali lacks comparable progress: very few research works on Bengali speech recognizers have been published. In this paper, we investigate a long short-term memory (LSTM) recurrent neural network approach to recognizing individual Bengali words. We divide each word into a number of frames, each containing 13 mel-frequency cepstral coefficients (MFCCs), which provide a useful set of distinctive features. We train a deep LSTM model on these frames to recognize the most plausible phonemes. The final layer of our deep model is a softmax layer with one unit per phoneme, and we pick the most probable phoneme for each time frame. Finally, we pass these phonemes through a filter that produces individual words as output. Our system achieves a word detection error rate of 13.2% and a phoneme detection error rate of 28.7% on the Bangla-Real-Number audio dataset.
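The decoding stage described above — taking the per-frame softmax outputs, picking the most probable phoneme for each frame, and filtering the resulting sequence into a word — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the phoneme inventory, the blank/silence symbol, and the toy frame probabilities are all assumptions made for the example.

```python
# Hypothetical phoneme inventory; "-" is an assumed silence/blank symbol.
PHONEMES = ["e", "k", "d", "u", "i", "-"]

def decode(frame_probs):
    """Argmax over each frame's softmax output, then filter:
    drop blanks and collapse consecutive repeats into one phoneme."""
    best = [PHONEMES[max(range(len(p)), key=p.__getitem__)]
            for p in frame_probs]
    word, prev = [], None
    for ph in best:
        if ph != "-" and ph != prev:
            word.append(ph)
        prev = ph
    return "".join(word)

# Toy softmax outputs for 5 frames (each row sums to 1).
probs = [
    [0.70, 0.10, 0.10, 0.05, 0.03, 0.02],  # most probable: "e"
    [0.60, 0.20, 0.10, 0.05, 0.03, 0.02],  # "e" again -> collapsed
    [0.10, 0.70, 0.10, 0.05, 0.03, 0.02],  # "k"
    [0.02, 0.03, 0.05, 0.10, 0.10, 0.70],  # "-" -> dropped
    [0.70, 0.10, 0.10, 0.05, 0.03, 0.02],  # "e"
]
print(decode(probs))  # prints "eke"
```

Collapsing repeats is the usual way to map many frames onto fewer phonemes when each frame is classified independently; the actual filter used in the paper may differ.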
