DNN-Based Acoustic Modeling for Russian Speech Recognition Using Kaldi

In this paper, we describe a study of DNN-based acoustic modeling for Russian speech recognition. Training and testing of the system were performed using the open-source Kaldi toolkit. We created tanh and p-norm DNNs with varying numbers of hidden layers and, for the tanh DNNs, varying numbers of hidden units. The models were tested on a very large vocabulary continuous Russian speech recognition task. We obtained a relative WER reduction of 20% compared to the baseline GMM-HMM system.
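
To make the reported metric concrete, the following is a minimal sketch of how a relative WER reduction is usually computed: the absolute WER difference divided by the baseline WER. The function name and the WER values are hypothetical and chosen only for illustration; they are not the figures measured in the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical WER values in percent, NOT results from the paper.
    baseline_gmm_hmm_wer = 25.0
    dnn_wer = 20.0
    reduction = relative_wer_reduction(baseline_gmm_hmm_wer, dnn_wer)
    print(f"Relative WER reduction: {reduction:.0%}")
```

Under these assumed values, an absolute drop of 5 percentage points corresponds to a 20% relative reduction, which is why the relative and absolute figures should not be confused.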
