Improving Acoustic Models for Russian Spontaneous Speech Recognition

This paper investigates ways to improve acoustic models for Russian spontaneous speech recognition. We applied the main steps of the Kaldi Switchboard recipe to a Russian dataset but obtained low accuracy compared with the results reported for English spontaneous telephone speech. We found two methods to be especially useful for Russian spontaneous speech: i-vector-based deep neural network adaptation and speaker-dependent bottleneck features, which provide 8.6% and 11.9% relative word error rate reduction over the baseline system, respectively.
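
To make the first of these methods concrete, below is a minimal Python/NumPy sketch (not the authors' code) of the standard i-vector input-augmentation scheme for DNN adaptation: a fixed-length per-speaker i-vector is appended to every acoustic feature frame, giving the network a speaker descriptor it can use to normalize its internal representations. The dimensions FEAT_DIM and IVEC_DIM, the helper name append_ivector, and the placeholder data are illustrative assumptions, not values from the paper.

    # Sketch of i-vector based DNN input adaptation: the per-speaker
    # i-vector is concatenated to each acoustic feature frame.
    # All dimensions below are assumed for illustration.
    import numpy as np

    FEAT_DIM = 40    # e.g. log-mel filterbank features (assumed)
    IVEC_DIM = 100   # i-vector dimensionality (assumed)

    def append_ivector(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
        """Tile the speaker i-vector and concatenate it to each frame.

        frames:  (num_frames, FEAT_DIM) acoustic features for one utterance
        ivector: (IVEC_DIM,) fixed-length speaker embedding
        returns: (num_frames, FEAT_DIM + IVEC_DIM) adapted DNN input
        """
        tiled = np.tile(ivector, (frames.shape[0], 1))
        return np.concatenate([frames, tiled], axis=1)

    # Usage: all utterances from one speaker share the same i-vector.
    utt = np.random.randn(300, FEAT_DIM)       # 300 frames, placeholder data
    spk_ivec = np.random.randn(IVEC_DIM)
    dnn_input = append_ivector(utt, spk_ivec)  # shape (300, 140)

Because the i-vector is constant across all frames of a speaker, the augmented input lets a single speaker-independent DNN behave as if lightly adapted to each speaker, without retraining per-speaker parameters.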
