Deep Neural Network Based Continuous Speech Recognition for Serbian Using the Kaldi Toolkit

This paper presents a deep neural network (DNN) based large-vocabulary continuous speech recognition (LVCSR) system for Serbian, developed with the open-source Kaldi speech recognition toolkit. The DNNs are initialized with stacked restricted Boltzmann machines (RBMs) and trained by standard error backpropagation with a cross-entropy objective to provide posterior probability estimates for hidden Markov model (HMM) states; in the baseline system, the emission densities of the HMM states are modeled by Gaussian mixture models (GMMs). The standard Kaldi recipes were adapted to the particularities of the Serbian language to obtain the best results. A corpus of approximately 90 hours of speech (21,000 utterances) was used for training. Performance is compared on two different test sets between the baseline GMM-HMM system and several DNN configurations.
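To make the training objective concrete, the following is a minimal NumPy sketch (not the paper's Kaldi code) of one cross-entropy backpropagation step for a small feed-forward network whose softmax output plays the role of per-frame posterior probabilities over HMM states. The layer sizes, learning rate, and synthetic data are illustrative assumptions; RBM pretraining of the weights is omitted here and the weights are simply initialized randomly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_hidden, n_states = 40, 64, 10    # illustrative sizes, not from the paper

# Randomly initialized weights (the paper instead initializes from stacked RBMs)
W1 = rng.normal(0.0, 0.1, (n_feat, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.1, (n_hidden, n_states)); b2 = np.zeros(n_states)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)       # hidden activations (ReLU)
    return h, softmax(h @ W2 + b2)         # state posteriors sum to 1 per frame

# Synthetic minibatch: acoustic feature vectors and frame-level state labels
B = 32
x = rng.normal(size=(B, n_feat))
y = rng.integers(0, n_states, size=B)

h, p = forward(x)
loss = -np.log(p[np.arange(B), y]).mean()  # cross-entropy objective

# Backpropagation: gradient of cross-entropy w.r.t. softmax input is (p - onehot)
d_out = p.copy()
d_out[np.arange(B), y] -= 1.0
d_out /= B
dW2 = h.T @ d_out;            db2 = d_out.sum(axis=0)
d_h = (d_out @ W2.T) * (h > 0)             # ReLU derivative gates the gradient
dW1 = x.T @ d_h;              db1 = d_h.sum(axis=0)

lr = 0.1                                    # assumed learning rate
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1

_, p_new = forward(x)
new_loss = -np.log(p_new[np.arange(B), y]).mean()
```

In a full hybrid DNN-HMM system these posteriors are divided by the state priors to obtain scaled likelihoods for decoding; this sketch only shows the cross-entropy gradient step itself.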
