Speech-to-text development for Slovak, a low-resourced language

Developing automatic speech recognition (ASR) systems for low-resourced languages is an important research topic in ASR. This paper reports on the development of a speech-to-text (STT) system targeting broadcast news and broadcast conversation transcription for Slovak, a low-resourced language. Context-dependent acoustic models are trained without any manually transcribed audio data, via cross-language transfer and unsupervised training. In addition, a pronunciation dictionary for Slovak is created using efficient rule-based pronunciation modeling. For language modeling, large N-gram language models are estimated on 63M words of text downloaded from the Internet. The system uses MLP (multilayer perceptron) features imported from English, which are concatenated with cepstral PLP (perceptual linear prediction) and F0 (pitch) features. These techniques were applied to develop a Slovak STT system with performance similar to that obtained by state-of-the-art systems for other languages. Furthermore, we propose reducing the dimension of the MLP+PLP+F0 features from 81 to 50 using principal component analysis (PCA), in order to reduce the redundancy between the MLP and the PLP+F0 features. This feature reduction lowers both the word error rate (WER) and the recognition time, and reduces the CMLLR adaptation time by a factor of 3.
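The PCA step described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the data are synthetic, the helper names `fit_pca` and `apply_pca` are hypothetical, and the split of the 81 dimensions into 39 MLP and 42 PLP+F0 components is an assumption made for illustration (the abstract gives only the total of 81 and the target of 50).

```python
import numpy as np


def fit_pca(features, n_components=50):
    """Estimate a PCA projection from a (frames x dims) feature matrix."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance_ratio = singular_values**2 / np.sum(singular_values**2)
    return mean, vt[:n_components], variance_ratio[:n_components]


def apply_pca(features, mean, components):
    """Project features onto the retained principal directions."""
    return (features - mean) @ components.T


rng = np.random.default_rng(0)
n_frames = 2000

# Synthetic stand-ins for real features (assumed 42 PLP+F0 and 39 MLP dims).
plp_f0 = rng.standard_normal((n_frames, 42))
# Make the "MLP" stream partly redundant with the PLP+F0 stream.
mixing = rng.standard_normal((42, 39))
mlp = plp_f0 @ mixing + 0.1 * rng.standard_normal((n_frames, 39))

combined = np.hstack([mlp, plp_f0])                    # shape (2000, 81)
mean, components, variance_ratio = fit_pca(combined, n_components=50)
reduced = apply_pca(combined, mean, components)        # shape (2000, 50)

print("retained variance fraction:", variance_ratio.sum())
```

In a real system the projection would be estimated once on training data and then applied unchanged to test data. The smaller feature vector also shrinks the CMLLR transform that has to be estimated, since that transform is a square matrix in the feature dimension plus a bias.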
