Some issues affecting the transcription of Hungarian broadcast audio

This paper reports on a speech-to-text (STT) transcription system for Hungarian broadcast audio developed for the 2012 Quaero evaluations. No manually transcribed audio data were provided for model training; only a small amount of development data were available to assess system performance. As a consequence, the acoustic models were developed in an unsupervised manner, the only supervision being provided indirectly by the language model. The language models were trained on texts downloaded from various websites, again without any speech transcripts. This contrasts with other STT systems for Hungarian broadcast audio, which use at least 10 to 50 hours of manually transcribed data for acoustic model training and typically include speech transcripts in the language model training data. Given the mixed results previously reported for morph-based approaches to agglutinative languages such as Hungarian, word-based language models were used. The initial word error rate (WER) of 59.8% on the 3-hour development corpus, obtained with context-independent seed models from other languages, was reduced to 25.0% after successive training iterations and system refinement. The same system obtained a WER of 23.3% on the independent Quaero 2012 evaluation corpus (a mix of broadcast news and broadcast conversation data). These results compare well with previously reported systems on similar data. Various issues affecting system performance are discussed, such as the amount of training data, the acoustic features, and the choice of text sources for language model training.

Index Terms: large vocabulary continuous speech recognition (LVCSR), broadcast news transcription, Hungarian language, unsupervised training, agglutinative languages, bottleneck MLP features
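All of the results above are stated as word error rates. As a minimal illustration of how this metric is defined (word-level edit distance normalized by the reference length; this sketch is for clarity only and is not the scoring tool used in the Quaero evaluation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length,
    computed as a Levenshtein distance over word tokens."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(r)][len(h)] / len(r)
```

For example, `wer("a b c d", "a x c")` is 0.5: one substitution ("b" for "x") plus one deletion ("d"), divided by the four reference words.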
