Lithuanian Broadcast Speech Transcription Using Semi-supervised Acoustic Model Training

Abstract: This paper reports on experimental work to build a speech transcription system for Lithuanian broadcast data, relying on unsupervised and semi-supervised training methods, as well as other low-knowledge methods, to compensate for missing resources. Unsupervised acoustic model training is investigated using 360 hours of untranscribed speech data. A graphemic pronunciation approach is used to simplify pronunciation model generation and therefore ease language model adaptation for system users. Discriminative training on top of semi-supervised training is also investigated, as are various types of acoustic features and their combinations. Experimental results are provided for each development step, along with contrastive results comparing various options. With the best system configuration, a word error rate of 18.3% is obtained on a set of development data from the Quaero program.
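To illustrate why a graphemic approach simplifies pronunciation model generation, the following is a minimal sketch in which each word's "pronunciation" is simply its sequence of letters, so new vocabulary can be added without a hand-written phonetic dictionary. The function names, the lowercasing, and the one-unit-per-letter mapping are assumptions made for illustration; the abstract does not describe the grapheme inventory or any special-case rules used in the actual system.

```python
import unicodedata
from typing import Dict, Iterable, List


def graphemic_pronunciation(word: str) -> List[str]:
    """Return one unit per letter of the word (hypothetical mapping).

    The word is normalized to NFC so that letters such as 'č' or 'ū'
    are single code points, then lowercased; non-letters are dropped.
    """
    normalized = unicodedata.normalize("NFC", word).lower()
    return [ch for ch in normalized if ch.isalpha()]


def build_lexicon(words: Iterable[str]) -> Dict[str, List[str]]:
    """Build a word -> grapheme-sequence lexicon for a vocabulary."""
    return {w: graphemic_pronunciation(w) for w in sorted(set(words))}


if __name__ == "__main__":
    # Example vocabulary; any new word can be added the same way.
    for word, units in build_lexicon(["Vilnius", "šalis", "ačiū"]).items():
        print(word, " ".join(units))
```

Because the lexicon is derived mechanically from spelling, extending the language model vocabulary only requires the new word list, which is the ease-of-adaptation benefit the abstract refers to.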
