Lightly supervised acoustic model training for imprecisely and asynchronously transcribed speech

In many speech recognition tasks, a large amount of approximate transcription is available for the audio material but is not directly usable for acoustic model training. Whereas roughly time-synchronous closed captions and verbatim audiobook texts are already exploited by lightly supervised techniques, making use of more imperfect and, at the same time, completely unaligned transcriptions is less straightforward. In this paper we describe our experiments on the automated transcription of Hungarian parliamentary speeches. Essentially, a lightly supervised cross-domain acoustic model adaptation/retraining is performed, bootstrapped with a low-resource broadcast news model. Relying on automatic recognition of the parliamentary training speech and on data selection based on dynamic text alignment, a new, task-specific acoustic model is built. For the adaptation to the parliamentary domain, only edited official transcripts and unaligned speech data are used, without any additional human annotation effort. The adapted acoustic model is applied to unseen target speech in real-time recognition. The word accuracy difference between the automatic and the human-produced official transcription is only 5% (both measured against the exact reference text).
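The abstract does not give implementation details, but the "dynamic text alignment based data selection" step can be illustrated with a minimal sketch: the bootstrap recognizer's hypothesis is aligned word-by-word against the edited official transcript via dynamic programming, and only contiguous runs of agreeing words are kept as lightly supervised training material. All function names, the minimum-run threshold, and the selection criterion below are hypothetical illustrations, not the paper's actual method.

```python
def align(hyp, ref):
    """Levenshtein (edit-distance) alignment of hypothesis vs. reference
    word lists; returns (hyp_word-or-None, ref_word-or-None) pairs."""
    n, m = len(hyp), len(ref)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
                d[i][j] == d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            pairs.append((hyp[i - 1], ref[j - 1])); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((hyp[i - 1], None)); i -= 1   # deletion in ref
        else:
            pairs.append((None, ref[j - 1])); j -= 1   # insertion in ref
    return pairs[::-1]

def select_segments(pairs, min_run=2):
    """Keep contiguous runs of exact word matches of length >= min_run;
    these runs (with their audio) would serve as training data."""
    runs, cur = [], []
    for h, r in pairs:
        if h is not None and h == r:
            cur.append(h)
        else:
            if len(cur) >= min_run:
                runs.append(cur)
            cur = []
    if len(cur) >= min_run:
        runs.append(cur)
    return runs
```

In this sketch, regions where the recognizer disagrees with the official transcript (edits, hesitations, recognition errors) are simply discarded, so only audio whose transcription is trusted on both sides enters retraining.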
