Discriminative data selection for lightly supervised training of acoustic model using closed caption texts

We present a novel data selection method for lightly supervised training of acoustic models, which exploits a large amount of data accompanied by closed caption texts rather than faithful transcripts. In the proposed scheme, the word sequence of the closed caption text is aligned with the ASR hypothesis produced by the baseline system. A set of dedicated classifiers is then designed and trained to select, for each segment, the correct word sequence from either source or to reject both. It is demonstrated that the classifiers effectively filter usable data for acoustic model training without tuning any threshold parameters. A significant improvement in ASR accuracy is achieved over the baseline system, as well as over the conventional lightly supervised training method based on simple matching and confidence measure scores.
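To make the scheme concrete, the sketch below illustrates one possible realization of the alignment-and-selection step under assumptions not stated in the abstract: difflib provides the sequence alignment, and a scikit-learn logistic regression over a few toy features (segment lengths and an average ASR confidence) stands in for the paper's dedicated classifiers, which use richer features and training data. It is an illustrative sketch, not the authors' implementation.

```python
# Minimal sketch of the alignment-and-selection idea described in the abstract.
# Assumptions (not from the paper): difflib-based alignment, hand-picked toy
# features, and a scikit-learn logistic regression standing in for the paper's
# dedicated classifiers.
import difflib
from sklearn.linear_model import LogisticRegression

LABELS = {0: "use_caption", 1: "use_asr", 2: "reject"}

def align(caption_words, asr_words):
    """Align the closed-caption word sequence with the ASR hypothesis.

    Yields (caption_segment, asr_segment, tag) triples; matching regions are
    tagged 'equal', mismatched regions carry the differing segments.
    """
    sm = difflib.SequenceMatcher(a=caption_words, b=asr_words, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        yield caption_words[i1:i2], asr_words[j1:j2], tag

def features(cap_seg, asr_seg, asr_conf):
    """Toy feature vector for one mismatched region (illustrative only)."""
    return [
        len(cap_seg),                      # caption-side length
        len(asr_seg),                      # ASR-side length
        abs(len(cap_seg) - len(asr_seg)),  # length mismatch
        asr_conf,                          # average ASR word confidence
    ]

# Hypothetical training data: feature vectors for mismatched regions, labeled
# 0/1/2 as above. In practice these labels would come from manually verified
# segments; the values here are placeholders so the sketch runs end to end.
X_train = [[1, 1, 0, 0.9], [2, 1, 1, 0.3], [3, 0, 3, 0.1], [1, 2, 1, 0.8]]
y_train = [1, 0, 2, 1]
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def select(caption_words, asr_words, asr_conf=0.5):
    """Return the selected word sequence, dropping rejected regions."""
    out = []
    for cap_seg, asr_seg, tag in align(caption_words, asr_words):
        if tag == "equal":                 # both sources agree: keep as-is
            out.extend(cap_seg)
            continue
        label = clf.predict([features(cap_seg, asr_seg, asr_conf)])[0]
        if LABELS[label] == "use_caption":
            out.extend(cap_seg)
        elif LABELS[label] == "use_asr":
            out.extend(asr_seg)
        # "reject": keep neither; the region is excluded from training data
    return out

if __name__ == "__main__":
    caption = "the weather tomorrow will be sunny".split()
    asr_hyp = "the weather tomorrow will be a sunny".split()
    print(select(caption, asr_hyp, asr_conf=0.7))
```

In this toy form the classifier decides per mismatched region; the paper's point is that such trained classifiers remove the need to hand-tune matching or confidence thresholds when filtering data for acoustic model training.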
