Unsupervised versus supervised training of acoustic models

In this paper we reports unsupervised training experiments we have conducted on large amounts of the English Fisher conversational telephone speech. A great amount of work has been reported on unsupervised training, but the major difference of this work is that we compared behaviors of unsupervised training with supervised training on exactly the same data. This comparison reveals surprising results. First, as the amount of training data increases, unsupervised training, even bootstrapped with a very limited amount (1 hour) of manual data, improves recognition performance faster than supervised training does, and it converges to supervised training. Second, bootstrapping unsupervised training with more manual data is not of significance if a large amount of un-transcribed data is available.

[1]  Alexander H. Waibel,et al.  Unsupervised training of a speech recognizer: recent experiments , 1999, EUROSPEECH.

[2]  Richard M. Schwartz,et al.  Unsupervised Training on Large Amounts of Broadcast News Data , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[3]  S. Matsoukas,et al.  Improved speaker adaptation using speaker dependent feature projections , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[4]  Jean-Luc Gauvain,et al.  Lightly supervised and unsupervised acoustic model training , 2002, Comput. Speech Lang..

[5]  Spyridon Matsoukas,et al.  Unsupervised Training on a Large Amount of Arabic Broadcast News Data , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[6]  Jean-Luc Gauvain,et al.  Unsupervised acoustic model training , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[7]  Andreas Stolcke,et al.  Getting More Mileage from Web Text Sources for Conversational Speech Language Modeling using Class-Dependent Mixtures , 2003, NAACL.

[8]  Bing Xiang,et al.  Light supervision in acoustic model training , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[9]  George Zavaliagkos,et al.  Utilizing untranscribed training data to improve perfomance , 1998, LREC.

[10]  John Makhoul,et al.  Using quick transcriptions to improve conversational speech models , 2004, INTERSPEECH.

[11]  Hermann Ney,et al.  An improved method for unsupervised training of LVCSR systems , 2007, INTERSPEECH.

[12]  Ricky Ho Yin Chan,et al.  Improving broadcast news transcription by lightly supervised discriminative training , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.