An improved method for unsupervised training of LVCSR systems

In this paper, we introduce an improved method for unsupervised training in which data selection (filtering) is performed at the state level. We describe the experimental setup in detail and introduce confidence scores at the word level and the allophone state level, which are used to select data for mixture training at the state level. Although we use only a relatively small amount of untranscribed data, 180 hours of recordings in addition to the 100 hours of carefully manually transcribed recordings already available, we significantly improve our final speaker-adaptive acoustic model. Furthermore, we present promising results from system combination of acoustic models trained with different confidence thresholds. The methods are evaluated on the EPPS corpus, starting from the RWTH European English parliamentary speech transcription system. A significant relative improvement of 7% is achieved with less data for unsupervised training than conventional systems require.
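The idea of state-level data selection can be illustrated with a minimal sketch. It assumes each frame of an automatically transcribed recording already carries an allophone state label and a confidence score in [0, 1]. The names `Frame` and `select_frames` and the threshold value are hypothetical, chosen for illustration only, and the computation of the confidence scores themselves is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """One aligned frame (hypothetical structure, for illustration only)."""
    state_id: int      # allophone state label from the automatic alignment
    confidence: float  # state-level confidence score, assumed to lie in [0, 1]


def select_frames(frames: List[Frame], threshold: float) -> List[Frame]:
    """Keep only the frames whose state confidence reaches the threshold."""
    return [f for f in frames if f.confidence >= threshold]


# Toy data: four frames with made-up confidence scores.
frames = [Frame(12, 0.95), Frame(12, 0.40), Frame(7, 0.80), Frame(7, 0.65)]

# An illustrative threshold; varying it yields different training subsets.
selected = select_frames(frames, threshold=0.7)
```

Filtering per frame rather than per word or per utterance means a single low-confidence state does not discard an entire otherwise reliable segment, and different thresholds yield different training subsets, which is what makes combining models trained at several thresholds possible.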
