Automatic transcription of general audio data: effect of environment segmentation on phonetic recognition 1

The task of automatically transcribing general audio data is very different from those usually confronted by current automatic speech recognition systems. The general goal of our work is to determine the optimal training strategy for recognizing such data. Specifically, we have studied the effects of different speaking environments on a phonetic recognition task using data collected from a radio news program. We found that if a singlerecognizer is to be used, it is more effective to use a smaller amount of homogeneous, clean data for training. This approach yielded a decrease in phonetic recognition error rate of over 26% over a system trained with an equivalent amount of data which contained a variety of speaking environments. We found that additional gains can be made with a multiple-recognizer system, trained with environment-specific data. Overall, we found that this approach yielded a decrease in error rate of nearly 2%, with some individual speaking environments’ error rate decreasing by over 7%.

[1]  E. A. Martin,et al.  Multi-style training for robust isolated-word speech recognition , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2]  Richard M. Schwartz,et al.  Adaptation to new microphones using tied-mixture normalization , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[3]  David Anthony James,et al.  The Application of Classical Informa - tion Retrieval Techniques to Spoken Documents , 1995 .

[4]  Victor Zue,et al.  Automatic transcription of general audio data: preliminary analyses , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[5]  Francis Kubala,et al.  Modeling Those F-Conditions - Or Not , 1997 .

[6]  Andrew J. Viterbi,et al.  Error bounds for convolutional codes and an asymptotically optimum decoding algorithm , 1967, IEEE Trans. Inf. Theory.

[7]  James R. Glass,et al.  A probabilistic framework for feature-based speech recognition , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.