Automatic segmentation and labeling of speech based on Hidden Markov Models

Abstract An accurate database documentation at phonetic level is very important for speech research: however, manual segmentation and labeling is a time consuming and error prone task. This article describes an automatic procedure for the segmentation of speech: given either the linguistic or the phonetic content of a speech utterance, the system provides phone boundaries. The technique is based on the use of an acoustic-phonetic unit Hidden Markov Model (HMM) recognizer: both the recognizer and the segmentation system have been designed exploiting the DARPA-TIMIT acoustic-phonetic continuous speech database of American English. Segmentation and labeling experiments have been conducted in different conditions to check the reliability of the resulting system. Satisfactory results have been obtained, especially when the system is trained with some manually presegmented material. The size of this material is a crucial factor; system performance has been evaluated with respect to this parameter. It turns out that the system provides 88.3% correct boundary location, given a tolerance of 20 ms, when only 256 phonetically balanced sentences are used for its training.

[1]  Renato De Mori,et al.  Improved connected digit recognition using spectral variation functions , 1992, ICSLP.

[2]  Hsiao-Wuen Hon,et al.  Speaker-independent phone recognition using hidden Markov models , 1989, IEEE Trans. Acoust. Speech Signal Process..

[3]  Michael Riley,et al.  Automatic segmentation and labeling of speech , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[4]  Maurizio Omologo,et al.  A HMM-based system for automatic segmentation and labeling of speech , 1992, ICSLP.

[5]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[6]  Torbjørn Svendsen,et al.  On the automatic segmentation of speech signals , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[7]  Piero Cosi,et al.  A preliminary statistical evaluation of manual and automatic segmentation discrepancies , 1991, EUROSPEECH.