On the automatic segmentation of speech signals

For large-vocabulary and continuous speech recognition, the subword-unit-based approach is a viable alternative to the whole-word-unit-based approach. For preparing a large inventory of subword units, automatic segmentation is preferable to manual segmentation because it substantially reduces the work of generating templates and gives more consistent results. In this paper we discuss some methods for automatically segmenting speech into phonetic units. Three approaches are described: one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units, and one based on constrained-clustering vector quantization. An evaluation of the performance of the automatic segmentation methods is given.
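As an illustration of the second approach, the following is a minimal sketch of boundary detection from spectral change. It is not the paper's algorithm: the Euclidean frame distance, the peak-picking rule, the `threshold` parameter, and the function names `spectral_change` and `find_boundaries` are all assumptions chosen for clarity, and the input frames are toy values standing in for real spectral feature vectors.

```python
import math


def spectral_change(frames):
    """Euclidean distance between each pair of consecutive feature frames."""
    return [math.dist(a, b) for a, b in zip(frames, frames[1:])]


def find_boundaries(frames, threshold):
    """Return frame indices at which a new segment is assumed to start.

    A boundary is placed before frame i + 1 when the change between
    frames i and i + 1 exceeds `threshold` (an assumed tuning parameter)
    and is a local peak of the change curve.
    """
    change = spectral_change(frames)
    boundaries = []
    for i, value in enumerate(change):
        left = change[i - 1] if i > 0 else float("-inf")
        right = change[i + 1] if i + 1 < len(change) else float("-inf")
        if value > threshold and value >= left and value > right:
            boundaries.append(i + 1)
    return boundaries


# Toy "spectra": three steady regions with abrupt changes between them.
frames = [[0.0, 0.0]] * 3 + [[1.0, 1.0]] * 3 + [[0.0, 2.0]] * 2
print(find_boundaries(frames, threshold=0.5))  # prints [3, 6]
```

In practice the frames would be short-time spectral or cepstral vectors, and the distance measure and threshold would have to be tuned to the data; the sketch only shows how peaks in frame-to-frame change can be mapped to candidate segment boundaries.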
