On temporal alignment of sentences of natural and synthetic speech

One way to improve the quality of synthetic speech, and to learn about temporal aspects of speech recognition, is to study the problem of time-aligning pairs of spoken sentences. For example, one could evaluate various sets of duration rules for synthesis by comparing the time alignment of speech sounds within synthetic sentences to that of naturally spoken sentences. An improved set of sound duration rules could then be obtained by applying some objective measure to the alignment scores. For speech recognition applications, one could automatically label continuous speech from a hand-marked prototype, yielding models and/or statistical data on sounds within sentences. A key question is whether the time warping methods developed for isolated word recognition can be extended to the problem of time-aligning sentence-length utterances (up to several seconds long). A second key question is how reliable and accurate such an alignment is. In this paper we investigate both questions. It is shown that, with some simple modifications, the dynamic time warping procedures used for isolated word recognition apply almost as well to the alignment of sentence-length utterances. It is also shown that, on average, the uncertainty in the location of significant events within the sentence is much smaller than the event durations, although the largest errors are longer than some event durations. Hence, one must apply caution in using the time alignment contour for synthesis or recognition applications.
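
As a rough illustration of the kind of dynamic time warping alignment discussed above, the following minimal sketch aligns two feature sequences and recovers the warping path. It is not the procedure used in the paper: the function name `dtw_align`, the scalar features, the absolute-difference distance, and the unconstrained diagonal/vertical/horizontal steps are all assumptions made for illustration, and the "simple modifications" mentioned in the abstract are not reproduced here.

```python
from math import inf


def dtw_align(ref, test, dist=lambda a, b: abs(a - b)):
    """Basic DTW (illustrative sketch): return (total_cost, path).

    ref, test: sequences of per-frame features (scalars here, by assumption).
    path: list of (ref_index, test_index) pairs from the first to the last frame.
    """
    n, m = len(ref), len(test)
    # cost[i][j]: minimum accumulated distance aligning ref[:i+1] with test[:j+1]
    cost = [[inf] * m for _ in range(n)]
    # back[i][j]: predecessor cell on the best path into (i, j)
    back = [[None] * m for _ in range(n)]

    for i in range(n):
        for j in range(m):
            d = dist(ref[i], test[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            # Allowed predecessors: diagonal, vertical, and horizontal steps.
            candidates = []
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            best, prev = min(candidates)
            cost[i][j] = d + best
            back[i][j] = prev

    # Trace the alignment contour back from the final cell to the origin.
    path = []
    node = (n - 1, m - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    path.reverse()
    return cost[n - 1][m - 1], path


# Hypothetical example: the test sequence holds its first frame one step longer.
total, path = dtw_align([0, 1, 2, 3], [0, 0, 1, 2, 3])
print(total, path)
```

In a labeling application of the kind the abstract describes, hand-marked event boundaries in the reference sequence could be mapped through `path` to estimate the corresponding frame indices in the other utterance. For sentence-length inputs, the quadratic cost table is one reason a practical system would likely restrict the search region, which this sketch does not do.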
