Automatic labelling of speech using synthesis-by-rule and non-linear time-alignment

In automatic speech recognition and in speech synthesis, the central problem is the relationship between a stream of symbols and a sound pattern. There is general agreement about the need for a symbolic level intermediate between the sound pattern and conventionally spelt text (at least for English), but there are many suggested levels and alternative schemes. In our research, aimed primarily at improved automatic speech recognition, we have chosen to adopt the symbolic descriptions which are used in a particular speech synthesis-by-rule system. The main advantage of this choice is the existence of a well-defined relationship between symbol strings and sound pattern (i.e. the synthesis rules themselves). We intend to analyse natural speech to determine this relationship for particular utterances, and to understand how it varies in natural speech, particularly for different speakers. A by-product of this work would be an improved synthesis system. To produce a speech recognition system, it will be necessary to invert the relationship. In this paper we are concerned with the first step towards a solution of this problem: the description of natural speech patterns in terms of synthesis-by-rule segments. As well as being a step towards our long-term goal, we expect the technique described below to be useful as an aid in automating the labelling of speech databases.

Other automatic labelling schemes have been suggested, using different symbols and for different purposes. Wagner [1] proposes a method for labelling speech databases in which the labelling is carried out in two stages. A segmentation algorithm first forms acoustic segments, which may be voiced, unvoiced or silent. A coarse labelling then associates substrings of the phoneme string with the acoustic segments, using a dynamic programming algorithm. The detailed (frame) labelling uses another dynamic programming algorithm to associate each time frame with one phoneme, based on derivatives of energy and formant functions.
Our method makes use of a program, ZIP [2], which determines the corresponding times in two utterances of the same text; the utterances can be of any length. This time-alignment algorithm is a modification of the Dynamic Programming techniques developed for speech recognition. From a transcript (consisting of a sequence of phonetic symbols) of a natural utterance, a synthetic version is made using a synthesis-by-rule algorithm. After aligning the timescales of the natural and synthetic versions using ZIP, we transfer to the natural speech the segmental information which underlies the synthetic version.
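The align-and-transfer step described above can be sketched as follows. This is a minimal illustration only: it uses a textbook dynamic-programming alignment over hypothetical one-dimensional frame features, not the actual ZIP program, whose local path constraints and distance measure are not reproduced here; the frame values and labels are invented for the example.

```python
def dtw_path(ref, test, dist=lambda a, b: abs(a - b)):
    """Dynamic-programming time alignment of two frame sequences.

    Returns the optimal warping path as (ref_index, test_index) pairs,
    using the classic symmetric recursion with unit steps.
    """
    n, m = len(ref), len(test)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(ref[i - 1], test[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Trace back the minimum-cost path from the end of both utterances.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
        if step == D[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == D[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return list(reversed(path))


def transfer_labels(synth_labels, path, n_natural):
    """Give each natural-speech frame the segment label of the
    synthetic frame it is aligned with on the warping path."""
    labels = [None] * n_natural
    for syn_i, nat_j in path:
        labels[nat_j] = synth_labels[syn_i]
    return labels


# Invented example: a short synthetic utterance with per-frame segment
# labels, aligned against a natural utterance with a different timescale.
synth = [1.0, 1.0, 5.0, 5.0, 9.0]
synth_labels = ["s", "s", "iy", "iy", "t"]
natural = [1.1, 4.8, 5.2, 5.1, 8.9, 9.0]

path = dtw_path(synth, natural)
natural_labels = transfer_labels(synth_labels, path, len(natural))
print(natural_labels)  # -> ['s', 'iy', 'iy', 'iy', 't', 't']
```

Because the synthetic version is generated by rule from the transcript, its segment boundaries are known exactly; the warping path then carries those boundaries onto the natural utterance's timescale, which is the basis of the labelling technique.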