Automatic time alignment of speech with a phonetic transcription

This paper describes a system for time aligning a phonetic transcription to a speech signal. The phonetic segments are described by broad acoustic parameters, and a dynamic programming algorithm is used for optimizing the alignment of segments to the speech signal. In the present study, only two parameters have been used: the intensity of the speech signal below 400 Hz and the intensity above 500 Hz. It is shown that this very coarse information is enough to give a correct segmentation in most cases. The signals have been differentiated in the time domain, which reduces the effects of using different speakers, of varying signal levels, and of altering filter characteristics of the speech channel. A text-to-speech system is used to transcribe an orthographic representation of the utterances into phonetic segments. A small experiment consisted of 30 sentences spoken by one male speaker. The average sentence length was 8 words. The rule-based transcription was correct for 97% of the segments. The boundaries were judged to be correct within 10 ms, the sampling interval, for 87% of the segments.

* Also presented at the French-Swedish Seminar, Grenoble, April 22-24, 1985.

Introduction

The problem of automatic time alignment of a speech wave to a known phonetic transcription has attracted a lot of attention in recent years. It would facilitate or replace tedious manual labeling and would be a way to make the labeling more consistent. The development of speaker-independent large-vocabulary speech recognition systems requires very large amounts of speech data to get quantitative and qualitative measures of the influences of, e.g., coarticulation, reduction, and stress patterns on the acoustic speech signal. Several hours of speech will be necessary to cover a sufficient amount of phonetic variation and to get reliable statistics of the speech data. The data may also be used for improving speech synthesis rules. A detailed study of alignment errors will reveal difficult phonetic contexts and show where the acoustic-phonetic rules should be improved to better predict the data. A possible application can also be found in foreign language education, where pronunciation deviations from a teacher's voice could be automatically interpreted in phonetic terms and fed back to the pupil.

The alignment procedure itself is an essential part of the verification component in several phonetic speech recognition systems (Blomberg & Elenius, 1974, 1981; Lowerre & Reddy, 1980). It serves the same function as the nonlinear time warping algorithm in standard pattern-matching word recognition systems.

The alignment process can be performed in different ways and at varying levels of automation. Some of them will be described below, starting with the least automated and gradually increasing the level of automation. The first level would be the completely manual way of inspecting various representations of the speech wave and entering boundaries by hand. This method requires a skilled phonetician and a great deal of time. One second of speech may require several minutes of hand labeling (Leung & Zue, 1984).

A higher level has been used by Bridle & Chamberlain (1984). They start by labeling a recorded utterance by hand. By means of dynamic
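The general shape of the dynamic programming alignment summarized in the abstract above can be sketched in a few lines of code. The Python fragment below is not the authors' implementation; it only illustrates a frame-to-segment alignment in which each phonetic segment is characterized by target values of the two broad parameters (low-band and high-band intensity) and each 10 ms frame is assigned to exactly one segment in transcription order. The target vectors, the Euclidean local cost, and the function name align are illustrative assumptions, not details taken from the paper.

    # Hedged sketch, not the authors' system: monotone frame-to-segment
    # alignment by dynamic programming over per-frame broad parameters.
    import numpy as np

    def align(frames: np.ndarray, targets: np.ndarray) -> list[int]:
        """Assign each frame to one segment, in order, minimizing total cost.

        frames  -- (T, 2) array of per-frame parameters (low band, high band).
        targets -- (S, 2) array of assumed target parameters per segment.
        Returns a list of length T with the segment index of every frame.
        """
        T, S = len(frames), len(targets)
        # Local cost: distance between a frame and a segment's target vector.
        local = np.linalg.norm(frames[:, None, :] - targets[None, :, :], axis=2)

        # D[t, s]: best cumulative cost with frame t assigned to segment s.
        # Allowed transitions: stay in segment s, or advance from segment s-1.
        D = np.full((T, S), np.inf)
        back = np.zeros((T, S), dtype=int)      # 0 = stayed, 1 = advanced
        D[0, 0] = local[0, 0]
        for t in range(1, T):
            for s in range(min(t + 1, S)):      # segment s needs >= s+1 frames so far
                stay = D[t - 1, s]
                advance = D[t - 1, s - 1] if s > 0 else np.inf
                if advance < stay:
                    D[t, s], back[t, s] = advance + local[t, s], 1
                else:
                    D[t, s], back[t, s] = stay + local[t, s], 0

        # Backtrack from the last frame of the last segment to get boundaries.
        path, s = [], S - 1
        for t in range(T - 1, -1, -1):
            path.append(s)
            s -= back[t, s]
        return path[::-1]

    # Toy usage: three segments over twelve 10 ms frames.
    targets = np.array([[1.0, 0.1], [0.1, 1.0], [1.0, 0.2]])
    frames = np.array([[1.0, 0.1]] * 4 + [[0.1, 1.0]] * 5 + [[1.0, 0.2]] * 3)
    print(align(frames, targets))   # -> [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2]

In this toy setting the segment boundaries fall out of the backtracking step wherever the segment index advances; the paper's actual parameters (time-differentiated band intensities) and cost function may differ from the Euclidean distance assumed here.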