Paper Information - Hierarchical Duration Modelling for a Speech Recognition System

Hierarchical Duration Modelling for a Speech Recognition System

Durational patterns of phonetic segments and pauses convey information about the linguistic content of an utterance. Most speech recognition systems grossly underutilize the knowledge provided by durational cues due to the vast array of factors that influence speech timing and the complexity with which they interact. In this thesis, we introduce a duration model based on the Angie framework. Angie is a paradigm which captures morpho-phonemic and phonological phenomena under a unified hierarchical structure. Sublexical parse trees provided by Angie are well-suited for constructing complex statistical models to account for durational patterns that are functions of effects at various linguistic levels. By constructing models for all the sublexical nodes of a parse tree, we implicitly model duration phenomena at these linguistic levels simultaneously, and subsequently account for a vast array of contextual variables affecting duration from the phone level up to the word level. This thesis will describe our development of a durational model, and will demonstrate its utility in a series of experiments conducted in the Atis domain. The aim is to characterize phenomena such as speaking rate variability and prepausal lengthening in a quantitative manner. The duration model has been incorporated into a phonetic recognizer and a wordspotting system. We will report on the resulting improvement in performance.

In this duration model, a strategy has been formulated in which node durations in upper layers are successively normalized by their respective realizations in the layers below; that is, given a nonterminal node, individual probability distributions, corresponding with each different realization in the layer immediately below, are all scaled to have the same mean. This reduces the variance at each node, and enables the sharing of statistical distributions.
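The normalization step described above can be illustrated with a minimal single-layer sketch. It is not the thesis implementation: the function name `normalize_by_realization`, the realization labels, the toy durations, and the choice of the pooled mean as the common target are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by_realization(samples):
    """Rescale durations so every realization of a node shares one mean.

    samples: list of (realization, duration_ms) pairs for a single
    nonterminal node. The common target mean is taken to be the pooled
    mean of all samples, which is an assumption of this sketch.
    """
    by_realization = defaultdict(list)
    for realization, duration in samples:
        by_realization[realization].append(duration)

    target_mean = mean([d for _, d in samples])
    # One scale factor per realization: target mean over that realization's mean.
    scale = {r: target_mean / mean(ds) for r, ds in by_realization.items()}
    return [(r, d * scale[r]) for r, d in samples]


# Invented toy durations (ms) for one node with two hypothetical realizations.
samples = [
    ("onset nucleus", 180.0),
    ("onset nucleus", 220.0),
    ("onset nucleus coda", 280.0),
    ("onset nucleus coda", 320.0),
]

normalized = normalize_by_realization(samples)
raw_sd = pstdev([d for _, d in samples])
normalized_sd = pstdev([d for _, d in normalized])
print(f"std dev before: {raw_sd:.1f} ms, after: {normalized_sd:.1f} ms")
```

After scaling, both realizations have the same mean, so the pooled spread at the node reflects only within-realization variation. This is the sense in which a single distribution could then be shared across realizations.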
Upon normalization, a set of relative duration models is constructed by measuring the percentage of each parent node's duration that a node occupies. Under this normalization scheme, the normalized duration of a word node is independent of the inherent durations of its descendants and hence is an indicator of speaking rate. A speaking rate parameter can be defined as a ratio of the normalized word duration over the global average normalized word duration. This speaking rate parameter is then used to construct absolute duration models that are normalized by speaking rate. This is done by scaling either absolute phone or phoneme duration by the above parameter. By combining the hierarchical normalization and speaking rate normalization, the average standard deviation for phoneme duration was reduced from 50 ms to 33 ms.

Using the hierarchical structure, we have conducted a series of experiments investigating speech timing phenomena. We are specifically interested in (1) examining secondary effects of speaking rate, (2) characterizing the effects of prepausal lengthening and (3) detecting other word boundary effects associated with duration such as gemination. For example, we have found, with statistical significance, that a suffix within a word is affected far more by speaking rate than is a prefix. It is also observed that prepausal lengthening affects the various sublexical units non-uniformly. For example, the stressed nucleus in the syllable tends to be lengthened more than the onset position.

The final duration model has been implemented into the Angie phonetic recognizer. In addition to contextual effects captured by the model at various sublexical levels, the scoring mechanism also accounts explicitly for two inter-word level phenomena, namely, prepausal lengthening and gemination. Our experiments have been conducted under increasing levels of linguistic constraint with correspondingly different baseline performances.
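As a rough illustration of the speaking rate parameter defined earlier in this passage (normalized word duration over the global average normalized word duration), the following sketch computes the ratio and applies it to a phone duration. The function names and numbers are hypothetical. The abstract says only that durations are scaled by the parameter, so dividing by it is an assumption about the direction of the scaling.

```python
def speaking_rate(normalized_word_ms, global_mean_normalized_word_ms):
    """Ratio of one word's normalized duration to the global average.

    Values above 1 indicate a word spoken more slowly than average.
    """
    return normalized_word_ms / global_mean_normalized_word_ms


def rate_normalized_duration(duration_ms, rate):
    """Remove the speaking-rate effect from an absolute duration.

    Assumption: the duration is divided by the rate parameter, so slow
    speech (rate > 1) is shortened toward an average-rate value.
    """
    return duration_ms / rate


# Invented values: a word whose normalized duration is 360 ms,
# against a global average of 300 ms.
rate = speaking_rate(360.0, 300.0)
phone_ms = rate_normalized_duration(96.0, rate)
print(f"rate = {rate:.2f}, rate-normalized phone duration = {phone_ms:.1f} ms")
```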
The improved performance is obtained by providing implicit lexical knowledge during recognition. When maximal linguistic constraint is imposed, the incorporation of the relative and speaking-rate-normalized absolute phoneme duration scores reduced the phonetic error rate from 29.7% to 27.4%, a relative reduction of 7.7%. These gains are over and above any gains realized from standard phone duration models present in the baseline system.

As a first step towards demonstrating the benefit of duration modelling for full word recognition, we have conducted a preliminary study using duration as a post-processor in a word-spotting task. We have simplified the task of spotting city names in the Atis domain by choosing a pair of highly confusable keywords, "New York" and "Newark." All tokens initially spotted as "New York" are passed to a post-processor, which reconsiders those words and makes a final decision, with the duration component incorporated. For this task, the duration post-processor reduced the number of confusions from 60 to 19 tokens out of a total of 323 tokens, a 68% reduction of error. In another experiment, the duration model is fully integrated into an Angie-based wordspotting system. As in our phonetic recognition experiments, results were obtained at varying degrees of linguistic constraint. Here, when maximum constraint is imposed, the duration model improved performance from 89.3 to 91.6 (FOM), a relative improvement of 21.5%.

This research has demonstrated success in employing a complex statistical duration model in order to improve speech recognition performance. It has shown that duration can play an important role in aiding word recognition and promises to offer greater gains for continuous word recognition.

Thesis Supervisor: Stephanie Seneff
Title: Principal Research Scientist
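The relative figures quoted in the abstract can be reproduced with simple arithmetic, as the short check below suggests. The helper name `relative_reduction` is hypothetical. Reading the 21.5% FOM figure as a reduction of the remaining gap to a perfect score of 100 is an inference from the numbers, not something the abstract states.

```python
def relative_reduction(before, after):
    """Fraction of the original quantity removed: (before - after) / before."""
    return (before - after) / before


# Phonetic error rate: 29.7% down to 27.4%.
phonetic = relative_reduction(29.7, 27.4)

# "New York" / "Newark" confusions: 60 tokens down to 19.
confusions = relative_reduction(60, 19)

# Word-spotting FOM: 89.3 up to 91.6. Assumption: the improvement is
# measured on the residual (100 - FOM), i.e. 10.7 down to 8.4.
fom = relative_reduction(100 - 89.3, 100 - 91.6)

print(f"phonetic error: {phonetic:.1%}")
print(f"confusions:     {confusions:.1%}")
print(f"FOM residual:   {fom:.1%}")
```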

Goopeel Chung
