A hierarchical duration model for speech recognition based on the ANGIE framework

Abstract This paper presents a hierarchical duration model for improving speech recognition. The model is built on the ANGIE framework, a flexible, unified sublexical representation designed for speech applications, and it captures duration phenomena operating at the phonological, phonemic, syllabic and morphological levels. At the core of the modelling scheme is a hierarchical normalization procedure performed on the ANGIE parse structure, from which we derive a robust measure of speaking rate. The model uses two sets of statistical models: one based on relative duration between sublexical units, and one based on absolute duration normalized with respect to speaking rate. We have used this paradigm to explore several speech timing phenomena, including the secondary effects of speaking rate on relative duration, the characteristics of anomalously slow words, and prepausal lengthening. Finally, we demonstrate the utility of durational information for recognition applications. In phonetic recognition, incorporating our model on top of a standard phone duration model yields a relative improvement of up to 7.7%; in a word spotting task, the figure of merit (FOM) improves from 89.3 to 91.6.
