Segment-based recognition on the PhoneBook task: initial results and observations on duration modeling

This paper describes preliminary recognition experiments on PhoneBook [1], a corpus of isolated, telephone-bandwidth, read words from a large (almost 8,000-word) vocabulary. We have chosen this corpus as a testbed for experiments on the language model-independent parts of a segment-based recognizer. We present results showing that a segment-based recognizer performs well on this task, and that a simple Gaussian mixture phone duration model significantly reduces the error rate. We compare context-independent, stress-dependent, and word position-dependent duration models and obtain relative error rate reductions of up to 12% on the test set. Finally, we make some observations regarding the effects of stress and word position in this isolated-word task and discuss our plans for further research using PhoneBook.
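To make the idea of a Gaussian mixture phone duration model concrete, the following is a minimal sketch and not the recognizer described in the paper. It assumes each phone has a one-dimensional Gaussian mixture over segment duration in seconds, and that the scaled duration log-likelihood is added to a hypothesis score. The function names, the `scale` weight, and all mixture parameters are hypothetical and chosen only for illustration.

```python
import math


def gmm_log_likelihood(duration, components):
    """Log-density of a duration under a 1-D Gaussian mixture.

    components: list of (weight, mean, std) tuples whose weights sum to 1.
    """
    log_terms = []
    for weight, mean, std in components:
        z = (duration - mean) / std
        log_terms.append(
            math.log(weight)
            - 0.5 * z * z
            - math.log(std * math.sqrt(2.0 * math.pi))
        )
    # Log-sum-exp keeps the computation stable for very small densities.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def duration_score(segments, models, scale=1.0):
    """Scaled sum of duration log-likelihoods over (phone, duration) pairs."""
    return scale * sum(
        gmm_log_likelihood(duration, models[phone])
        for phone, duration in segments
    )


# Hypothetical context-independent models: phone -> mixture components.
models = {
    "ae": [(0.6, 0.10, 0.03), (0.4, 0.16, 0.05)],
    "t": [(1.0, 0.06, 0.02)],
}

# Two hypothetical segmentations of the same phone sequence.
plausible = [("ae", 0.11), ("t", 0.06)]
implausible = [("ae", 0.40), ("t", 0.01)]

print(duration_score(plausible, models))
print(duration_score(implausible, models))
```

Under these assumed parameters, the segmentation with typical durations receives the higher score, which is how a duration term can penalize hypotheses with implausibly long or short segments. A stress-dependent or word position-dependent variant would, in this sketch, simply key `models` on a (phone, context) pair instead of the phone alone.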

[1]  Bo Xu,et al.  Towards high performance continuous Mandarin digit string recognition , 2000, INTERSPEECH.

[2]  Stephanie Seneff,et al.  A hierarchical duration model for speech recognition based on the ANGIE framework , 1999, Speech Commun..

[3]  Stephen E. Levinson,et al.  Continuously variable duration hidden Markov models for speech analysis , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4]  James R. Glass,et al.  A probabilistic framework for feature-based speech recognition , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[5]  Hong C. Leung,et al.  PhoneBook: a phonetically-rich isolated-word telephone-speech database , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[6]  Geoffrey Zweig,et al.  Speech Recognition with Dynamic Bayesian Networks , 1998, AAAI/IAAI.

[7]  Mari Ostendorf,et al.  From HMM's to segment models: a unified view of stochastic modeling for speech recognition , 1996, IEEE Trans. Speech Audio Process..

[8]  John Ferdinand Pitrelli.  Hierarchical modeling of phoneme duration: application to speech recognition, 1990.

[9]  James R. Glass,et al.  Real-time telephone-based speech recognition in the Jupiter domain , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99.

[10]  Hervé Bourlard,et al.  Hybrid HMM/ANN systems for training independent tasks: experiments on Phonebook and related improvements , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[11]  Treebank Penn,et al.  Linguistic Data Consortium , 1999 .

[12]  James Glass,et al.  The SUMMIT speech recognition system: phonological modelling and lexical access , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[13]  Jeff A. Bilmes,et al.  Dynamic Bayesian Multinets , 2000, UAI.

[14]  Jeff A. Bilmes,et al.  Hidden-articulator Markov models: performance improvements and robustness to noise , 2000, INTERSPEECH.

[15]  D. Klatt Linguistic uses of segmental duration in English: acoustic and perceptual evidence. , 1976, The Journal of the Acoustical Society of America.