Heterogeneous lexical units for automatic speech recognition: preliminary investigations

This paper explores the use of phones and syllables as the primary units of representation in the first stage of a two-stage recognizer. A finite-state transducer speech recognizer is configured as a two-stage process: phone or syllable graphs are computed in the first stage and passed to the second stage, which determines the most likely word hypotheses. Preliminary experiments in a weather information speech understanding domain show that a syllable representation with either a bigram or a trigram language model provides more constraint than a phonetic representation with higher-order n-gram language models (up to a 6-gram), and approaches the performance of a more conventional single-stage word-based configuration.
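To make the two-stage idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes the first stage has already produced scored syllable sequences (an n-best list standing in for the syllable graph), and that the second stage maps syllables to words with a small lexicon and a unigram word model. The lexicon entries, syllable labels, probabilities, and the function names `second_stage` and `recognize` are all hypothetical and chosen for illustration only; the actual system uses finite-state transducers rather than this dynamic program.

```python
import math

# Hypothetical toy lexicon: each word maps to its syllable sequence.
LEXICON = {
    "weather": ("weh", "dher"),
    "whether": ("weh", "dher"),
    "boston": ("bao", "stan"),
    "in": ("ihn",),
}

# Hypothetical unigram word log-probabilities (stand-in for a word language model).
WORD_LOGP = {
    "weather": math.log(0.4),
    "whether": math.log(0.1),
    "boston": math.log(0.3),
    "in": math.log(0.2),
}


def second_stage(syllables):
    """Find the best-scoring word sequence that exactly covers `syllables`.

    Dynamic programming over syllable positions: best[j] holds the best
    (log score, word list) for the first j syllables, or (-inf, None) if
    no sequence of lexicon words covers them.
    """
    n = len(syllables)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, [])
    for i in range(n):
        score_i, words_i = best[i]
        if words_i is None:  # position i is unreachable
            continue
        for word, sylls in LEXICON.items():
            j = i + len(sylls)
            if tuple(syllables[i:j]) == sylls:
                score = score_i + WORD_LOGP[word]
                if score > best[j][0]:
                    best[j] = (score, words_i + [word])
    return best[n]


def recognize(first_stage_nbest):
    """Combine first-stage syllable scores with second-stage word scores.

    `first_stage_nbest` is a list of (syllable sequence, log score) pairs.
    Returns (best word list, combined log score), or (None, -inf) if no
    hypothesis can be mapped to words.
    """
    best_words, best_score = None, -math.inf
    for syllables, stage1_score in first_stage_nbest:
        stage2_score, words = second_stage(syllables)
        if words is not None and stage1_score + stage2_score > best_score:
            best_words, best_score = words, stage1_score + stage2_score
    return best_words, best_score


# Usage: the second hypothesis scores higher in stage one but has no
# complete word-level parse, so the first hypothesis is selected.
nbest = [
    (["weh", "dher", "ihn", "bao", "stan"], -2.0),
    (["weh", "dher", "ihn", "bao"], -1.5),
]
words, score = recognize(nbest)
```

In this toy setup the homophones "weather" and "whether" share a syllable sequence, so the choice between them is left entirely to the second-stage word model, which illustrates why word-level constraint is still needed after a sub-word first pass.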
