Feature-based Pronunciation Modeling for Speech Recognition

We present an approach to pronunciation modeling in which the evolution of multiple linguistic feature streams is explicitly represented. This differs from phone-based models in that pronunciation variation is viewed as the result of feature asynchrony and changes in feature values, rather than phone substitutions, insertions, and deletions. We have implemented a flexible feature-based pronunciation model using dynamic Bayesian networks. In this paper, we describe our approach and report on a pilot experiment using phonetic transcriptions of utterances from the Switchboard corpus. The experimental results, as well as the model's qualitative behavior, suggest that this is a promising way of accounting for the types of pronunciation variation often seen in spontaneous speech.
