Pronunciation modeling for ASR - knowledge-based and data-derived methods

This paper focuses on modeling pronunciation variation in two different ways: knowledge-based and data-derived. The knowledge-based approach uses phonological rules to generate pronunciation variants. The data-derived approach performs phone recognition and then smooths the output with decision trees (D-trees) to alleviate some of the phone recognition errors. Using phonological rules led to a small improvement in word error rate (WER), while the data-derived approach, in which the phone recognition output was smoothed using D-trees prior to lexicon generation, led to larger improvements over the baseline. The lexicon was employed in two different recognition systems, a hybrid HMM/ANN system and an HMM-based system, to ascertain whether pronunciation variation was truly being modeled. This proved to be the case, as no significant differences were found between the results obtained with the two systems. A comparison of the knowledge-based and data-derived methods showed that 17% of the variants generated by the phonological rules were also found using phone recognition, and that this figure increased to 46% when the phone recognition output was smoothed using D-trees.
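
The overlap figures quoted above (17% and 46%) are ratios of shared variants between two lexica. The following is a minimal sketch of how such a ratio could be computed, not the authors' actual procedure. The function name `variant_overlap`, the lexicon representation (word mapped to a set of phone tuples), and the toy entries are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

# Assumed representation: each word maps to a set of pronunciation
# variants, and each variant is a tuple of phone symbols.
Lexicon = Dict[str, Set[Tuple[str, ...]]]


def variant_overlap(reference: Lexicon, candidate: Lexicon) -> float:
    """Return the fraction of (word, variant) pairs in `reference`
    that also occur in `candidate`."""
    total = 0
    shared = 0
    for word, variants in reference.items():
        found = candidate.get(word, set())
        total += len(variants)
        shared += len(variants & found)
    return shared / total if total else 0.0


# Toy, made-up entries (hypothetical phone symbols, not data from the paper).
rule_lexicon: Lexicon = {
    "amsterdam": {
        ("A", "m", "s", "t", "@", "r", "d", "A", "m"),
        ("A", "m", "s", "@", "r", "d", "A", "m"),
    },
    "delft": {
        ("d", "E", "l", "f", "t"),
        ("d", "E", "l", "@", "f", "t"),
    },
}
data_lexicon: Lexicon = {
    "amsterdam": {("A", "m", "s", "@", "r", "d", "A", "m")},
    "delft": {("d", "E", "l", "f", "t"), ("d", "E", "l", "f")},
}

overlap = variant_overlap(rule_lexicon, data_lexicon)
print(f"{overlap:.0%} of rule-generated variants also appear in the data-derived lexicon")
```

In this sketch the rule-based lexicon is the reference, so the ratio answers "how many rule-generated variants does the data-derived method recover?". Swapping the arguments would instead measure how many data-derived variants the rules predict.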
