Paper information - CRF-based stochastic pronunciation modeling for out-of-vocabulary spoken term detection


Out-of-vocabulary (OOV) terms present a significant challenge to spoken term detection (STD). This challenge, to a large extent, lies in the high degree of uncertainty in the pronunciations of OOV terms. In previous work, we presented a stochastic pronunciation modeling (SPM) approach to compensate for this uncertainty. A shortcoming of our original work, however, is that the SPM was based on a joint-multigram model (JMM), which is suboptimal. In this paper, we propose to use conditional random fields (CRFs) for letter-to-sound conversion, which significantly improves the quality of the predicted pronunciations. When applied to OOV STD, we achieve considerable performance improvement with both a 1-best system and an SPM-based system.

Index Terms: speech recognition, spoken term detection, conditional random field, joint multigram model

Dong Wang | Raphaël Troncy | Simon King | Nicholas W. D. Evans
