Local Methods for On-Demand Out-of-Vocabulary Word Retrieval

Most Web-based methods for lexicon augmentation capture global semantic features of the target domain in order to collect relevant documents from the Web. We suggest that the local context of an out-of-vocabulary (OOV) word carries relevant information about that word. Using this information, we propose to use the Web to build locally augmented lexicons, which are then used in a final local decoding pass. First, we propose an automatic Web-based OOV word detection method. Then, we demonstrate the relevance of the Web for OOV word retrieval and propose several methods for retrieving hypothesis words. Finally, using the reference context, we retrieve about 26% of the OOV words with a lexicon increase of fewer than 1000 words.
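As a rough illustration of the local idea, the sketch below takes the words surrounding a suspected OOV position as a query context and then ranks words from retrieved text that are missing from the lexicon. It is only a minimal sketch under stated assumptions, not the method evaluated here: the function names, the window width, the frequency ranking, and the 1000-word cap used as a default are all hypothetical, and the Web search step itself is left out (retrieved snippets are simply passed in as strings).

```python
import re
from collections import Counter


def context_window(tokens, oov_index, width=3):
    """Return the words around a suspected OOV position (position itself excluded).

    `width` is a hypothetical parameter: number of words kept on each side.
    """
    left = tokens[max(0, oov_index - width):oov_index]
    right = tokens[oov_index + 1:oov_index + 1 + width]
    return left + right


def candidate_words(snippets, lexicon, max_candidates=1000):
    """Rank words seen in retrieved snippets that are absent from the lexicon.

    Ranking by raw frequency is an assumption made for illustration only.
    """
    counts = Counter()
    for text in snippets:
        # Alphabetic runs only (Unicode letters), lowercased.
        for word in re.findall(r"[^\W\d_]+", text.lower()):
            if word not in lexicon:
                counts[word] += 1
    return [word for word, _ in counts.most_common(max_candidates)]


if __name__ == "__main__":
    tokens = "the president of <oov> visited paris today".split()
    query = context_window(tokens, 3, width=2)
    print(query)

    lexicon = {"the", "president", "of", "visited", "paris", "today", "news"}
    snippets = ["President of Ruritania visited Paris", "Ruritania news"]
    print(candidate_words(snippets, lexicon))
```

In this toy run the query context is the two words on each side of the OOV slot, and the only word proposed for the local lexicon is the one absent from the baseline vocabulary. A real system would need an actual search back end and a decoding pass restricted to the affected segment.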
