Paper information - Vocabulary and language model adaptation using information retrieval

Vocabulary and language model adaptation using information retrieval

The goal of vocabulary optimization is to construct a vocabulary containing exactly those words that are most likely to appear in the test data. We present a new approach that reduces the out-of-vocabulary (OOV) rate by adapting the vocabulary during the automatic speech recognition (ASR) process. The same method can also be used to adapt the statistical language model (SLM). After the first pass of the ASR system, an information retrieval system is used to obtain a set of relevant documents, which are then used to generate the new vocabulary and/or the adaptation corpus. We propose a new retrieval method well suited to this purpose. Experiments on French yielded a 28% reduction in the OOV rate. Experiments on English for SLM adaptation yielded a 7.9% reduction in perplexity and a minor improvement in word error rate (WER).
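The pipeline described in the abstract (first-pass output used as a query, retrieval of relevant documents, vocabulary extension from those documents) can be illustrated with a minimal Python sketch. It is not the retrieval method proposed in the paper, which the abstract does not detail: plain TF-IDF cosine ranking is swapped in as a stand-in, and all function names, the whitespace tokenizer, and the `top_k` and `max_new` parameters are assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def tfidf_vectors(docs_tokens):
    # Build one sparse TF-IDF vector (dict) per document, plus the IDF table.
    n = len(docs_tokens)
    df = Counter(w for toks in docs_tokens for w in set(toks))
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    vecs = [{w: tf * idf[w] for w, tf in Counter(toks).items()}
            for toks in docs_tokens]
    return vecs, idf


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_tokens, docs_tokens, top_k):
    # Stand-in retrieval step: rank documents by TF-IDF cosine similarity
    # to the first-pass hypothesis and return the indices of the top_k.
    vecs, idf = tfidf_vectors(docs_tokens)
    q = {w: tf * idf.get(w, 0.0) for w, tf in Counter(query_tokens).items()}
    ranked = sorted(range(len(docs_tokens)),
                    key=lambda i: cosine(q, vecs[i]), reverse=True)
    return ranked[:top_k]


def adapt_vocabulary(base_vocab, retrieved_tokens, max_new):
    # Add the max_new most frequent unseen words from the retrieved documents.
    counts = Counter(w for toks in retrieved_tokens
                     for w in toks if w not in base_vocab)
    return set(base_vocab) | {w for w, _ in counts.most_common(max_new)}


def oov_rate(vocab, tokens):
    # Fraction of running words that are not in the vocabulary.
    return sum(1 for w in tokens if w not in vocab) / len(tokens)
```

In this sketch, the OOV rate would be measured on reference transcripts before and after calling `adapt_vocabulary`; adapting the SLM would additionally require re-estimating the language model on the retrieved documents, which is not shown here.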
