Paper information - Using the web for fast language model construction in minority languages

Using the web for fast language model construction in minority languages

Designing and building a language model for a minority language is a hard task. By minority language, we mean a language with few available resources, especially for statistical learning. In this paper, a new methodology for fast language model construction in minority languages is proposed. It is based on the use of Web resources to collect and build efficient textual corpora. With some filtering techniques, this methodology allows quick and efficient construction of a language model at a small cost in terms of computational and human resources. Our primary experiments have shown excellent performance of the Web language models vs. newspaper language models using the proposed filtering methods on a majority language (French). Following the same approach for a minority language (Vietnamese), a valuable language model was constructed in 3 months with only 15% new development to modify some filtering tools.

Laurent Besacier | Brigitte Bigi | Eric Castelli | Viet Bac Le

[1] L. Lamel. Some Issues in Speech Recognizer Portability, 2003.

[2] Ronald Rosenfeld, et al. A maximum entropy approach to adaptive statistical language modelling, 1996, Comput. Speech Lang.

[3] Dominique Vaufreydaz, et al. From generic to task-oriented speech recognition: French experience in the NESPOLE! European project, 2001.

[4] Dominique Vaufreydaz, et al. A New Methodology for Speech Corpora Definition from Internet Documents, 2000, LREC.

[5] Andreas Stolcke, et al. SRILM - an extensible language modeling toolkit, 2002, INTERSPEECH.

[6] Rayid Ghani, et al. Building Minority Language Corpora by Learning to Generate Web Search Queries, 2003, Knowledge and Information Systems.

[7] Tanja Schultz, et al. Language-independent and language-adaptive acoustic modeling for speech recognition, 2001, Speech Commun.

[8] Kate Knill, et al. Portability of Automatic Speech Recognition Technology to New Languages: Multilinguality Issues and Speech/Text Resources, 2001.

[9] Kiyohiro Shikano, et al. Automatic n-gram language model creation from web resources, 2001, INTERSPEECH.

[10] Vincent Berment. Several Technical Issues for Building New Lexical Bases, 2002.

[11] James C. French, et al. Obtaining language models of web collections using query-based sampling techniques, 2002, Proceedings of the 35th Annual Hawaii International Conference on System Sciences.