N-gram models for language detection

In this document we report a set of experiments using n-gram language models for automatic language detection of text. We begin with a brief explanation of the concepts and mathematics behind n-gram language models and discuss some applications and domains in which they are widely used, together with an overview of related work in language detection. We then describe the resources used in the experiments, namely a subset of the Europarl corpus and the SRILM toolkit, and walk through a toy experiment that explains our methodology in detail. Finally, we evaluate the performance of different language models and parameters using a precision measure based on the perplexity of a text according to each model. We conclude that n-gram models are indeed a simple and efficient tool for automatic language detection.
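The detection criterion described above, scoring a text under one model per language and choosing the language whose model yields the lowest perplexity, can be illustrated with a minimal character-level sketch. The class names, the add-one smoothing, and the toy training corpora below are our own illustrative choices; the experiments in this document use SRILM-trained models rather than this hand-rolled one.

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Character n-grams with left padding so every character gets a context."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]

class CharNgramLM:
    """Character n-gram language model with add-one (Laplace) smoothing."""
    def __init__(self, training_text, n=3):
        self.n = n
        self.counts = Counter(ngrams(training_text, n))
        self.context_counts = Counter(g[:-1] for g in ngrams(training_text, n))
        self.vocab = set(training_text) | {" "}

    def logprob(self, gram):
        # P(char | context) with add-one smoothing over the character vocabulary
        num = self.counts[gram] + 1
        den = self.context_counts[gram[:-1]] + len(self.vocab)
        return math.log2(num / den)

    def perplexity(self, text):
        grams = ngrams(text, self.n)
        log_likelihood = sum(self.logprob(g) for g in grams)
        return 2 ** (-log_likelihood / len(grams))

def detect(text, models):
    # Pick the language whose model assigns the lowest perplexity to the text.
    return min(models, key=lambda lang: models[lang].perplexity(text))

# Toy "corpora" standing in for real training data such as Europarl.
models = {
    "en": CharNgramLM("the quick brown fox jumps over the lazy dog " * 20),
    "es": CharNgramLM("el veloz zorro marron salta sobre el perro perezoso " * 20),
}
print(detect("the dog jumps", models))  # expected: en
```

The same decision rule underlies the perplexity-based precision measure evaluated later: a text is counted as correctly detected when its true language's model is the perplexity minimizer.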