Paper information - Language Recognition for Mono- and Multi-lingual Documents

Language Recognition for Mono- and Multi-lingual Documents

In this paper we describe language recognition algorithms for mono- and multi-lingual documents that are based on mixed-order n-grams, Markov chains, maximum likelihood, and dynamic programming. We compare the monolingual algorithm to those suggested by other researchers. The comparison suggests that the monolingual algorithm significantly outperforms commonly used language recognition algorithms. We then describe the multilingual algorithm, which segments a multilingual document into single-language chunks and identifies the language of each chunk.
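
The abstract names the ingredients but not the details, so the following is only a minimal sketch of how such a pipeline might look, not the authors' implementation. It assumes a fixed-order character bigram Markov chain with add-one smoothing in place of the paper's mixed-order n-grams, picks the language by maximum likelihood, and segments a multilingual word sequence with a Viterbi-style dynamic program. The names `train`, `log_likelihood`, `classify`, `segment`, the constant `ALPHABET_SIZE`, the `switch_penalty` parameter, and the tiny training texts are all hypothetical.

```python
import math
from collections import Counter

ALPHABET_SIZE = 256  # assumed smoothing constant, not taken from the paper


def train(text):
    """Count character bigrams and their one-character contexts."""
    text = " " + text.lower()
    pairs = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return pairs, contexts


def log_likelihood(model, text):
    """Log-probability of text under a first-order Markov chain (add-one smoothing)."""
    pairs, contexts = model
    text = " " + text.lower()
    return sum(
        math.log((pairs[(a, b)] + 1) / (contexts[a] + ALPHABET_SIZE))
        for a, b in zip(text, text[1:])
    )


def classify(text, models):
    """Maximum-likelihood choice among the trained language models."""
    return max(models, key=lambda lang: log_likelihood(models[lang], text))


def segment(words, models, switch_penalty=2.0):
    """Label each word of a non-empty list, paying a penalty per language switch."""
    langs = list(models)
    best = {l: log_likelihood(models[l], words[0]) for l in langs}
    back = []  # back[i][l] = best previous language if word i+1 is labelled l
    for word in words[1:]:
        new_best, pointers = {}, {}
        for l in langs:
            def cost(p, l=l):
                return best[p] - (switch_penalty if p != l else 0.0)
            prev = max(langs, key=cost)
            new_best[l] = cost(prev) + log_likelihood(models[l], word)
            pointers[l] = prev
        best = new_best
        back.append(pointers)
    lang = max(best, key=best.get)
    labels = [lang]
    for pointers in reversed(back):  # backtrack from the last word to the first
        lang = pointers[lang]
        labels.append(lang)
    return labels[::-1]


# Toy training data, far too small for real use (illustration only).
models = {
    "en": train("the quick brown fox jumps over the lazy dog "
                "and the cat sat on the mat with this and that"),
    "es": train("el rápido zorro marrón salta sobre el perro perezoso "
                "y el gato se sentó en la alfombra con esto"),
}

mixed = "the cat sat on the mat el gato se sentó en la alfombra".split()
labels = segment(mixed, models)
```

Here `classify("the dog and the cat", models)` picks the model with the highest log-likelihood, and `labels` holds one language tag per word of `mixed`. The switch penalty is one simple way to discourage spurious single-word language changes; the paper may handle chunk boundaries differently.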

Ron Zacharski | Jim Cowie | Yevgeny Ludovik

[1] W. B. Cavnar, et al. N-gram-based text categorization, 1994.

[2] Ted E. Dunning, et al. Statistical Identification of Language, 1994.