Language Recognition for Mono- and Multi-lingual Documents
In this paper we describe language recognition algorithms for mono- and multilingual documents that are based on mixed-order n-grams, Markov chains, maximum likelihood, and dynamic programming. We compare the monolingual algorithm with those suggested by other researchers; the comparison suggests that it significantly outperforms commonly used language recognition algorithms. We then describe the multilingual algorithm, which segments a multilingual document into single-language chunks and identifies the language of each chunk.
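The ingredients named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes fixed-order character bigrams rather than mixed-order n-grams, add-one smoothing, and a hypothetical per-character `switch_penalty` for changing language. All class and function names are illustrative. Each language is modelled as a Markov chain over characters, a monolingual document is assigned to the language with the highest log-likelihood, and a multilingual document is segmented with a Viterbi-style dynamic program.

```python
import math
from collections import Counter
from itertools import groupby


class CharBigramModel:
    """Character bigram Markov chain with add-one smoothing (an assumption)."""

    def __init__(self, text):
        self.pair_counts = Counter(zip(text, text[1:]))
        self.context_counts = Counter(text[:-1])
        self.vocab_size = len(set(text)) + 1  # one extra slot for unseen characters

    def log_prob(self, prev, cur):
        # Smoothed estimate of log P(cur | prev).
        num = self.pair_counts[(prev, cur)] + 1
        den = self.context_counts[prev] + self.vocab_size
        return math.log(num / den)

    def score(self, text):
        # Log-likelihood of the text under this language's chain.
        return sum(self.log_prob(a, b) for a, b in zip(text, text[1:]))


def identify(text, models):
    """Monolingual case: pick the language that maximizes the likelihood."""
    return max(models, key=lambda lang: models[lang].score(text))


def segment(text, models, switch_penalty=5.0):
    """Multilingual case: label every character with a language (Viterbi)."""
    langs = list(models)
    if len(text) < 2:
        return [identify(text, models)] * len(text)
    best = {lang: 0.0 for lang in langs}  # best log-score of a path ending in lang
    back = []                             # back-pointers, one dict per transition
    for a, b in zip(text, text[1:]):
        new_best, pointers = {}, {}
        for lang in langs:
            def cost(p, lang=lang):
                # Score of arriving in `lang` from previous language `p`.
                return best[p] - (switch_penalty if p != lang else 0.0)
            prev = max(langs, key=cost)
            new_best[lang] = cost(prev) + models[lang].log_prob(a, b)
            pointers[lang] = prev
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the best final language.
    lang = max(langs, key=lambda l: best[l])
    labels = [lang]
    for pointers in reversed(back):
        lang = pointers[lang]
        labels.append(lang)
    labels.reverse()
    return labels


def chunks(text, labels):
    """Collapse per-character labels into (language, substring) chunks."""
    out, start = [], 0
    for lang, group in groupby(labels):
        n = len(list(group))
        out.append((lang, text[start:start + n]))
        start += n
    return out
```

Given a dictionary `models` mapping language names to `CharBigramModel` instances trained on sample text, `identify(doc, models)` returns one language name, and `chunks(doc, segment(doc, models))` returns the single-language pieces. The switch penalty trades off sensitivity against spurious language changes on short spans, so its value would need tuning on real data.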
[1] W. B. Cavnar et al. N-gram-based text categorization, 1994.
[2] Ted E. Dunning et al. Statistical Identification of Language, 1994.