Paper information - Identification of Document Language is Not yet a Completely Solved Problem

Existing Language Identification (LID) approaches reach 100% precision in most common situations, namely when dealing with documents written in just one language and when those documents are large enough. However, LID approaches do not provide a reliable solution in some situations, such as when there is a need to discriminate the variant of the language used in a text, for example European or Brazilian Portuguese, UK or US English, or any other language variants. Another hard context occurs with small tourist advertisements on the web, which address foreigners but use the local language to name most local entities. In this paper, we present a fully statistics-based LID approach which learns the most discriminant information according to each context, and identifies the correct language or language variant a text is written in. This methodology is shown to be correct for normal texts and maintains its robustness in hard LID contexts.

José Gabriel Pereira Lopes | Joaquim Ferreira da Silva
