Linguini: language identification for multilingual documents

We present Linguini, a vector-space-based categorizer tailored for high-precision language identification. We show how its accuracy depends on the size of the input document, the set of languages under consideration, and the features used. We found that Linguini could identify the language of documents as short as 5-10% of the size of average Web documents with 100% accuracy. We also describe how to determine whether a document is written in two or more languages, and in what proportions, without incurring any appreciable computational overhead beyond that of monolingual analysis. This approach can be applied to subject categorization systems: when such a system recommends two or more categories, it can distinguish between a document that belongs strongly to all of them and one that really belongs to none.
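
To make the vector-space idea concrete, the following is a minimal sketch and not Linguini's actual implementation. It assumes character trigram counts as features, cosine similarity as the match score, and, for the two-language case, a least-squares fit of the document vector onto two language profiles. The abstract does not confirm any of these choices, and the names `ngram_vector`, `identify`, `two_language_share`, and the tiny `TRAINING` texts are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def ngram_vector(text, n=3):
    """Count character n-grams of lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def dot(a, b):
    """Dot product of two sparse count vectors."""
    return sum(count * b[gram] for gram, count in a.items() if gram in b)

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is empty."""
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def identify(text, profiles, n=3):
    """Return the best-scoring language and all cosine scores."""
    doc = ngram_vector(text, n)
    scores = {lang: cosine(doc, prof) for lang, prof in profiles.items()}
    return max(scores, key=scores.get), scores

def two_language_share(text, profile_a, profile_b, n=3):
    """Assumed mixture model: doc ~ a*A + b*B with A, B unit-length profiles.

    Solves the 2x2 least-squares system in closed form and returns a / (a + b),
    the estimated share of language A. Profiles must be non-empty.
    """
    doc = ngram_vector(text, n)
    norm_a = math.sqrt(dot(profile_a, profile_a))
    norm_b = math.sqrt(dot(profile_b, profile_b))
    s_a = dot(doc, profile_a) / norm_a          # projection onto A
    s_b = dot(doc, profile_b) / norm_b          # projection onto B
    c = dot(profile_a, profile_b) / (norm_a * norm_b)  # overlap of A and B
    det = 1.0 - c * c
    if det < 1e-12:
        raise ValueError("profiles are indistinguishable")
    a = max((s_a - c * s_b) / det, 0.0)         # clamp small negative weights
    b = max((s_b - c * s_a) / det, 0.0)
    return a / (a + b) if a + b else 0.5

# Toy training texts (hypothetical; real profiles need far more data).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "sat on the mat with the other animals",
    "french": "le renard brun rapide saute par dessus le chien paresseux "
              "et le chat est sur le tapis avec les autres animaux",
}
PROFILES = {lang: ngram_vector(text) for lang, text in TRAINING.items()}

if __name__ == "__main__":
    print(identify("the dog and the cat", PROFILES)[0])
    mixed = TRAINING["english"] + " " + TRAINING["french"]
    print(two_language_share(mixed, PROFILES["english"], PROFILES["french"]))
```

The mixture estimate reuses the same dot products that monolingual scoring already computes, plus one precomputable overlap term between the two profiles, which is one way the extra cost of bilingual analysis could stay small. The share it reports is weighted by n-gram vector magnitude rather than by word count, so it is only a rough proxy for the proportion of text in each language.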
