A Fine-Grained Model for Language Identification

State-of-the-art techniques for identifying the language of a written text most often use a 3-gram frequency table as the basis for 'fingerprinting' a language. This approach performs very well in practice (roughly 99% accuracy) when the text to be classified is about 100 characters or longer, but it cannot reliably classify shorter input, nor can it detect whether the input is a concatenation of text from several languages. The present paper describes a more fine-grained model that aims at reliable classification of input as short as one word. It is heavier than the classic classifiers in that it stores a large frequency dictionary as well as an affix table, but it gains significantly in elegance because the classifier is entirely unsupervised. The target application for which the method was developed is the classification of short input queries in multilingual information retrieval, but tools such as spell-checkers will also benefit from recognising occasional interspersed foreign words. We acknowledge that many practical applications do not need this level of granularity and thus gain little from the new model. Not having access to real-world multilingual query data, we evaluate the model rigorously on a 32-language parallel Bible corpus and show that accuracy is competitive on short input as well as on multilingual input, and not only for a set of European languages with similar morphological typology.
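For orientation, the classic 3-gram fingerprinting baseline mentioned above can be sketched roughly as follows. This is a minimal illustration, not the model proposed in this paper: it assumes one common variant of the baseline, a rank-order ("out-of-place") comparison of trigram profiles, and the function names, the `top_k` cutoff, and the toy training sentences are all hypothetical choices made for the example.

```python
from collections import Counter


def trigram_profile(text, top_k=300):
    """Return the top_k most frequent character 3-grams, most frequent first.

    The text is lower-cased and padded with one space on each side so that
    word-initial and word-final trigrams are captured. top_k is an assumed
    cutoff, not a value taken from the paper.
    """
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return [gram for gram, _ in counts.most_common(top_k)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles (lower means closer).

    A trigram missing from the language profile costs a fixed maximum
    penalty equal to the length of that profile.
    """
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def classify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = trigram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


if __name__ == "__main__":
    # Toy training data (hypothetical); a real table is built from large corpora.
    samples = {
        "en": "the quick brown fox jumps over the lazy dog",
        "de": "der schnelle braune fuchs springt über den faulen hund",
    }
    profiles = {lang: trigram_profile(text) for lang, text in samples.items()}
    print(classify("the dog and the fox", profiles))
```

With a one-word input the document profile contains only a handful of trigrams, so the distances to the different language profiles tend to lie close together; this is the short-input weakness that the finer-grained model is meant to address.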
