Effective language identification of forum texts based on statistical approaches

Abstract This investigation addresses language identification of noisy texts, which is often the first step in natural language processing and information retrieval tasks. Language identification is the task of automatically determining the language of a given text. Although several methods exist in the literature, their performance is often unconvincing in practice. In this contribution, we propose two statistical approaches: the high frequency approach and the nearest prototype approach. For the first, five language identification algorithms are proposed and implemented: character based identification (CBA), word based identification (WBA), special characters based identification (SCA), a sequential hybrid algorithm (HA1) and a parallel hybrid algorithm (HA2). For the second, we use 11 similarity measures combined with several types of character N-grams. For evaluation, the proposed methods are tested on forum datasets containing 32 different languages. Furthermore, the proposed approaches are compared experimentally with several reference language identification tools: LIGA, NTC, Google Translate and Microsoft Word. The results show that the proposed approaches outperform these baseline methods on forum texts.
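To make the nearest prototype idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes character trigrams as the N-gram type and cosine similarity as the similarity measure; the abstract does not name its 11 measures, so cosine is only an illustrative choice. The function names (char_ngrams, cosine, build_prototypes, identify) and the two-language toy corpus are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


def build_prototypes(samples, n=3):
    """Merge the n-gram counts of each language's texts into one prototype."""
    prototypes = {}
    for lang, texts in samples.items():
        profile = Counter()
        for t in texts:
            profile.update(char_ngrams(t, n))
        prototypes[lang] = profile
    return prototypes


def identify(text, prototypes, n=3):
    """Return the language whose prototype is most similar to the text."""
    query = char_ngrams(text, n)
    return max(prototypes, key=lambda lang: cosine(query, prototypes[lang]))


# Hypothetical toy training data; a real system would use far larger corpora.
TOY_SAMPLES = {
    "en": [
        "the quick brown fox jumps over the lazy dog",
        "this is a short forum post about the weather",
    ],
    "fr": [
        "le renard brun saute par-dessus le chien paresseux",
        "ceci est un court message de forum sur la météo",
    ],
}

prototypes = build_prototypes(TOY_SAMPLES)
print(identify("the weather is nice", prototypes))
```

Swapping cosine for another measure only requires replacing one function, which is roughly how a comparison across several similarity measures and N-gram types could be organised.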
