Automatic Language Identification: An Alternative Unsupervised Approach Using a New Hybrid Algorithm

This paper presents our research on unsupervised classification for automatic language identification. We study a new hybrid algorithm that combines K-means with artificial ants and relies on an n-gram text representation, and we propose an alternative to the standard use of both algorithms. The approach is assessed on a multilingual text corpus. Because the method requires no a priori information (number of classes, initial partition), can quickly process large amounts of data, and yields results that can be visualised, we consider the results promising and open to further work.
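
To make the n-gram text representation concrete, the following is a minimal sketch of how a text can be turned into an n-gram frequency profile and compared with another text. It assumes character trigrams, relative frequencies, and cosine similarity; the abstract does not state the n-gram unit, the weighting, or the similarity measure, and the function names (`char_ngrams`, `relative_profile`, `cosine_similarity`) are hypothetical. It does not implement the hybrid K-means and artificial-ants algorithm itself.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased text.

    Assumption: character-level n-grams; the abstract does not say
    whether characters or words are used.
    """
    lowered = text.lower()
    return Counter(lowered[i:i + n] for i in range(len(lowered) - n + 1))


def relative_profile(text, n=3):
    """Map each n-gram to its relative frequency in the text."""
    counts = char_ngrams(text, n)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse n-gram profiles (dicts)."""
    dot = sum(weight * q.get(gram, 0.0) for gram, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


if __name__ == "__main__":
    english = relative_profile("the quick brown fox jumps over the lazy dog")
    french = relative_profile("le renard brun saute par-dessus le chien paresseux")
    print(cosine_similarity(english, english))
    print(cosine_similarity(english, french))
```

Profiles of this kind could serve as the input vectors for an unsupervised clustering step, since texts in the same language tend to share frequent n-grams.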
