Evaluation and comparison of concept based and n-grams based text clustering using SOM

With the large and rapidly growing number of documents available in digital form (Internet, libraries, CD-ROM, etc.), automatic text classification has become a significant research field and a fundamental task in document processing. This paper deals with the unsupervised classification of textual documents, also called text clustering, using Kohonen's Self-Organizing Maps (SOM) in two new settings: a conceptual representation of texts and a representation based on n-grams, each used in place of a word-based representation. The effects of these combinations are examined in several experiments using four similarity measures. The Reuters-21578 corpus is used for evaluation, and results are assessed with the F-measure and entropy.
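
As a rough illustration of what an n-gram based representation can look like, the sketch below builds character n-gram count profiles for two texts and compares them with cosine similarity. The function names `char_ngrams` and `cosine_similarity`, the choice of character trigrams, and the lowercasing and whitespace normalization are assumptions made for this example. The abstract does not specify the paper's preprocessing or name its four similarity measures, so cosine is used here only as one common choice.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a text.

    Assumption for this sketch: the text is lowercased and runs of
    whitespace are collapsed to a single space before extraction.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two n-gram count profiles."""
    # Counter returns 0 for n-grams missing from profile_b.
    dot = sum(count * profile_b[gram] for gram, count in profile_a.items())
    norm_a = math.sqrt(sum(c * c for c in profile_a.values()))
    norm_b = math.sqrt(sum(c * c for c in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = char_ngrams("Oil prices rose sharply on Monday")
    doc_b = char_ngrams("Crude oil prices rise again")
    print(round(cosine_similarity(doc_a, doc_b), 3))
```

Profiles of this kind could then serve as input vectors for a SOM; one appeal of character n-grams is that they need no stemming or language-specific tokenization.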
