Evaluation of text clustering methods using WordNet

The increasing number of digitized texts now available, notably on the Web, has created an acute need for text mining techniques. Clustering systems are used more and more often in text mining, especially to analyze texts and to extract the knowledge they contain. Given the vast number of clustering algorithms and techniques available, it is highly confusing for users to choose the algorithm that best suits their target dataset. Indeed, it is very hard to determine which algorithms work best, since results depend considerably on the application and on the kind of data at hand. In this paper, we propose, study and compare three text clustering methods: an ascending hierarchical clustering method, a SOM-based clustering method and an ant-based clustering method, all of which use WordNet synsets as the terms for representing textual documents. The effects of these methods are examined in several experiments using three similarity measures: the cosine distance, the Euclidean distance and the Manhattan distance. The Reuters-21578 corpus is used for evaluation, which is carried out using the F-measure. The results show that the SOM-based clustering method using the cosine distance provides the best results.
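As an illustration of the three measures named above, the following is a minimal sketch, assuming each document is represented as a vector of synset frequencies over a shared vocabulary. The function names and the toy vectors are hypothetical, and the cosine distance is taken here as one minus the cosine similarity, which is a common convention but not necessarily the exact definition used in the paper.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity (assumed convention, not from the paper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An all-zero vector has no direction; treat it as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Return the straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Return the sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical synset-frequency vectors for two short documents.
doc_a = [2, 0, 1, 0]
doc_b = [1, 1, 0, 0]

for name, fn in [
    ("cosine", cosine_distance),
    ("euclidean", euclidean_distance),
    ("manhattan", manhattan_distance),
]:
    print(name, fn(doc_a, doc_b))
```

Unlike the other two measures, the cosine distance ignores vector length, so a long and a short document with the same proportions of synsets are treated as identical.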
