Concept-based clustering of textual documents using SOM

The classification of textual documents has been widely studied. Most classification approaches rely on supervised learning, which works for fairly small corpora where experts can label representative training sets, but is not feasible for large flows of data. Unsupervised classification methods discover latent (hidden) classes automatically while minimizing human intervention. Among the many such methods are Kohonen self-organizing maps (SOMs), which group similar objects together without prior information. In this paper, we evaluate and compare the use of SOMs for the classification of textual documents under two representations: a conceptual representation of texts and a representation based on n-grams.
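To make the n-gram setting concrete, the following is a minimal sketch of SOM-based clustering over character n-gram vectors. It is an illustration under stated assumptions, not the configuration evaluated in the paper: the n-gram size (3), the one-dimensional map of four units, the linear decay schedules, and all function names (`char_ngrams`, `vectorize`, `train_som`, `best_matching_unit`) are hypothetical choices made here for brevity.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def vectorize(texts, n=3):
    """Map texts to L2-normalized n-gram count vectors over a shared vocabulary."""
    profiles = [char_ngrams(t, n) for t in texts]
    vocab = sorted(set().union(*profiles))
    vectors = []
    for p in profiles:
        v = [float(p[g]) for g in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # avoid division by zero
        vectors.append([x / norm for x in v])
    return vectors


def best_matching_unit(weights, x):
    """Index of the map unit whose weight vector is closest to x (squared Euclidean)."""
    return min(
        range(len(weights)),
        key=lambda j: sum((w - xi) ** 2 for w, xi in zip(weights[j], x)),
    )


def train_som(vectors, n_units=4, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a 1-D SOM (a chain of units); the schedules are assumed, not from the paper."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # linearly decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.01)   # shrinking neighborhood width
        for x in rng.sample(vectors, len(vectors)):  # shuffled pass over the documents
            bmu = best_matching_unit(weights, x)
            for j in range(n_units):
                # Gaussian neighborhood: units near the winner on the chain move more.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                weights[j] = [w + lr * h * (xi - w) for w, xi in zip(weights[j], x)]
    return weights


# Toy documents, invented for illustration only.
docs = [
    "the stock market fell sharply today",
    "stock markets fall as investors sell",
    "the football team won the final match",
    "football fans celebrate the match win",
]
vectors = vectorize(docs)
som = train_som(vectors)
clusters = [best_matching_unit(som, v) for v in vectors]
print(clusters)  # one map-unit index per document
```

Each document is assigned to the unit whose weight vector is nearest, so documents that land on the same or neighboring units are treated as similar. A conceptual representation would replace `vectorize` with a mapping from texts to concept identifiers (for example, thesaurus senses) while leaving the SOM training step unchanged.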
