Text classification with self-organizing maps: Some lessons learned

Abstract: The self-organizing map is already well regarded for document classification in the information retrieval community. The map display is a highly effective and intuitive metaphor for orientation in the information space established by a document collection. In this paper we discuss ways of using self-organizing maps for document classification. Furthermore, we argue that more attention should be paid to the fact that document collections lend themselves naturally to a hierarchical structure defined by the subject matter of the documents. We take advantage of this fact by using a hierarchically organized neural network, built from a number of independent self-organizing maps, so that a genuine document taxonomy can be established. As a convenient side effect of this architecture, training time is reduced substantially and the user is given an even more intuitive metaphor for visualization. Since the individual layers of self-organizing maps represent different aspects of the document collection at different levels of detail, the neural network shows the document collection in a form comparable to an atlas, in which the user can easily select the most appropriate degree of granularity depending on the current focus of interest while exploring the document collection.
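To make the layered arrangement concrete, the following is a minimal sketch of a two-layer hierarchy of independent self-organizing maps. It is not the authors' implementation: the function names (`train_som`, `best_unit`, `train_hierarchy`), the map sizes, the learning-rate and neighborhood schedules, and the toy term-weight vectors are all assumptions chosen for illustration. The sketch trains a small top-layer map on all documents and then trains one separate second-layer map per top-layer unit, using only the documents assigned to that unit.

```python
import math
import random


def best_unit(weights, x):
    """Return the grid position of the unit whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda u: sum((wi - xi) ** 2 for wi, xi in zip(weights[u], x)),
    )


def train_som(vectors, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a small SOM; returns a dict {(row, col): weight vector}.

    The linear decay of learning rate and neighborhood radius is an
    illustrative assumption, not a schedule taken from the paper.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)
        radius = max(radius0 * (1.0 - frac), 0.5)
        # Present the documents in a fresh random order each epoch.
        for x in rng.sample(vectors, len(vectors)):
            bmu = best_unit(weights, x)
            for unit, w in weights.items():
                d2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * radius * radius))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


def train_hierarchy(vectors, top_shape=(2, 2), child_shape=(2, 2)):
    """Train a top-layer map, then one independent map per occupied top unit."""
    top = train_som(vectors, *top_shape)
    groups = {}
    for idx, x in enumerate(vectors):
        groups.setdefault(best_unit(top, x), []).append(idx)
    children = {
        unit: train_som([vectors[i] for i in members], *child_shape, seed=1)
        for unit, members in groups.items()
    }
    return top, groups, children


# Hypothetical term-weight vectors for six documents over four index terms.
docs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.9, 0.1],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
top, groups, children = train_hierarchy(docs)
for unit, members in sorted(groups.items()):
    print("top unit", unit, "-> documents", members)
```

Because each second-layer map sees only the documents of its parent unit, every individual map stays small, which is one plausible reading of why such an architecture trains faster than a single large map. Browsing then proceeds like an atlas: pick a top-layer unit for the coarse topic, then open its child map for finer detail.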
