Exploration of Document Collections with Self-Organizing Maps: A Novel Approach to Similarity Representation

Classification is one of the central issues in any system dealing with text data. The need for effective approaches has increased dramatically with the advent of massive digital libraries containing free-form documents. What is needed are powerful methods for exploring such libraries, with the detection of similarities between text documents as the overall goal. In other words, we need methods that give insight into the inherent structure of the items contained in a text archive. In this paper we demonstrate the applicability of self-organizing maps, a neural network model adhering to the unsupervised learning paradigm, to the task of text document clustering. To improve the representation of the result, we present an extension to the basic learning rule that captures the movement of the weight vectors in a two-dimensional output space for convenient visual inspection. The result of the extended training algorithm allows intuitive analysis of the similarities inherent in the input data and, most importantly, intuitive recognition of cluster boundaries.
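
For orientation, the basic self-organizing map that the paper builds on can be sketched as follows. This is a minimal illustration of the standard training loop only, not the extended learning rule proposed in the paper, whose details the abstract does not give. The toy term-frequency vectors, the function names `train_som` and `best_matching_unit`, the linear decay schedules, and all parameter values are assumptions made for this example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x (squared Euclidean)."""
    return min(
        weights,
        key=lambda unit: sum((w - xi) ** 2 for w, xi in zip(weights[unit], x)),
    )


def train_som(docs, rows, cols, steps=500, lr0=0.5, seed=0):
    """Train a basic SOM on document vectors; returns weights keyed by (row, col)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    dim = len(docs[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    radius0 = max(rows, cols) / 2.0
    for t in range(steps):
        frac = t / steps
        lr = lr0 * (1.0 - frac)                    # assumed linear learning-rate decay
        radius = max(radius0 * (1.0 - frac), 0.5)  # assumed shrinking neighbourhood
        x = docs[rng.randrange(len(docs))]         # present one random document
        winner = best_matching_unit(weights, x)
        for unit, w in weights.items():
            # Gaussian neighbourhood measured on the 2-D output grid
            d2 = (unit[0] - winner[0]) ** 2 + (unit[1] - winner[1]) ** 2
            h = math.exp(-d2 / (2.0 * radius * radius))
            for i in range(dim):
                w[i] += lr * h * (x[i] - w[i])     # move weight vector towards x
    return weights


# Toy documents as normalised term-frequency vectors (made-up data).
docs = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
]
weights = train_som(docs, rows=3, cols=3)
positions = [best_matching_unit(weights, d) for d in docs]
```

Here `positions` holds one grid coordinate per document, and documents mapped to the same or neighbouring units are the ones the map treats as similar. The standard map gives no direct indication of where cluster boundaries lie on the grid, which is the limitation the extension described above is meant to address.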
