Integrating contextual information to enhance SOM-based text document clustering

Exploration of text corpora using self-organizing maps has shown promising results in recent years. Topographic map approaches usually represent text documents with the original vector space model known from Information Retrieval. In this paper I present a two-stage model that uses features based on sentence categories as an alternative approach that includes contextual information. Algorithmic optimizations required by this computationally expensive model are presented and evaluated. A method for model-independent comparison of document maps, based on evaluating how documents are distributed on the maps, is also introduced and used to compare the results obtained with the new model and with the vector space model.
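For orientation, the vector space baseline referred to in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`tf_vectors`, `best_matching_unit`, `train_som`), the raw term-frequency weighting, the 2x2 grid, and the linear decay schedules are all hypothetical choices made for brevity, and the sentence-category model itself is not shown.

```python
import math
import random
from collections import Counter


def tf_vectors(docs):
    """Build L2-normalized term-frequency vectors (a simple vector space model)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0.0] * len(vocab)
        for w, c in Counter(d.lower().split()).items():
            v[index[w]] = float(c)
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vocab, vectors


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x (squared Euclidean)."""
    return min(
        weights,
        key=lambda pos: sum((w - xi) ** 2 for w, xi in zip(weights[pos], x)),
    )


def train_som(vectors, rows=2, cols=2, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a small rectangular SOM with a Gaussian neighborhood (assumed schedule)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays linearly from 1 towards 0
        lr = lr0 * frac                      # learning rate
        radius = max(radius0 * frac, 1e-3)   # neighborhood radius, kept positive
        for x in vectors:
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


# Toy corpus (invented for illustration only).
docs = [
    "self organizing maps cluster documents",
    "self organizing maps cluster documents",
    "sentence categories add contextual information",
    "vector space model from information retrieval",
]
vocab, vectors = tf_vectors(docs)
weights = train_som(vectors)
placement = [best_matching_unit(weights, v) for v in vectors]
```

Here `placement` holds one grid coordinate per document, so counting documents per coordinate gives the kind of document distribution on a map that the abstract refers to.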
