Enhanced phrase-based document clustering using Self-Organizing Map (SOM) architectures

The availability of large full-text document collections in electronic form has created a need for tools and techniques that help users organize these collections. Document clustering is a popular method for this purpose. The self-organizing map (SOM), an unsupervised algorithm for clustering and topographic mapping, has shown promising results in this task. Most existing SOM techniques rely on a “bag of words” document representation, in which each word in the document is treated as a separate feature and word order is ignored. In this chapter we investigate the use of phrases rather than words as features for document clustering. We present a phrase grammar extraction technique and use the extracted phrases as features in two document clustering algorithms: the self-organizing map (SOM) and the hierarchical self-organizing map (HSOM). We present results of clustering documents from the REUTERS corpus and show an improvement in clustering performance, evaluated using entropy and F-measure.
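The two evaluation measures named above can be illustrated with a minimal sketch. It assumes the definitions commonly used in the document clustering literature (size-weighted entropy of the class distribution within each cluster, and a class-weighted best-match F-measure); the abstract does not state the exact formulas used in the chapter, so the function names, the natural-log base, and the toy labels below are illustrative assumptions only.

```python
import math
from collections import Counter


def clustering_entropy(labels, clusters):
    """Size-weighted average entropy of true classes within each cluster.

    Lower is better; 0 means every cluster holds a single class.
    Assumption: natural logarithm, no normalization by log(num_classes).
    """
    n = len(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(clusters, labels))
    total = 0.0
    for (cluster, _label), count in joint.items():
        p = count / cluster_sizes[cluster]  # share of this class in the cluster
        total -= (cluster_sizes[cluster] / n) * p * math.log(p)
    return total


def clustering_f_measure(labels, clusters):
    """Class-size-weighted F-measure, taking the best cluster for each class.

    Higher is better; 1 means each class maps exactly onto one cluster.
    """
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(clusters, labels))
    best = {}
    for (cluster, label), count in joint.items():
        precision = count / cluster_sizes[cluster]
        recall = count / class_sizes[label]
        f = 2 * precision * recall / (precision + recall)
        best[label] = max(best.get(label, 0.0), f)
    return sum(class_sizes[label] / n * best[label] for label in class_sizes)


if __name__ == "__main__":
    # Hypothetical toy data: true topic per document and assigned map unit.
    true_topics = ["grain", "grain", "trade", "trade", "trade"]
    map_units = [0, 0, 1, 1, 0]
    print(clustering_entropy(true_topics, map_units))
    print(clustering_f_measure(true_topics, map_units))
```

Under these definitions an improvement shows up as a lower entropy and a higher F-measure; for a SOM, the cluster id of a document would typically be the index of its best-matching map unit.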
