Research of fast SOM clustering for text information

State-of-the-art text clustering methods suffer from the huge volume of documents with high-dimensional features. In this paper, we study fast SOM clustering for text information. Our focus is on improving the efficiency of a text clustering system while maintaining high clustering quality. To achieve this goal, we separate the system into two stages: offline and online. To make the system more efficient, feature extraction and semantic quantization are performed offline. Neurons are represented as numerical vectors in a high-dimensional space, but documents are represented as collections of important keywords, which differs from many related works and reduces the time and space requirements of the offline stage. Based on this design, we propose fast clustering techniques for the online stage, including a method for projecting documents onto the SOM output layer, a fast similarity computation method, and an incremental clustering scheme for real-time processing. We tested the system on different datasets; the results show that our approach is much more efficient than traditional methods while achieving comparable clustering quality.
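The abstract does not spell out the similarity computation, so the following is only a minimal sketch of one plausible reading: when a document is a small set of weighted keywords and each neuron is a dense vector, the dot product needs to visit only the document's keyword dimensions. The function name `best_matching_unit`, the dictionary-based document format, and the use of cosine similarity are all assumptions made for illustration, not the authors' method.

```python
import math


def best_matching_unit(doc_keywords, neurons, neuron_norms):
    """Find the neuron most similar to a sparse keyword document.

    doc_keywords: dict mapping feature index -> keyword weight (hypothetical format)
    neurons:      list of dense weight vectors (lists of floats)
    neuron_norms: Euclidean norm of each neuron, computed ahead of time
    Returns (index of best neuron, cosine similarity).
    """
    doc_norm = math.sqrt(sum(w * w for w in doc_keywords.values()))
    best_idx, best_sim = -1, -1.0
    for idx, (vec, norm) in enumerate(zip(neurons, neuron_norms)):
        # Only the document's keyword dimensions contribute to the dot product,
        # so the cost per neuron depends on the keyword count, not the vocabulary size.
        dot = sum(w * vec[j] for j, w in doc_keywords.items())
        sim = dot / (doc_norm * norm) if doc_norm and norm else 0.0
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx, best_sim


# Toy example: three neurons over a four-term vocabulary (made-up values).
neurons = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
neuron_norms = [math.sqrt(sum(x * x for x in v)) for v in neurons]
doc = {1: 2.0, 2: 1.0}  # the document mentions terms 1 and 2 only
bmu_index, bmu_similarity = best_matching_unit(doc, neurons, neuron_norms)
```

In this sketch the neuron norms are computed once and reused across documents, which is one way to keep the per-document work small in an online setting.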
