Probability-based text clustering algorithm by alternately repeating two operations

Owing to the rapid advance of internet technology, users face a large amount of raw data from the World Wide Web every day, most of which is presented as text. This creates a strong demand among internet users for efficient text analysis techniques. Because clustering is unsupervised and requires no prior knowledge, it is widely used to help analyse textual data. Unfortunately, to the best of my knowledge, almost all clustering algorithms proposed so far fail to handle large-scale text collections. To partition large-scale text collections precisely, this paper proposes a novel probability-based text clustering algorithm by alternately repeating two operations (abbreviated as PTCART). The algorithm simply repeats two operations, (a) feature set construction and (b) text partition, until the optimal partition is reached. Its convergence is also verified. Experimental results demonstrate that PTCART performs well in comparison with several popular text clustering algorithms.
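The alternation between the two operations can be illustrated with a minimal sketch. The abstract does not give PTCART's probability model or feature-selection rule, so everything specific below is an assumption made for illustration only: the function names (`build_feature_sets`, `partition`, `alternate_cluster`), the choice of the `top_n` most frequent terms as a cluster's feature set, the Laplace-smoothed term probabilities, the round-robin initial partition, and the `max_iter` safety cap. It shows the general loop shape, not the paper's actual method.

```python
import math
from collections import Counter


def build_feature_sets(docs, labels, k, top_n):
    """Operation (a), assumed form: per cluster, keep counts of its top_n most frequent terms."""
    features = []
    for c in range(k):
        counts = Counter()
        for doc, label in zip(docs, labels):
            if label == c:
                counts.update(doc)
        features.append(dict(counts.most_common(top_n)))
    return features


def partition(docs, features, vocab_size):
    """Operation (b), assumed form: assign each text to the cluster under which it is most probable."""
    labels = []
    for doc in docs:
        scores = []
        for feat in features:
            total = sum(feat.values())
            # Laplace-smoothed log-probability of the text given the cluster's feature set.
            scores.append(
                sum(math.log((feat.get(t, 0) + 1) / (total + vocab_size)) for t in doc)
            )
        # Ties are broken in favour of the lowest cluster index.
        labels.append(max(range(len(features)), key=scores.__getitem__))
    return labels


def alternate_cluster(docs, k, top_n=50, max_iter=100):
    """Repeat (a) and (b) until the partition stops changing or max_iter is hit."""
    vocab_size = len({t for doc in docs for t in doc})
    labels = [i % k for i in range(len(docs))]  # hypothetical round-robin initial partition
    for _ in range(max_iter):
        features = build_feature_sets(docs, labels, k, top_n)
        new_labels = partition(docs, features, vocab_size)
        if new_labels == labels:  # partition is stable: stop
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    # Each text is a list of tokens; tokenisation is left to the caller.
    toy_docs = [
        ["apple", "apple", "apple", "banana"],
        ["apple", "banana"],
        ["goal", "match"],
        ["goal", "goal", "goal", "match"],
    ]
    print(alternate_cluster(toy_docs, k=2))
```

In this sketch the stopping test is simply that the partition no longer changes between rounds; the cap on iterations is a precaution of the sketch and says nothing about the convergence result claimed for PTCART.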
