Text Clustering with Seeds Affinity Propagation

Based on an effective clustering algorithm, Affinity Propagation (AP), we present in this paper a novel semisupervised text clustering algorithm called Seeds Affinity Propagation (SAP). Our approach makes two main contributions: 1) a new similarity metric that captures the structural information of texts, and 2) a novel seed construction method that improves the semisupervised clustering process. To study the performance of the new algorithm, we applied it to the benchmark data set Reuters-21578 and compared it with two state-of-the-art clustering algorithms, namely the k-means algorithm and the original AP algorithm. Furthermore, we analyzed the individual impact of the two proposed contributions. Results show that the proposed similarity metric is more effective in text clustering (F-measures ca. 21 percent higher than in the AP algorithm) and that the proposed semisupervised strategy achieves both better clustering results and faster convergence (using only 76 percent of the iterations of the original AP). The complete SAP algorithm obtains a higher F-measure (ca. 40 percent improvement over k-means and AP) and lower entropy (ca. 28 percent decrease over k-means and AP), significantly improves clustering execution time with respect to k-means (20 times faster), and provides enhanced robustness compared with all other methods.
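The two quality measures quoted above, F-measure and entropy, can be illustrated with a minimal sketch. The abstract does not give the exact formulas used in the paper, so the code below assumes the definitions commonly used in text clustering evaluation: the class-size-weighted best F-score over clusters, and the cluster-size-weighted entropy of the class distribution inside each cluster (base-2 logarithm, not normalized). The function names `clustering_f_measure` and `clustering_entropy` are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from math import log2


def clustering_f_measure(labels, clusters):
    """Class-size-weighted best F-score (assumed definition, not taken from the paper)."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))  # n_ij: documents of class i in cluster j

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint.get((cls, clu), 0)
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


def clustering_entropy(labels, clusters):
    """Cluster-size-weighted entropy of class labels (assumed definition, base 2)."""
    n = len(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))

    total = 0.0
    for (cls, clu), n_ij in joint.items():
        p = n_ij / cluster_sizes[clu]  # share of class cls inside cluster clu
        total -= (cluster_sizes[clu] / n) * p * log2(p)
    return total


if __name__ == "__main__":
    # Toy data: true classes and two hypothetical cluster assignments.
    truth = ["a", "a", "b", "b"]
    print(clustering_f_measure(truth, [0, 0, 1, 1]), clustering_entropy(truth, [0, 0, 1, 1]))
    print(clustering_f_measure(truth, [0, 1, 0, 1]), clustering_entropy(truth, [0, 1, 0, 1]))
```

Under these assumed definitions, a higher F-measure and a lower entropy both indicate clusters that agree more closely with the reference classes, which is the direction of improvement reported for SAP.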
