ESPClust: An Effective Skew Prevention Method for Model-Based Document Clustering

Document clustering is necessary for information retrieval, Web data mining, and Web data management. To support very high dimensionality and the sparsity of document feature, the model-based clustering has been proved to be an intuitive choice for document clustering. However, the current model-based algorithms are prone to generating the skewed clusters, which influence the quality of clustering seriously. In this paper, the reasons of skew generating are examined and determined as the inappropriate initial model, and the interaction between the decentralization of estimation samples and the over-generalized cluster model. An effective clustering skew prevention method (ESPClust) is proposed to focus on the last reason. To break this interaction, for each cluster, ESPClust automatically selects a part of documents that most relevant to its corresponding class as the estimation samples to re-estimate the cluster model. Based on the ESPClust, two algorithms with respect to the quality and efficiency are provided for different kinds of applications. Compared with balanced model-based algorithms, the ESPClust method has less restrictions and more applicability. The experiments show that the ESPClust can avoid the clustering skew in a great degree and its Macro-F1 measure outperforms the previous methods' measure.

[1]  Joydeep Ghosh,et al.  A Unified Framework for Model-based Clustering , 2003, J. Mach. Learn. Res..

[2]  Vipin Kumar,et al.  A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput..

[3]  Ayhan Demiriz,et al.  Constrained K-Means Clustering , 2000 .

[4]  Andrew P. Sage,et al.  Uncertainty in Artificial Intelligence , 1987, IEEE Transactions on Systems, Man, and Cybernetics.

[5]  Charu C. Aggarwal,et al.  On the Surprising Behavior of Distance Metrics in High Dimensional Spaces , 2001, ICDT.

[6]  Inderjit S. Dhillon,et al.  Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.

[7]  Padhraic Smyth,et al.  A general probabilistic framework for clustering individuals and objects , 2000, KDD '00.

[8]  Dan Klein,et al.  Interpreting and Extending Classical Agglomerative Clustering Algorithms using a Model-Based approach , 2002, ICML.

[9]  Joydeep Ghosh,et al.  Model-based clustering with soft balancing , 2003, Third IEEE International Conference on Data Mining.

[10]  Marina Meila,et al.  An Experimental Comparison of Model-Based Clustering Methods , 2004, Machine Learning.

[11]  Paul R. Cohen,et al.  Bayesian Clustering by Dynamics Contents 1 Introduction 1 2 Clustering Markov Chains 2 , 2022 .

[12]  Shivakumar Vaithyanathan,et al.  Model-Based Hierarchical Clustering , 2000, UAI.

[13]  Chris Fraley,et al.  Algorithms for Model-Based Gaussian Hierarchical Clustering , 1998, SIAM J. Sci. Comput..

[14]  Thomas G. Dietterich Machine-Learning Research , 1997, AI Mag..

[15]  Joydeep Ghosh,et al.  Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres , 2002, Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No.02CH37290).

[16]  Ian Witten,et al.  Data Mining , 2000 .

[17]  A. Raftery,et al.  Model-based Gaussian and non-Gaussian clustering , 1993 .

[18]  R. Mooney,et al.  Impact of Similarity Measures on Web-page Clustering , 2000 .

[19]  Joydeep Ghosh,et al.  On Scaling Up Balanced Clustering Algorithms , 2002, SDM.

[20]  Geoffrey J. McLachlan,et al.  Finite Mixture Models , 2019, Annual Review of Statistics and Its Application.

[21]  Raghu Ramakrishnan,et al.  When Is Nearest Neighbors Indexable? , 2005, ICDT.

[22]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[23]  Sanjeev Khanna,et al.  Why and Where: A Characterization of Data Provenance , 2001, ICDT.

[24]  Jonathan Goldstein,et al.  When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[25]  Joydeep Ghosh,et al.  Probabilistic model-based clustering of complex data , 2003 .