MMPClust: A Skew Prevention Algorithm for Model-Based Document Clustering

To support very high dimensionality, model-based clustering is an intuitive choice for document clustering. However, the current model-based algorithms are prone to generating the skewed clusters, which influence the quality of clustering seriously. In this paper, the reasons of skew are examined and determined as the inappropriate initial model, the unfitness of cluster model and the interaction between the decentralization of estimation samples and the over-generalized cluster model. This paper proposes a skew prevention document-clustering algorithm (MMPClust), which has two features: (1) a content-based cluster model is used to model the cluster better; (2) at the re-estimation step, a part of documents most relevant to its corresponding class are selected automatically for each cluster as the estimation samples to break this interaction. MMPClust has less restrictions and more applicability in document clustering than the previous methods.

[1]  Inderjit S. Dhillon,et al.  Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.

[2]  Ian Witten,et al.  Data Mining , 2000 .

[3]  R. Mooney,et al.  Impact of Similarity Measures on Web-page Clustering , 2000 .

[4]  Joydeep Ghosh,et al.  Model-based clustering with soft balancing , 2003, Third IEEE International Conference on Data Mining.

[5]  Jonathan Goldstein,et al.  When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[6]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[7]  Sanjeev Khanna,et al.  Why and Where: A Characterization of Data Provenance , 2001, ICDT.

[8]  Joydeep Ghosh,et al.  Probabilistic model-based clustering of complex data , 2003 .

[9]  Joydeep Ghosh,et al.  A Unified Framework for Model-based Clustering , 2003, J. Mach. Learn. Res..

[10]  Joydeep Ghosh,et al.  On Scaling Up Balanced Clustering Algorithms , 2002, SDM.

[11]  Geoffrey J. McLachlan,et al.  Finite Mixture Models , 2019, Annual Review of Statistics and Its Application.

[12]  Vipin Kumar,et al.  A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput..

[13]  Marina Meila,et al.  An Experimental Comparison of Model-Based Clustering Methods , 2004, Machine Learning.

[14]  Paul R. Cohen,et al.  Bayesian Clustering by Dynamics Contents 1 Introduction 1 2 Clustering Markov Chains 2 , 2022 .

[15]  A. Raftery,et al.  Model-based Gaussian and non-Gaussian clustering , 1993 .

[16]  Joydeep Ghosh,et al.  Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres , 2002, Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No.02CH37290).

[17]  Shivakumar Vaithyanathan,et al.  Model-Based Hierarchical Clustering , 2000, UAI.

[18]  Charu C. Aggarwal,et al.  On the Surprising Behavior of Distance Metrics in High Dimensional Spaces , 2001, ICDT.

[19]  Padhraic Smyth,et al.  A general probabilistic framework for clustering individuals and objects , 2000, KDD '00.

[20]  Thomas G. Dietterich Machine-Learning Research , 1997, AI Mag..

[21]  Chris Fraley,et al.  Algorithms for Model-Based Gaussian Hierarchical Clustering , 1998, SIAM J. Sci. Comput..

[22]  Andrew P. Sage,et al.  Uncertainty in Artificial Intelligence , 1987, IEEE Transactions on Systems, Man, and Cybernetics.

[23]  Malcolm P. Atkinson,et al.  Issues Raised by Three Years of Developing PJama: An Orthogonally Persistent Platform for Java , 1999, ICDT.

[24]  Dan Klein,et al.  Interpreting and Extending Classical Agglomerative Clustering Algorithms using a Model-Based approach , 2002, ICML.

[25]  Ayhan Demiriz,et al.  Constrained K-Means Clustering , 2000 .