Data clustering with size constraints

Data clustering is an important and frequently used unsupervised learning method. Recent research has demonstrated that incorporating instance-level background information into traditional clustering algorithms can improve clustering performance. In this paper, we extend traditional clustering by introducing additional prior knowledge, such as the size of each cluster. We propose a heuristic algorithm that transforms size-constrained clustering problems into integer linear programming problems. Experiments on both synthetic and UCI datasets demonstrate that the proposed approach can exploit cluster size constraints to improve clustering accuracy.
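
To make the idea of size-constrained clustering as a 0-1 program concrete, the following is a minimal sketch and not the heuristic proposed in the paper. It assumes a generic assignment formulation: each point i receives exactly one cluster label, cluster j must receive exactly sizes[j] points, and the objective is the total squared distance to fixed centroids. The function name `size_constrained_assign`, the toy data, the centroids, and the target sizes are all hypothetical. The tiny instance is solved by exhaustive enumeration, which is only feasible for a handful of points; a real instance would be handed to an integer linear programming solver.

```python
from itertools import product


def size_constrained_assign(cost, sizes):
    """Solve a tiny size-constrained assignment problem by brute force.

    cost[i][j] is the (hypothetical) cost of putting point i in cluster j.
    sizes[j] is the required number of points in cluster j.
    Returns (labels, total_cost) for the cheapest feasible labelling.
    """
    n, k = len(cost), len(sizes)
    if sum(sizes) != n:
        raise ValueError("cluster sizes must sum to the number of points")

    best_labels, best_cost = None, float("inf")
    # Enumerate every labelling; each one corresponds to a 0-1 matrix x
    # with exactly one 1 per row (one cluster per point).
    for labels in product(range(k), repeat=n):
        # Size constraint: column sums of x must equal the target sizes.
        if any(labels.count(j) != sizes[j] for j in range(k)):
            continue
        total = sum(cost[i][labels[i]] for i in range(n))
        if total < best_cost:
            best_labels, best_cost = labels, total
    return list(best_labels), best_cost


# Toy 1-D data (assumed for illustration): two visually obvious groups of three.
points = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
centroids = [0.1, 1.0]   # assumed fixed, e.g. taken from a prior k-means run
sizes = [2, 4]           # prior knowledge: cluster 0 has 2 points, cluster 1 has 4

# Squared distance from every point to every centroid.
cost = [[(p - c) ** 2 for c in centroids] for p in points]

labels, total = size_constrained_assign(cost, sizes)
print(labels)
```

Without the size constraint, the nearest-centroid labelling would split the points three and three. With sizes of 2 and 4, the cheapest feasible labelling moves the point at 0.2, the member of the left group closest to the right centroid, into cluster 1.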
