Optimizing MSE for Clustering with Balanced Size Constraints

Clustering groups data so that observations in the same group are more similar to each other than to those in other groups. k-means is a popular clustering algorithm in data mining whose objective is to minimize the mean squared error (MSE). The traditional k-means algorithm is not suitable for applications in which cluster sizes need to be balanced. Given n observations, our objective is to minimize the MSE under the constraint that the observations are evenly divided into k clusters. In this paper, we propose an iterative method for clustering with balanced size constraints. Each iteration consists of two steps: an assignment step and an update step. In the assignment step, the data are evenly assigned to the clusters. This balanced assignment task is formulated as an integer linear program (ILP), and we prove that the constraint matrix of this ILP is totally unimodular. Because total unimodularity guarantees that the vertices of the feasible region are integral, the ILP can be relaxed to a linear program (LP) and solved efficiently with the simplex algorithm without losing integrality. In the update step, each center is recomputed as the centroid of the observations in its cluster. Assuming that there are n observations and that the algorithm needs m iterations to converge, we show that the average time complexity of the proposed algorithm is between O(mn^{1.65}) and O(mn^{1.70}). Experimental results indicate that, compared with state-of-the-art methods, the proposed algorithm efficiently produces more accurate clusterings.
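The two-step iteration described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `balanced_assign` and `balanced_kmeans`, the random initialization, the stopping rule (labels unchanged between iterations), and the requirement that k divides n are all assumptions made for illustration. The assignment LP uses binary-relaxed variables x_ij in [0, 1], one per observation and cluster pair, with each observation assigned exactly once and each cluster receiving exactly n/k observations. It is solved with SciPy's `linprog` using the HiGHS dual simplex method, which returns a vertex solution.

```python
import numpy as np
from scipy.optimize import linprog


def balanced_assign(X, centers):
    """Assignment step: solve the LP relaxation of the balanced assignment ILP."""
    n, k = X.shape[0], centers.shape[0]
    if n % k != 0:
        # Assumption of this sketch: clusters have exactly n/k members.
        raise ValueError("this sketch assumes that k divides n")

    # cost[i, j] = squared Euclidean distance from observation i to center j.
    cost = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

    # Variables are x[i, j] flattened row-major (index i * k + j).
    # Each observation is assigned to exactly one cluster: sum_j x[i, j] = 1.
    A_rows = np.kron(np.eye(n), np.ones((1, k)))
    # Each cluster receives exactly n/k observations: sum_i x[i, j] = n/k.
    A_cols = np.kron(np.ones((1, n)), np.eye(k))
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([np.ones(n), np.full(k, n // k)])

    # Dual simplex returns a vertex; with a totally unimodular constraint
    # matrix and integer right-hand side, that vertex is integral.
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, 1), method="highs-ds")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x.reshape(n, k).argmax(axis=1)


def balanced_kmeans(X, k, max_iter=100, seed=0):
    """Alternate balanced assignment and centroid update until labels stop changing."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct observations chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        new_labels = balanced_assign(X, centers)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: each center becomes the centroid of its cluster.
        centers = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
```

Calling `balanced_kmeans(X, k)` on an array `X` of shape (n, d) returns one label per observation and the k centers. The dense constraint matrix has (n + k) rows and n * k columns, so this sketch is only practical for small n; a sparse matrix would be the natural choice for larger inputs.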
