A data selection framework for k-means algorithm to mine high precision clusters

Traditional clustering algorithms use all data items to learn cluster patterns. In real-world applications, however, some data exhibit clear, coherent behaviour and can be summarized well, while other data show only weak tendencies toward any particular pattern. For such situations, this paper presents a data selection framework for the k-means algorithm that mines high-precision clusters from a data collection. It differs from traditional k-means-type algorithms in three respects. First, during cluster learning, the change in a cluster's Bregman Information caused by merging a data item into that cluster is used as the measure of the item's clustering tendency. Second, only data items with strong clustering tendencies, i.e., those whose merge increases the cluster's Bregman Information by less than a predefined radius, are selected to learn the cluster patterns; the remaining data points are ignored and belong to no cluster, so the clustering is non-exhaustive. Third, the cluster radius can change during the learning process, making this a dynamic learning framework. Experiments on synthetic, document, and image data show the effectiveness of the proposed algorithm.
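The selection rule can be illustrated with a minimal sketch (not the authors' implementation). Assuming the squared-Euclidean divergence, a cluster's Bregman Information is proportional to its sum of squared errors (SSE), and merging a point x into a cluster of size n with centroid mu increases the SSE by n/(n+1) * ||x - mu||^2. The sketch assigns a point only when this increase is below the radius, leaving weak points unclustered (label -1); the function name, parameters, and the fixed-radius simplification are assumptions for illustration.

```python
import numpy as np

def selective_kmeans(X, k, radius, n_iter=50, init=None, seed=0):
    """Non-exhaustive k-means sketch: a point is assigned only if merging
    it into its best cluster raises that cluster's Bregman Information
    (here, squared-Euclidean SSE) by less than `radius`; otherwise the
    point keeps the label -1 and belongs to no cluster."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    if init is None:
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    else:
        centers = np.asarray(init, dtype=float).copy()
    labels = np.full(len(X), -1, dtype=int)
    for _ in range(n_iter):
        counts = np.ones(k)  # start each cluster as a singleton (its center)
        new_labels = np.full(len(X), -1, dtype=int)
        for i, x in enumerate(X):
            d2 = ((centers - x) ** 2).sum(axis=1)
            # SSE increase from merging x into cluster j of current size n_j:
            # delta_j = n_j / (n_j + 1) * ||x - mu_j||^2
            delta = counts / (counts + 1.0) * d2
            j = int(np.argmin(delta))
            if delta[j] < radius:       # strong clustering tendency only
                new_labels[i] = j
                counts[j] += 1
        # recompute centers from the selected points only
        for j in range(k):
            members = X[new_labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers
```

For example, on two tight Gaussian blobs plus a distant outlier, a suitable radius assigns the blob points while the outlier's merge cost exceeds the radius and it remains unclustered. Note this sketch processes points sequentially (so `counts` grow within a pass) and uses a fixed radius, whereas the paper's framework lets the radius change during learning.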
