An optimization approach to partitional data clustering

Scalability of clustering algorithms is a critical issue facing the data mining community. One way to address it is to use only a subset of the instances. This paper develops an optimization-based approach to the partitional clustering problem, using an algorithm specifically designed to cope with noisy performance estimates, which arise when cluster quality is evaluated on a subset of instances rather than on all of them. Numerical results show that computation time can be dramatically reduced by using a partial set of instances without sacrificing solution quality. Moreover, the benefit becomes more pronounced as the problem size grows.
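The idea of clustering on a subset and judging the result on all instances can be illustrated with a minimal sketch. It uses plain k-means (Lloyd's iterations) as a stand-in and is not the optimization-based algorithm developed in the paper. The function names (`sse`, `kmeans`, `make_data`), the synthetic three-cluster data, and the 10% sample size are all assumptions made for illustration.

```python
import random

def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(
        min(sum((p - c) ** 2 for p, c in zip(pt, ctr)) for ctr in centers)
        for pt in points
    )

def kmeans(points, k, iters=10, rng=None):
    """Plain Lloyd-style k-means; a stand-in, not the paper's method."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for pt in points:
            nearest = min(
                range(k),
                key=lambda i: sum((p - c) ** 2 for p, c in zip(pt, centers[i])),
            )
            clusters[nearest].append(pt)
        # Recompute each center as its cluster mean; keep the old center if empty.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers

def make_data(n, rng):
    """Synthetic 2-D data around three assumed cluster means."""
    means = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
    return [tuple(rng.gauss(m, 1.0) for m in rng.choice(means)) for _ in range(n)]

rng = random.Random(42)
data = make_data(2000, rng)
sample = rng.sample(data, 200)  # cluster on 10% of the instances

centers_full = kmeans(data, 3, rng=random.Random(1))
centers_sub = kmeans(sample, 3, rng=random.Random(1))

# Both solutions are scored on the full data set so they are comparable.
sse_full = sse(data, centers_full)
sse_sub = sse(data, centers_sub)
print(f"SSE with centers fit on all instances: {sse_full:.1f}")
print(f"SSE with centers fit on the subset:    {sse_sub:.1f}")
```

Each iteration on the subset touches a tenth of the instances, which is where the time saving comes from. The objective measured on a sample is only a noisy estimate of the objective on the full data, which is the difficulty the abstract refers to.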
