Paper information - An optimization approach to partitional data clustering

An optimization approach to partitional data clustering

Scalability of clustering algorithms is a critical issue facing the data mining community. One way to address it is to cluster using only a subset of all instances. This paper develops an optimization-based approach to the partitional clustering problem, using an algorithm specifically designed to cope with noisy performance estimates, which arise when only a subset of instances is used. Numerical results show that using a partial set of instances can dramatically reduce computation time without sacrificing solution quality. These results become more pronounced as the size of the problem grows.
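The subset idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm, which is not described on this page: plain Lloyd-style k-means stands in for the optimization method, and the synthetic data, the sample size of 150, and all function names (`kmeans`, `sse`, `make_data`) are assumptions made for illustration. Centers are fitted once on all instances and once on a random subset, and both are then scored on the full data set, so the subset-based score plays the role of the noisy estimate.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style k-means (a stand-in, not the paper's method)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def make_data(n_per_cluster=500, seed=1):
    """Hypothetical test data: three Gaussian blobs in the plane."""
    rng = random.Random(seed)
    means = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
    return [
        (rng.gauss(mx, 1.0), rng.gauss(my, 1.0))
        for mx, my in means
        for _ in range(n_per_cluster)
    ]


data = make_data()
sample = random.Random(2).sample(data, 150)  # use only 10% of the instances

full_centers = kmeans(data, 3)      # fitted on all instances
sample_centers = kmeans(sample, 3)  # fitted on the subset only

# Both solutions are evaluated on the full data set for a fair comparison.
full_sse = sse(data, full_centers)
sample_sse = sse(data, sample_centers)
print(f"SSE with all instances:   {full_sse:.1f}")
print(f"SSE with a 10% subset:    {sample_sse:.1f}")
```

Each assignment pass touches only the sampled instances, so the cost per iteration shrinks roughly in proportion to the sample size. How close the two scores end up depends on the data, the sample, and the initial centers; this sketch makes no claim about the results reported in the paper.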

J. Yang | J. Kim | Sigurdur Ólafsson
