Testing of clustering

A set X of points in ℝ^d is (k,b)-clusterable if X can be partitioned into k subsets (clusters) so that the diameter (alternatively, the radius) of each cluster is at most b. We present algorithms that, by sampling from a set X, distinguish between the case that X is (k,b)-clusterable and the case that X is ε-far from being (k,b')-clusterable, for any given 0 < ε ≤ 1 and for b' ≥ b. By ε-far from being (k,b')-clusterable we mean that more than ε·|X| points must be removed from X for it to become (k,b')-clusterable. We give algorithms for a variety of cost measures that use a sample of size independent of |X| and polynomial in k and 1/ε. Our algorithms can also be used to find approximately good clusterings, namely clusterings of all but an ε-fraction of the points in X that have optimal (or close to optimal) cost. The benefit of our algorithms is that they construct an implicit representation of such clusterings in time independent of |X|. That is, without having to partition all points in X, the implicit representation can be used to answer queries about which cluster any given point belongs to.
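
The sample-then-check idea can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes d = 1 (where diameter clusterability can be decided exactly by a greedy sweep), takes b' = b, and treats `sample_size` as a free parameter chosen by the caller rather than the bound in terms of k and 1/ε. The names `is_clusterable_1d` and `sample_tester` are hypothetical.

```python
import random


def is_clusterable_1d(points, k, b):
    """Exact check for d = 1: can the points be split into at most k
    clusters, each of diameter at most b? Greedy left-to-right sweep."""
    clusters = 0
    left = None  # leftmost point of the cluster currently being filled
    for x in sorted(points):
        if left is None or x - left > b:
            # x does not fit in the current cluster, so open a new one
            clusters += 1
            left = x
            if clusters > k:
                return False
    return True


def sample_tester(points, k, b, sample_size, rng=None):
    """Accept iff a uniform random sample of the points is (k, b)-clusterable.

    sample_size is an assumption of this sketch, not a bound from the source.
    """
    rng = rng or random.Random()
    m = min(sample_size, len(points))
    sample = rng.sample(points, m)
    return is_clusterable_1d(sample, k, b)
```

Because every subset of a (k,b)-clusterable set is itself (k,b)-clusterable, this sketch never rejects a clusterable input; it can only err by accepting an input that is far from clusterable, which becomes less likely as the sample grows.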
