Subquadratic approximation algorithms for clustering problems in high dimensional spaces

One of the central problems in information retrieval, data mining, computational biology, statistical analysis, computer vision, geographic analysis, pattern recognition, distributed protocols is the question of classification of data according to some clustering rule. Often the data is noisy and even approximate classification is of extreme importance. The difficulty of such classification stems from the fact that usually the data has many incomparable attributes, and often results in the question of clustering problems in high dimensional spaces. Since they require measuring distance between every pair of data points, standard algorithms for computing the exact clustering solutions use quadratic or “nearly quadratic” running time; i.e., O(dn2−α(d)) time where n is the number of data points, d is the dimension of the space and α(d) approaches 0 as d grows. In this paper, we show (for three fairly natural clustering rules) that computing an approximate solution can be done much more efficiently. More specifically, for agglomerative clustering (used, for example, in the Alta Vista™ search engine), for the clustering defined by sparse partitions, and for a clustering based on minimum spanning trees we derive randomized (1 + ∈) approximation algorithms with running times O(d2n2−γ) where γ > 0 depends only on the approximation parameter ∈ and is independent of the dimension d.

[1]  Claude Berge,et al.  Programming, games and transportation networks , 1966 .

[2]  Andrew Chi-Chih Yao,et al.  On Constructing Minimum Spanning Trees in k-Dimensional Spaces and Related Problems , 1977, SIAM J. Comput..

[3]  W. B. Johnson,et al.  Extensions of Lipschitz mappings into Hilbert space , 1984 .

[4]  Bernard Chazelle,et al.  How to Search in History , 1983, Inf. Control..

[5]  Peter Frankl,et al.  The Johnson-Lindenstrauss lemma and the sphericity of some graphs , 1987, J. Comb. Theory, Ser. B.

[6]  Baruch Awerbuch,et al.  Sparse partitions , 1990, Proceedings [1990] 31st Annual Symposium on Foundations of Computer Science.

[7]  Michael E. Saks,et al.  Decomposing graphs into regions of small diameter , 1991, SODA '91.

[8]  Godfried T. Toussaint,et al.  Relative neighborhood graphs and their relatives , 1992, Proc. IEEE.

[9]  Lenore Cowen,et al.  Near-linear cost sequential and distributed constructions of sparse neighborhood covers , 1993, Proceedings of 1993 IEEE 34th Annual Foundations of Computer Science.

[10]  Nathan Linial,et al.  The geometry of graphs and some of its algorithmic applications , 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.

[11]  Joseph O'Rourke,et al.  Handbook of Discrete and Computational Geometry, Second Edition , 1997 .

[12]  Edith Cohen,et al.  Approximating matrix multiplication for pattern recognition tasks , 1997, SODA '97.

[13]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[14]  Jon M. Kleinberg,et al.  Two algorithms for nearest-neighbor search in high dimensions , 1997, STOC '97.

[15]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[16]  David Eppstein,et al.  Fast hierarchical clustering and other applications of dynamic closest pairs , 1999, SODA '98.

[17]  David G. Stork,et al.  Pattern classification, 2nd Edition , 2000 .

[18]  Rafail Ostrovsky,et al.  Efficient search for approximate nearest neighbor in high dimensional spaces , 1998, STOC '98.