The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data

This paper is about metric data structures in high-dimensional or non-Euclidean spaces that permit cached-sufficient-statistics accelerations of learning algorithms. It has recently been shown that, in fewer than about 10 dimensions, decorating kd-trees with additional "cached sufficient statistics" such as first and second moments and contingency tables can substantially accelerate a wide range of statistical learning tasks, including kernel regression, locally weighted regression, k-means clustering, mixture modeling, and Bayes net learning. In this paper we begin by defining the anchors hierarchy: a fast data structure and algorithm for localizing data based only on a distance metric that obeys the triangle inequality. We show how this, in its own right, gives a fast and effective clustering of the data. More importantly, we show how it can produce a well-balanced structure similar to a ball-tree (Omohundro, 1991) or a kind of metric tree (Uhlmann, 1991; Ciaccia, Patella, & Zezula, 1997) in a way that is neither "top-down" nor "bottom-up" but instead "middle-out". We then show how this structure, decorated with cached sufficient statistics, allows a wide variety of statistical learning algorithms to be accelerated even in thousands of dimensions.
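
To make the triangle-inequality pruning concrete, here is a minimal Python sketch of nearest-neighbor search over a ball-tree-like hierarchy whose nodes cache sufficient statistics (a point count and a vector sum). The `Ball` and `nearest` names, the Euclidean metric, and the choice of cached statistics are illustrative assumptions, not the paper's actual implementation.

```python
import math

def dist(a, b):
    # Any metric obeying the triangle inequality works; Euclidean is used here.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Ball:
    def __init__(self, pivot, radius, points=None, children=None):
        self.pivot = pivot            # the ball's anchor point
        self.radius = radius          # max distance from pivot to any point inside
        self.points = points or []    # raw points, stored only at leaves
        self.children = children or []
        # Cached sufficient statistics: point count and vector sum, enough
        # to recover the mean of everything under this node in O(1).
        pts = self._all_points()
        self.count = len(pts)
        self.vector_sum = [sum(c) for c in zip(*pts)] if pts else []

    def _all_points(self):
        if self.points:
            return self.points
        return [p for ch in self.children for p in ch._all_points()]

def nearest(node, q, best=(math.inf, None)):
    """Nearest neighbor of q; skips balls the triangle inequality rules out."""
    d_pivot = dist(q, node.pivot)
    # Triangle inequality: every point x inside the ball satisfies
    # dist(q, x) >= d_pivot - node.radius, so if that lower bound already
    # exceeds the best distance found, the whole ball is pruned unexamined.
    if d_pivot - node.radius >= best[0]:
        return best
    if node.points:                   # leaf: scan its points directly
        for p in node.points:
            d = dist(q, p)
            if d < best[0]:
                best = (d, p)
        return best
    # Search the child with the closer pivot first to tighten the bound early.
    for child in sorted(node.children, key=lambda c: dist(q, c.pivot)):
        best = nearest(child, q, best)
    return best

# Example: two leaf balls under one root; the second is pruned entirely.
leaf1 = Ball(pivot=(0.0, 0.0), radius=1.5, points=[(0.0, 0.0), (1.0, 1.0)])
leaf2 = Ball(pivot=(10.0, 10.0), radius=1.5, points=[(9.0, 10.0), (10.0, 10.0)])
root = Ball(pivot=(5.0, 5.0), radius=9.0, children=[leaf1, leaf2])
print(nearest(root, (0.5, 0.5)))      # -> (0.7071..., (0.0, 0.0))
```

The same lower bound, max(0, d(q, pivot) - radius), is what lets the cached statistics stand in for all of a ball's points at once, e.g. when geometric reasoning shows an entire ball must belong to a single centroid during k-means, its count and vector sum update that centroid in one step.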

[1] Jon Louis Bentley, et al. An Algorithm for Finding Best Matches in Logarithmic Expected Time, 1977, TOMS.

[2] Piet Hut, et al. A hierarchical O(N log N) force-calculation algorithm, 1986, Nature.

[3] V. Rich. Personal communication, 1989, Nature.

[4] Stephen M. Omohundro, et al. Bumptrees for Efficient Function, Constraint and Classification Learning, 1990, NIPS.

[5] Jeffrey K. Uhlmann, et al. Satisfying General Proximity/Similarity Queries with Metric Trees, 1991, Inf. Process. Lett.

[6] Andrew W. Moore, et al. Multiresolution Instance-Based Learning, 1995, IJCAI.

[7] Heikki Mannila, et al. Fast Discovery of Association Rules, 1996, Advances in Knowledge Discovery and Data Mining.

[8] Jeffrey D. Ullman, et al. Implementing Data Cubes Efficiently, 1996, SIGMOD '96.

[9] Pavel Zezula, et al. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces, 1997, VLDB.

[10] Andrew W. Moore, et al. Efficient Locally Weighted Polynomial Regression Predictions, 1997, ICML.

[11] Andrew W. Moore, et al. Very Fast EM-Based Mixture Model Clustering Using Multiresolution Kd-Trees, 1998, NIPS.

[12] Andrew W. Moore, et al. Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets, 1998, J. Artif. Intell. Res.

[13] Andrew W. Moore, et al. Accelerating exact k-means algorithms with geometric reasoning, 1999, KDD '99.

[14] P. Indyk, et al. Efficient regular data structures and algorithms for location and proximity problems, 1999, 40th Annual Symposium on Foundations of Computer Science (FOCS).

[15] Andrew W. Moore, et al. X-means: Extending K-means with Efficient Estimation of the Number of Clusters, 2000, ICML.

[16] Hanan Samet, et al. Efficient Regular Data Structures and Algorithms for Dilation, Location, and Proximity Problems, 1999, Algorithmica.