Fast Indexing and Visualization of Metric Data Sets using Slim-Trees

Many recent database applications need to deal with similarity queries. For such applications, it is important to measure the similarity between two objects using the distance between them. Focusing on this problem, this paper proposes the slim-tree, a new dynamic tree for organizing metric data sets in pages of fixed size. The slim-tree uses the triangle inequality to prune the distance calculations that are needed to answer similarity queries over objects in metric spaces. The proposed insertion algorithm uses new policies to select the nodes where incoming objects are stored. When a node overflows, the slim-tree uses a minimal spanning tree to help with the splitting. The new insertion algorithm leads to a tree with high storage utilization and improved query performance. The slim-tree is a metric access method that tackles the problem of overlaps between nodes in metric spaces and that allows one to minimize the overlap. The proposed "fat-factor" is a way to quantify whether a given tree can be improved and also to compare two trees. We show how to use the fat-factor to achieve accurate estimates of the search performance and also how to improve the performance of a metric tree through the proposed "slim-down" algorithm. This paper also presents a new tool in the slim-tree's arsenal of resources, aimed at visualizing it. Visualization is a powerful tool for interactive data mining and for the visual tracking of the behavior of a tree under updates. Finally, we present a formula to estimate the number of disk accesses in range queries. Results from experiments with real and synthetic data sets show that the new slim-tree algorithms lead to performance improvements. These results show that the slim-tree outperforms the M-tree by up to 200% for range queries. For insertion and splitting, the minimal-spanning-tree-based algorithm achieves up to 40 times faster insertions. We observed improvements of up to 40% in range queries after applying the slim-down algorithm.

[1]  Nellie Clarke Brown Trees , 1896, Savage Dreams.

[2]  J. Kruskal On the shortest spanning subtree of a graph and the traveling salesman problem , 1956 .

[3]  Walter A. Burkhard,et al.  Some approaches to best-match file searching , 1973, Commun. ACM.

[4]  共立出版株式会社 コンピュータ・サイエンス : ACM computing surveys , 1978 .

[5]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[6]  Christos Faloutsos,et al.  The R+-Tree: A Dynamic Index for Multi-Dimensional Objects , 1987, VLDB.

[7]  Nick Roussopoulos,et al.  Faloutsos: "the r+- tree: a dynamic index for multidimensional objects , 1987 .

[8]  Dennis Shasha,et al.  Utilization of B-trees with inserts, deletes and modifies , 1989, PODS '89.

[9]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[10]  Dennis Shasha,et al.  New techniques for best-match retrieval , 1990, TOIS.

[11]  Jeffrey K. Uhlmann,et al.  Satisfying General Proximity/Similarity Queries with Metric Trees , 1991, Inf. Process. Lett..

[12]  Peter N. Yianilos,et al.  Data structures and algorithms for nearest neighbor search in general metric spaces , 1993, SODA '93.

[13]  Ricardo A. Baeza-Yates,et al.  Proximity Matching Using Fixed-Queries Trees , 1994, CPM.

[14]  Tzi-cker Chiueh,et al.  Content-Based Image Indexing , 1994, VLDB.

[15]  Christos Faloutsos,et al.  Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension , 1994, PODS.

[16]  Sergey Brin,et al.  Near Neighbor Search in Large Metric Spaces , 1995, VLDB.

[17]  Christos Faloutsos,et al.  FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets , 1995, SIGMOD '95.

[18]  Takeo Kanade,et al.  Intelligent Access to Digital Video: Informedia Project , 1996, Computer.

[19]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[20]  Pavel Zezula,et al.  Indexing Metric Spaces with M-Tree , 1997, SEBD.

[21]  Z. Meral Özsoyoglu,et al.  Distance-based indexing for high-dimensional metric spaces , 1997, SIGMOD '97.

[22]  Christian Böhm,et al.  A cost model for nearest neighbor search in high-dimensional data space , 1997, PODS.

[23]  Pavel Zezula,et al.  A cost model for similarity queries in metric spaces , 1998, PODS '98.

[24]  Mario A. López,et al.  On Optimal Node Splitting for R-trees , 1998, VLDB.

[25]  Oliver Günther,et al.  Multidimensional access methods , 1998, CSUR.

[26]  Maria das Graças Volpe Nunes,et al.  Linguistic issues in the development of ReGra: A grammar checker for Brazilian Portuguese , 1998, Natural Language Engineering.

[27]  Joseph M. Hellerstein,et al.  Amdb: a visual access method development tool , 1999, Proceedings User Interfaces to Data Intensive Systems.

[28]  H. Gabriela,et al.  Cluster-preserving Embedding of Proteins , 1999 .

[29]  Z. Meral Özsoyoglu,et al.  Indexing large metric spaces for similarity search queries , 1999, TODS.

[30]  Kaizhong Zhang,et al.  Evaluating a class of distance-mapping algorithms for data mining and clustering , 1999, KDD '99.

[31]  Christos Faloutsos,et al.  Distance Exponent: A New Concept for Selectivity Estimation in Metric Trees , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[32]  Christos Faloutsos,et al.  Deflating the dimensionality curse using multiple fractal dimensions , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[33]  Christos Faloutsos,et al.  Slim-Trees: High Performance Metric Trees Minimizing Overlap Between Nodes , 2000, EDBT.

[34]  Marco Patella,et al.  Bulk Loading the M-tree , 2001 .