Indexing very high-dimensional sparse and quasi-sparse vectors for similarity searches

Abstract. Similarity queries on complex objects are usually translated into searches among their feature vectors. This paper studies indexing techniques for very high-dimensional vectors (e.g., with hundreds of dimensions) that are sparse or quasi-sparse, i.e., vectors in which only a small number of components (e.g., ten) are non-zero or significant. Building on the R-tree, the paper introduces the xS-tree, which uses lossy compression of bounding regions to guarantee a reasonable minimum fan-out within the storage space allocated to each node. The paper also evaluates the performance and scalability of the xS-tree experimentally.
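To make the fan-out concern concrete, the following is a minimal sketch and not the xS-tree itself, whose compression scheme the abstract does not describe. It shows one common way to store sparse vectors, and the arithmetic for how few entries fit in a node when every entry carries a full bounding rectangle. The function names, the 4 KB page size, and the 4-byte coordinates are illustrative assumptions, and the per-entry child pointer is ignored.

```python
import math


def to_sparse(dense, threshold=0.0):
    """Keep only components whose magnitude exceeds `threshold`.

    With threshold > 0 this turns a quasi-sparse vector into a sparse one
    and is lossy: the small components are discarded.
    """
    return {i: v for i, v in enumerate(dense) if abs(v) > threshold}


def sparse_distance(a, b):
    """Euclidean distance between sparse vectors stored as {dimension: value}."""
    dims = a.keys() | b.keys()  # only dimensions where either vector is non-zero
    return math.sqrt(sum((a.get(d, 0.0) - b.get(d, 0.0)) ** 2 for d in dims))


def full_mbr_fanout(page_bytes, dims, bytes_per_coord=4):
    """Entries per node if each entry stores a low and a high bound per dimension.

    Assumed cost model: 2 * dims coordinates per entry, nothing else.
    """
    return page_bytes // (2 * dims * bytes_per_coord)


# A vector that is non-zero in only one of its dimensions.
u = to_sparse([0.0, 0.0, 3.0, 0.0])
v = {5: 4.0}
print(u, sparse_distance(u, v))

# Assumed 4 KB page: a 2-dimensional index versus a 200-dimensional one.
print(full_mbr_fanout(4096, 2), full_mbr_fanout(4096, 200))
```

Under these assumed numbers, an uncompressed bounding rectangle in 200 dimensions occupies 1600 bytes, so only two entries fit in a node. This is the kind of collapse in fan-out that a compressed bounding region is meant to avoid.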
