Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases

During the last decade, multimedia databases have become increasingly important in many application areas such as medicine, CAD, geography, and molecular biology. An important research issue in the field of multimedia databases is the content-based retrieval of similar multimedia objects such as images, text, and videos. In contrast to searching data in a relational database, however, content-based retrieval requires similarity search as a basic functionality of the database system. Most approaches to similarity search use a so-called feature transformation, which maps important properties of the multimedia objects to high-dimensional points (feature vectors). The similarity search thus becomes a search for points in the high-dimensional feature space that are close to a given query point. Query processing in high-dimensional spaces has therefore been a very active research area over the last few years. A number of new index structures and algorithms have been proposed, and it has been shown that they considerably improve the performance of queries on large multimedia databases. Based on recent tutorials [Berchtold and Keim 1998], this survey provides an overview of the current state of the art in querying multimedia databases, describing the index structures and algorithms for efficient query processing in high-dimensional spaces. We identify the problems of processing queries in high-dimensional spaces and provide an overview of the approaches proposed to overcome them.
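
To make the idea of a feature transformation followed by a nearest-neighbor query concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a method taken from the survey: the gray-value histogram feature, the Euclidean distance, and the names `gray_histogram`, `euclidean`, and `k_nearest` are all hypothetical choices. The query is answered by a plain linear scan, which is the baseline that index structures for high-dimensional spaces aim to outperform.

```python
import math
from typing import Dict, List, Sequence, Tuple


def gray_histogram(pixels: Sequence[int], bins: int = 4) -> List[float]:
    """Hypothetical feature transformation: map gray values in [0, 255]
    to a normalized histogram, i.e. a point in a `bins`-dimensional space."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    counts = [0] * bins
    for p in pixels:
        # Clamp the bin index so that the value 255 falls into the last bin.
        counts[min(p * bins // 256, bins - 1)] += 1
    return [c / len(pixels) for c in counts]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(
    query: Sequence[float], database: Dict[str, Sequence[float]], k: int
) -> List[Tuple[float, str]]:
    """Linear-scan k-nearest-neighbor query: compute the distance to every
    stored feature vector and return the k closest as (distance, name)."""
    scored = sorted((euclidean(query, vec), name) for name, vec in database.items())
    return scored[:k]


if __name__ == "__main__":
    # Toy "images" given as short lists of gray values (made-up data).
    images = {
        "dark": [10, 20, 30, 40],
        "bright": [200, 220, 240, 250],
        "mixed": [10, 100, 150, 250],
    }
    database = {name: gray_histogram(px) for name, px in images.items()}
    query = gray_histogram([5, 15, 25, 70])
    for dist, name in k_nearest(query, database, k=2):
        print(f"{name}: {dist:.3f}")
```

The linear scan touches every feature vector on every query, so its cost grows linearly with the database size; this is the cost that the index structures covered by the survey try to reduce.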

[1]  Jürg Nievergelt,et al.  The Grid File: An Adaptable, Symmetric Multikey File Structure , 1984, TODS.

[2]  H. V. Jagadish,et al.  A retrieval technique for similar shapes , 1991, SIGMOD '91.

[3]  D. B. Lomet,et al.  A robust multi-attribute search structure , 1989, [1989] Proceedings. Fifth International Conference on Data Engineering.

[4]  Oliver Günther,et al.  Multidimensional access methods , 1998, CSUR.

[5]  C. Faloutsos Efficient Similarity Search in Sequence Databases , 1993 .

[6]  Rajiv Mehrotra,et al.  Feature-Index-Based Similar Shape Retrieval , 1997, VDB.

[7]  Sergey Brin,et al.  Near Neighbor Search in Large Metric Spaces , 1995, VLDB.

[8]  Christian Böhm,et al.  On Optimizing Nearest Neighbor Queries in High-Dimensional Data Spaces , 2001, ICDT.

[9]  Timos K. Sellis,et al.  A model for the prediction of R-tree performance , 1996, PODS.

[10]  Christos Faloutsos,et al.  Fast subsequence matching in time-series databases , 1994, SIGMOD '94.

[11]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[12]  Volker Gaede,et al.  Optimal Redundancy in Spatial Database Systems , 1995, SSD.

[13]  Christos Faloutsos,et al.  Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension , 1994, PODS.

[14]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[15]  S. Arya Nearest neighbor searching and applications , 1996 .

[16]  Yannis Manolopoulos,et al.  Performance of Nearest Neighbor Queries in R-Trees , 1997, ICDT.

[17]  Christian Böhm,et al.  A cost model for query processing in high dimensional data spaces , 2000, TODS.

[18]  Diane Greene,et al.  An implementation and performance analysis of spatial data access methods , 1989, [1989] Proceedings. Fifth International Conference on Data Engineering.

[19]  Nick Roussopoulos,et al.  Nearest neighbor queries , 1995, SIGMOD '95.

[20]  Hans-Peter Kriegel,et al.  The Buddy-Tree: An Efficient and Robust Access Method for Spatial Data Base Systems , 1990, VLDB.

[21]  Christos Faloutsos,et al.  Fast Nearest Neighbor Search in Medical Image Databases , 1996, VLDB.

[22]  Aris M. Ouksel,et al.  The Nested Interpolation Based Grid File , 1991, MFDBS.

[23]  Christian Böhm A cost model for query processing in high dimensional data spaces , 2000 .

[24]  Jon Louis Bentley,et al.  Multidimensional binary search trees used for associative searching , 1975, CACM.

[25]  Sakti Pramanik,et al.  Optimal file distribution for partial match retrieval , 1988, SIGMOD '88.

[26]  Hans-Peter Kriegel,et al.  Indexing the Solution Space: A New Technique for Nearest Neighbor Search in High-Dimensional Space , 2000, IEEE Trans. Knowl. Data Eng..

[27]  Jeffrey F. Naughton,et al.  Generalized Search Trees for Database Systems , 1995, VLDB.

[28]  Kyuseok Shim,et al.  Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases , 1995, VLDB.

[29]  Paul M. Aoki Generalizing "search" in generalized search trees , 1998, Proceedings 14th International Conference on Data Engineering.

[30]  Andreas Henrich A Distance Scan Algorithm for Spatial Access Structures , 1994, ACM-GIS.

[31]  Hanan Samet,et al.  Incremental distance join algorithms for spatial databases , 1998, SIGMOD '98.

[32]  Christos Faloutsos,et al.  Fast subsequence matching in time-series databases , 1994, SIGMOD '94.

[33]  Christos Faloutsos,et al.  Estimating the Selectivity of Spatial Queries Using the 'Correlation' Fractal Dimension , 1995, VLDB.

[34]  Stefan Berchtold,et al.  Independence Diagrams: A Technique for Visual Data Mining , 1998, KDD.

[35]  Christos Faloutsos,et al.  Active Storage for Large-Scale Data Mining and Multimedia , 1998, VLDB.

[36]  Stefan Berchtold,et al.  High-dimensional index structures database support for next decade's applications (tutorial) , 1998, SIGMOD '98.

[37]  Christos Faloutsos,et al.  Analysis of object oriented spatial access methods , 1987, SIGMOD '87.

[38]  Peter N. Yianilos,et al.  Data structures and algorithms for nearest neighbor search in general metric spaces , 1993, SODA '93.

[39]  Walter A. Burkhard,et al.  Some approaches to best-match file searching , 1973, Commun. ACM.

[40]  Christos Faloutsos,et al.  Multiattribute hashing using Gray codes , 1986, SIGMOD '86.

[41]  James K. Mullin Retrieval—Update speed tradeoffs using combined indices , 1971, CACM.

[42]  Hans-Peter Kriegel,et al.  The pyramid-technique: towards breaking the curse of dimensionality , 1998, SIGMOD '98.

[43]  Donald E. Knuth,et al.  The art of computer programming, volume 3: (2nd ed.) sorting and searching , 1998 .

[44]  Aris M. Ouksel The interpolation-based grid file , 1985, PODS '85.

[45]  H. V. Jagadish Spatial search with polyhedra , 1990, [1990] Proceedings. Sixth International Conference on Data Engineering.

[46]  Michael Freeston,et al.  The BANG file: A new kind of grid file , 1987, SIGMOD '87.

[47]  John G. Cleary,et al.  Analysis of an Algorithm for Finding Nearest Neighbors in Euclidean Space , 1979, TOMS.

[48]  Klaus H. Hinrichs,et al.  Implementation of the grid file: Design concepts and experience , 1985, BIT.

[49]  Christian Böhm,et al.  Fast parallel similarity search in multimedia databases , 1997, SIGMOD '97.

[50]  Daniel A. Keim,et al.  An Efficient Approach to Clustering in Large Multimedia Databases with Noise , 1998, KDD.

[51]  Hans-Werner Six,et al.  Twin grid files: space optimizing access schemes , 1988, SIGMOD '88.

[52]  Z. Meral Özsoyoglu,et al.  Distance-based indexing for high-dimensional metric spaces , 1997, SIGMOD '97.

[53]  Hans-Werner Six,et al.  The LSD tree: Spatial Access to Multidimensional Point and Nonpoint Objects , 1989, VLDB.

[54]  Christian Böhm,et al.  Efficiently Indexing High-Dimensional Data Spaces , 1998 .

[55]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[56]  Christos Faloutsos,et al.  Hilbert R-tree: An Improved R-tree using Fractals , 1994, VLDB.

[57]  Hans-Peter Kriegel,et al.  Fast nearest neighbor search in high-dimensional space , 1998, Proceedings 14th International Conference on Data Engineering.

[58]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[59]  Jonathan Goldstein,et al.  When Is "Nearest Neighbor" Meaningful? , 1999, ICDT.

[60]  Christian Böhm,et al.  Optimal Multidimensional Query Processing Using Tree Striping , 2000, DaWaK.

[61]  Christos Faloutsos,et al.  The R+-Tree: A Dynamic Index for Multi-Dimensional Objects , 1987, VLDB.

[62]  Christian Böhm,et al.  Independent quantization: an index compression technique for high-dimensional data spaces , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[63]  Stefan Berchtold,et al.  High-Dimensional Index Structures: Database Support for Next Decade's Applications , 2000, ICDE 2000.

[64]  Brian K. Shoichet,et al.  Molecular docking using shape descriptors , 1992 .

[65]  Pavel Zezula,et al.  A cost model for similarity queries in metric spaces , 1998, PODS '98.

[66]  Caroline M. Eastman Optimal Bucket Size for Nearest Neighbor Searching in k-d Trees , 1981, Inf. Process. Lett..

[67]  David B. Lomet,et al.  The hB-tree: a multiattribute indexing method with good guaranteed performance , 1990, TODS.

[68]  S. Muthukrishnan,et al.  Influence sets based on reverse nearest neighbor queries , 2000, SIGMOD '00.

[69]  Jon Louis Bentley,et al.  An Algorithm for Finding Best Matches in Logarithmic Expected Time , 1977, TOMS.

[70]  H. V. Jagadish,et al.  Linear clustering of objects with multiple attributes , 1990, SIGMOD '90.

[71]  K. Wakimoto,et al.  Efficient and Effective Querying by Image Content , 1994 .

[72]  Michael Stonebraker,et al.  An Analysis of Rule Indexing Implementations in Data Base Systems , 1986, Expert Database Conf..

[73]  Ekow J. Otoo,et al.  A Mapping Function for the Directory of a Multidimensional Extendible Hashing , 1984, VLDB.

[74]  Thomas Seidl,et al.  Adaptable Similarity Search in 3-D Spatial Database Systems (Abstract) , 1998, Datenbank Rundbr..

[75]  Hans-Peter Kriegel,et al.  Efficient User-Adaptable Similarity Search in Large Multimedia Databases , 1997, VLDB.

[76]  Oliver Günther,et al.  The design of the cell tree: an object-oriented index structure for geometric databases , 1989, [1989] Proceedings. Fifth International Conference on Data Engineering.

[77]  Christos Faloutsos,et al.  Parallel R-trees , 1992, SIGMOD '92.

[78]  A. Guttman,et al.  A Dynamic Index Structure for Spatial Searching , 1984, SIGMOD 1984.

[79]  Flip Korn,et al.  Influence sets based on reverse nearest neighbor queries , 2000, SIGMOD 2000.

[80]  Christian Böhm,et al.  A cost model for nearest neighbor search in high-dimensional data space , 1997, PODS.

[81]  Christian Böhm,et al.  Improving the Query Performance of High-Dimensional Index Structures by Bulk-Load Operations , 1998, EDBT.

[82]  Yannis Manolopoulos,et al.  Nearest Neighbor Queries in Shared-Nothing Environments , 1997, GeoInformatica.

[83]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[84]  Jack A. Orenstein A comparison of spatial query processing techniques for native and parameter spaces , 1990, SIGMOD '90.

[85]  Yannis Manolopoulos,et al.  Closest pair queries in spatial databases , 2000, SIGMOD '00.

[86]  Andreas Henrich,et al.  The LSD^h-tree: an access structure for feature vectors , 1998, Proceedings 14th International Conference on Data Engineering.

[87]  Douglas Comer,et al.  Ubiquitous B-Tree , 1979, CSUR.

[88]  Rajiv Mehrotra,et al.  Feature-based retrieval of similar shapes , 1993, Proceedings of IEEE 9th International Conference on Data Engineering.

[89]  Benoit B. Mandelbrot,et al.  Fractal Geometry of Nature , 1984 .

[90]  Christos Faloutsos,et al.  FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets , 1995, SIGMOD '95.

[91]  J. T. Robinson,et al.  The K-D-B-tree: a search structure for large multidimensional dynamic indexes , 1981, SIGMOD '81.

[92]  Jeffrey K. Uhlmann,et al.  Satisfying General Proximity/Similarity Queries with Metric Trees , 1991, Inf. Process. Lett..

[93]  Georgios Evangelidis,et al.  The hB $^\Pi$-tree: a multi-attribute index supporting concurrency, recovery and node consolidation , 1997, The VLDB Journal.

[94]  Nick Roussopoulos,et al.  The R+-tree: a dynamic index for multidimensional objects , 1987 .

[95]  John S. Sobolewski,et al.  Disk allocation for Cartesian product files on multiple-disk systems , 1982, TODS.

[96]  T. H. Merrett,et al.  A class of data structures for associative searching , 1984, PODS.

[97]  Karen Kukich,et al.  Techniques for automatically correcting words in text , 1992, CSUR.

[98]  Paul M. Aoki Generalizing "Search" in Generalized Search Trees (Extended Abstract) , 1998, ICDE 1998.

[99]  Ada Wai-Chee Fu,et al.  Enhanced nearest neighbour search on the R-tree , 1998, SGMD.

[100]  Hanan Samet,et al.  Ranking in Spatial Databases , 1995, SSD.

[101]  Ramesh C. Jain,et al.  Similarity indexing with the SS-tree , 1996, Proceedings of the Twelfth International Conference on Data Engineering.

[102]  David Lomet,et al.  The hB $^\Pi$-tree: a multi-attribute index supporting concurrency, recovery and node consolidation , 1997, VLDB 1997.

[103]  Georgios D. Evangelidis,et al.  The hB II -Tree: A Concurrent And Recoverable Multi-Attribute Index Structure , 1994 .

[104]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[105]  Andrew Chi-Chih Yao,et al.  A general approach to d-dimensional geometric queries , 1985, STOC '85.

[106]  Peter Widmayer,et al.  The LSD tree: spatial access to multidimensional and non-point objects , 1989, VLDB 1989.

[107]  Bernhard Seeger,et al.  A Generic Approach to Bulk Loading Multidimensional Index Structures , 1997, VLDB.

[108]  Marcel Kornacker,et al.  High-Performance Extensible Indexing , 1999, VLDB.

[109]  Manfred Schroeder,et al.  Fractals, Chaos, Power Laws: Minutes From an Infinite Paradise , 1992 .

[110]  Hans-Peter Kriegel,et al.  PLOP-hashing: A grid file without directory , 1988, Proceedings. Fourth International Conference on Data Engineering.

[111]  Harpreet Sawhney,et al.  Efficient color histogram indexing , 1994, Proceedings of 1st International Conference on Image Processing.

[112]  R. Bayer,et al.  Organization and maintenance of large ordered indices , 1970, SIGFIDET '70.

[113]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[114]  Hans-Werner Six,et al.  Globally order preserving multidimensional linear hashing , 1988, Proceedings. Fourth International Conference on Data Engineering.

[115]  P. Wintz,et al.  An efficient three-dimensional aircraft recognition algorithm using normalized fourier descriptors , 1980 .

[116]  Donald E. Knuth,et al.  The Art of Computer Programming: Volume 3: Sorting and Searching , 1998 .

[117]  Christos Faloutsos,et al.  On packing R-trees , 1993, CIKM '93.

[118]  Christos Faloutsos,et al.  Gray Codes for Partial Match and Range Queries , 1988, IEEE Trans. Software Eng..

[119]  Hans-Peter Kriegel,et al.  Multidimensional dynamic quantile hashing is very efficient for non-uniform record distributions , 1987, 1987 IEEE Third International Conference on Data Engineering.

[120]  Hans-Peter Kriegel,et al.  Multidimensional Order Preserving Linear Hashing with Partial Expansions , 1986, ICDT.

[121]  Divyakant Agrawal,et al.  Reverse Nearest Neighbor Queries for Dynamic Databases , 2000, ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.

[122]  Christos Faloutsos,et al.  Efficient Similarity Search In Sequence Databases , 1993, FODO.

[123]  M. V. Wilkes,et al.  The Art of Computer Programming, Volume 3, Sorting and Searching , 1974 .

[124]  J. L. Smith,et al.  A data structure and algorithm based on a linear key for a rectangle retrieval problem , 1983, Comput. Vis. Graph. Image Process..

[125]  Yannis Manolopoulos,et al.  Closest pair queries in spatial databases , 2000 .

[126]  Bernd-Uwe Pagel,et al.  Towards an analysis of range query performance in spatial data structures , 1993, PODS '93.

[127]  Ricardo A. Baeza-Yates,et al.  Proximity Matching Using Fixed-Queries Trees , 1994, CPM.

[128]  Peter Yianilos,et al.  Excluded middle vantage point forests for nearest neighbor search , 1998 .

[129]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[130]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[131]  Walid G. Aref,et al.  Optimization for Spatial Query Processing , 1991, Very Large Data Bases Conference.

[132]  Ramesh C. Jain,et al.  Similarity indexing: algorithms and performance , 1996, Electronic Imaging.

[133]  Christos Faloutsos,et al.  Analysis of n-Dimensional Quadtrees using the Hausdorff Fractal Dimension , 1996, VLDB.

[134]  Christos H. Papadimitriou,et al.  On the analysis of indexing schemes , 1997, PODS '97.

[135]  Sunil Arya,et al.  Accounting for boundary effects in nearest neighbor searching , 1995, SCG '95.

[136]  Tzi-cker Chiueh,et al.  Content-Based Image Indexing , 1994, VLDB.

[137]  Christos Faloutsos,et al.  Fractals for secondary key retrieval , 1989, PODS.

[138]  Yannis Manolopoulos,et al.  Similarity query processing using disk arrays , 1998, SIGMOD '98.