Managing Large Multidimensional Datasets Inside a Database System

Many modern database applications deal with large amounts of multidimensional data. Examples include content-based multimedia retrieval (high-dimensional multimedia feature data), time-series similarity retrieval, data mining/OLAP, and spatial/spatio-temporal applications. To handle multidimensional data efficiently, we need access methods (AMs) that selectively and associatively access data items in a large collection. Traditional database AMs such as the B+-tree and hashing are not suitable for multidimensional data because they can handle only one-dimensional data. Neither using multiple B+-trees (one per dimension) nor space linearization followed by B+-tree indexing is an efficient solution. We need multidimensional index structures, that is, structures that can index data on multiple dimensions simultaneously. Most multidimensional index structures proposed so far do not scale beyond 10- to 15-dimensional spaces and are hence unsuitable for the high-dimensional spaces that arise in modern database applications such as multimedia retrieval (e.g., 64-d color histograms), data mining/OLAP (e.g., 52-d bank data in clustering), and time-series/scientific/medical applications (e.g., 20-d Space Shuttle data, 64-d electrocardiogram data). In such spaces, a simple sequential scan through the entire dataset is often faster than using a multidimensional index structure. To address this need, we design and implement the hybrid tree, a multidimensional index structure that scales to high-dimensional spaces. The hybrid tree combines the positive aspects of the two types of multidimensional index structures, namely data partitioning (e.g., the R-tree and its derivatives) and space partitioning (e.g., the kdB-tree and its derivatives), to achieve search performance that scales to high dimensionalities better than either technique alone. Our experiments show that the hybrid tree scales well to high dimensionalities for real-life datasets.
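As a small illustration of the space-linearization approach mentioned above, the sketch below maps 2-d grid cells to one-dimensional keys with a Z-order (bit-interleaving) curve, the kind of key one could store in a B+-tree. Z-order is only one possible linearization and is chosen here as an assumption; the function names and the tiny query boxes are hypothetical and not taken from this work. The sketch shows one commonly cited difficulty: a compact query box can map to several disjoint key ranges, so a one-dimensional index may need many separate range scans.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one integer key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits occupy even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits occupy odd positions
    return key


def keys_in_box(x_lo: int, x_hi: int, y_lo: int, y_hi: int) -> list:
    """Sorted Z-order keys of every grid cell in an inclusive query box."""
    return sorted(
        z_order_key(x, y)
        for x in range(x_lo, x_hi + 1)
        for y in range(y_lo, y_hi + 1)
    )


def contiguous_runs(keys: list) -> int:
    """Number of maximal runs of consecutive integers in a sorted key list.

    Each run corresponds to one range scan on a one-dimensional index.
    """
    if not keys:
        return 0
    return 1 + sum(1 for a, b in zip(keys, keys[1:]) if b != a + 1)


if __name__ == "__main__":
    # A 2x2 box aligned with the curve versus the same-sized box shifted
    # so that it straddles a quadrant boundary (hypothetical examples).
    for box in [(0, 1, 0, 1), (3, 4, 3, 4)]:
        keys = keys_in_box(*box)
        print(box, keys, "range scans needed:", contiguous_runs(keys))
```

Both boxes cover four cells, but the aligned box maps to a single key range while the shifted one splits into four. The number of dimensions is fixed at two here only to keep the sketch short.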
To achieve further scalability, we develop the local dimensionality reduction (LDR) technique to reduce the dimensionality of high dimensional data. The reduced space can be indexed more effectively using a multidimensional index structure. LDR exploits local, as opposed to global, correlations in the data and
