Data space mapping for efficient I/O in large multi-dimensional databases

In this paper, we propose data space mapping techniques for storage and retrieval in multi-dimensional databases on multi-disk architectures. We identify the factors that determine the efficiency of multi-disk search over multi-dimensional data, and we develop secondary storage organization and retrieval techniques that address these factors directly. We focus in particular on high-dimensional data, for which none of the current approaches is effective. In contrast to current declustering techniques, the storage techniques in this paper consider both the inter-disk and the intra-disk organization of the data. The data space is first partitioned into buckets; the buckets are then declustered across multiple disks and clustered within each disk. Queries are executed through bucket identification techniques that locate the relevant pages. One of the partitioning techniques we discuss is especially practical for high-dimensional data, and our disk and page allocation techniques are optimal with respect to the number of I/O accesses and seek times. We provide experimental results on two real high-dimensional datasets that support our claims.
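The pipeline described above (partition into buckets, decluster across disks, cluster within each disk, then locate pages for a query) can be illustrated with a minimal sketch. This is not the paper's technique: it assumes a uniform grid partition of the unit hypercube, uses the well-known disk modulo allocation as a stand-in declustering rule, and orders buckets lexicographically within each disk as a simple stand-in for intra-disk clustering. All function names and parameters are hypothetical.

```python
from collections import defaultdict
from itertools import product


def bucket_of(point, cells_per_dim):
    """Map a point in [0, 1)^d to the coordinates of its grid bucket (assumed uniform grid)."""
    return tuple(min(int(x * cells_per_dim), cells_per_dim - 1) for x in point)


def disk_of(bucket, num_disks):
    """Disk modulo allocation (stand-in rule): sum of bucket coordinates mod number of disks."""
    return sum(bucket) % num_disks


def build_layout(cells_per_dim, dims, num_disks):
    """Assign every bucket a (disk, page) pair: decluster across disks, order within each disk."""
    per_disk = defaultdict(list)
    for bucket in product(range(cells_per_dim), repeat=dims):
        per_disk[disk_of(bucket, num_disks)].append(bucket)

    layout = {}
    for disk, buckets in per_disk.items():
        # Lexicographic order is a simple placeholder for an intra-disk clustering rule.
        buckets.sort()
        for page, bucket in enumerate(buckets):
            layout[bucket] = (disk, page)
    return layout


def range_query(layout, low, high, cells_per_dim):
    """Identify the buckets a range query touches and return the pages to read per disk."""
    lo = bucket_of(low, cells_per_dim)
    hi = bucket_of(high, cells_per_dim)
    plan = defaultdict(list)
    for bucket in product(*(range(a, b + 1) for a, b in zip(lo, hi))):
        disk, page = layout[bucket]
        plan[disk].append(page)
    # Sorted page lists let each disk serve its requests in one sweep.
    return {disk: sorted(pages) for disk, pages in plan.items()}


layout = build_layout(cells_per_dim=4, dims=2, num_disks=4)
plan = range_query(layout, low=(0.0, 0.0), high=(0.49, 0.49), cells_per_dim=4)
```

In this toy setting the query covers four buckets spread over three disks, so the busiest disk reads two pages instead of one disk reading all four. A full grid of this kind grows exponentially with the number of dimensions, which is one reason high-dimensional data calls for a different partitioning technique than the one sketched here.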
