On multidimensional data and modern disks

Under the deeply ingrained assumption that disks can efficiently access only one-dimensional data, current approaches for mapping multidimensional data to disk blocks either allow efficient access in only one dimension, sacrificing efficiency in the others, or penalize access in all dimensions equally. Yet existing technology and functions readily available inside disk firmware can identify non-contiguous logical blocks that preserve the spatial locality of multidimensional datasets. These blocks, which span on the order of a hundred adjacent tracks, can be accessed with minimal positioning cost. This paper details these technologies, analyzes their trends, and shows how they can be exposed to applications while maintaining existing abstractions. The described approach can achieve the best access efficiency that the disk technologies afford: sequential access along the primary dimension and access with minimal positioning cost along all other dimensions. Experimental evaluation of a prototype implementation demonstrates a 30% to 50% reduction in overall I/O time for multidimensional data queries compared to existing approaches.
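The one-dimension trade-off described above can be illustrated with a small sketch. It assumes a plain row-major linearization of a 2-D grid onto logical block numbers (LBNs), which is a common baseline and not the paper's mapping; the names `row_major_lbn`, `lbn_strides`, and `DIM_X` are hypothetical and chosen only for this illustration.

```python
# Hypothetical illustration: row-major mapping of a 2-D grid to logical
# block numbers (LBNs). This is a generic baseline, not the paper's method.

def row_major_lbn(x, y, dim_x):
    """Map cell (x, y) to an LBN, with x as the primary (fastest) dimension."""
    return y * dim_x + x


def lbn_strides(cells, dim_x):
    """Return the LBN distance between each pair of consecutive cells."""
    lbns = [row_major_lbn(x, y, dim_x) for x, y in cells]
    return [b - a for a, b in zip(lbns, lbns[1:])]


DIM_X = 1000  # assumed grid width, in blocks

along_x = [(x, 0) for x in range(4)]  # walk along the primary dimension
along_y = [(0, y) for y in range(4)]  # walk along the secondary dimension

print(lbn_strides(along_x, DIM_X))  # consecutive LBNs: sequential access
print(lbn_strides(along_y, DIM_X))  # LBNs DIM_X apart: a reposition per block
```

Under this baseline, a walk along the primary dimension touches consecutive LBNs, while a walk along the other dimension jumps `DIM_X` blocks each step. The approach summarized above targets that second case by placing such neighbors on blocks that the disk can reach with minimal positioning cost.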
