Optimizing retrieval and processing of multi-dimensional scientific datasets

We have developed the Active Data Repository (ADR), an infrastructure that integrates storage, retrieval, and processing of large multi-dimensional scientific datasets on distributed memory parallel machines with multiple disks attached to each node. In earlier work, we proposed three strategies for processing range queries within the ADR framework. Our experimental results show that the relative performance of the strategies changes under varying application characteristics and machine configurations. In this work we investigate approaches to guide and automate the selection of the best strategy for a given application and machine configuration. We describe analytical models to predict the relative performance of the strategies where input data elements are uniformly distributed in the attribute space of the output dataset, restricting the output dataset to be a regular d-dimensional array.

[1]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[2]  Marianne Winslett,et al.  Server-Directed Collective I/O in Panda , 1995, Proceedings of the IEEE/ACM SC95 Conference.

[3]  Joel H. Saltz,et al.  Infrastructure for Building Parallel Database Systems for Multi-Dimensional Data , 1999, IPPS/SPDP.

[4]  Joel H. Saltz,et al.  A Performance Prediction Framework for Data Intensive Applications on Large Scale Parallel Machines , 1998, LCR.

[5]  Joel H. Saltz,et al.  Titan: a high-performance remote-sensing database , 1997, Proceedings 13th International Conference on Data Engineering.

[6]  David J. DeWitt,et al.  Parallel database systems: the future of high performance database systems , 1992, CACM.

[7]  David Kotz,et al.  The galley parallel file system , 1997, ICS '96.

[8]  Doron Rotem,et al.  Multiprocessor Join Scheduling , 1993, IEEE Trans. Knowl. Data Eng..

[9]  Dror G. Feitelson,et al.  The Vesta parallel file system , 1996, TOCS.

[10]  Miron Livny,et al.  The Case for Enhanced Abstract Data Types , 1997, VLDB.

[11]  T. Kurc,et al.  Querying Very Large Multi-dimensional Datasets in ADR , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[12]  Chialin Chang Cost Models for Query Processing Strategies in the Active DataRepository , 1999 .

[13]  Christos Faloutsos,et al.  Declustering using fractals , 1993, [1993] Proceedings of the Second International Conference on Parallel and Distributed Information Systems.

[14]  David J. DeWitt,et al.  Tradeoffs in Processing Complex Join Queries via Hashing in Multiprocessor Database Machines , 1990, VLDB.

[15]  David Kotz,et al.  Disk-directed I/O for MIMD multiprocessors , 1994, OSDI '94.

[16]  Joel H. Saltz,et al.  Scalability Analysis of Declustering Methods for Multidimensional Range Queries , 1998, IEEE Trans. Knowl. Data Eng..

[17]  Joel H. Saltz,et al.  Jovian: a framework for optimizing parallel I/O , 1994, Proceedings Scalable Parallel Libraries Conference.

[18]  David J. DeWitt,et al.  The EXODUS Extensible DBMS Project: An Overview , 1989 .

[19]  Rajeev Thakur,et al.  Passion: Optimized I/O for Parallel Applications , 1996, Computer.

[20]  Rajeev Thakur,et al.  An Extended Two-Phase Method for Accessing Sections of Out-of-Core Arrays , 1996, Sci. Program..

[21]  Andrew A. Chien,et al.  PPFS: a high performance portable parallel file system , 1995, ICS '95.

[22]  Michael Stonebraker,et al.  The Implementation of Postgres , 1990, IEEE Trans. Knowl. Data Eng..

[23]  Joel H. Saltz,et al.  Query Planning for Range Queries with User-defined Aggregation onMulti-dimensional Scientific Datasets , 1999 .

[24]  Joel H. Saltz,et al.  Digital dynamic telepathology-the Virtual Microscope , 1998, AMIA.

[25]  Joel H. Saltz,et al.  Coupling Multiple Simulations via a High Performance Customizable Database System , 1999, PPSC.