Querying Very Large Multi-dimensional Datasets in ADR

Applications that make use of very large scientific datasets have become an increasingly important subset of scientific applications. In these applications, datasets are often multi-dimensional, i.e., data items are associated with points in a multi-dimensional attribute space, and access to data items is described by range queries. The basic processing involves mapping input data items to output data items, and some form of aggregation of all the input data items that project to each output data item. We have developed an infrastructure, called the Active Data Repository (ADR), that integrates storage, retrieval, and processing of multi-dimensional datasets on distributed-memory parallel architectures with multiple disks attached to each node. In this paper we address efficient execution of range queries on distributed-memory parallel machines within the ADR framework. We present three potential strategies and evaluate them under different application scenarios and machine configurations. We present experimental results on the scalability and performance of the strategies on a 128-node IBM SP.
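The processing pattern described above (select input items with a range query, map each to an output item, aggregate all inputs that project to the same output) can be illustrated with a minimal sequential sketch. This is not ADR's interface or any of the three strategies evaluated in the paper; the function `range_query_aggregate` and its parameters `project`, `combine`, and `init` are hypothetical names chosen for illustration, and the sketch ignores parallelism and disk layout entirely.

```python
from collections import defaultdict


def range_query_aggregate(items, lo, hi, project, combine, init):
    """Hypothetical sketch, not the ADR API.

    items   : iterable of (point, value), where point is a tuple of coordinates
    lo, hi  : inclusive lower/upper bounds of the range query, one per dimension
    project : maps an input point to the key of the output item it contributes to
    combine : aggregation function folding an input value into the running result
    init    : initial value of the accumulator for each output item
    """
    out = defaultdict(lambda: init)
    for point, value in items:
        # Range query: keep only items whose point lies inside the bounding box.
        if all(l <= p <= h for p, l, h in zip(point, lo, hi)):
            key = project(point)                 # map input item to output item
            out[key] = combine(out[key], value)  # aggregate into that output item
    return dict(out)


# Made-up 3-D input points projected onto a 2-D output grid (drop the last axis),
# aggregated with max.
items = [
    ((0, 0, 1), 5.0),
    ((0, 0, 2), 7.0),
    ((1, 3, 1), 2.0),
    ((9, 9, 9), 100.0),  # outside the query range, so it is skipped
]
result = range_query_aggregate(
    items,
    lo=(0, 0, 0),
    hi=(4, 4, 4),
    project=lambda p: p[:2],
    combine=max,
    init=float("-inf"),
)
print(result)
```

Passing the aggregation as a separate `combine` function makes the application-specific part of the computation explicit. The sketch assumes `combine` is commutative and associative, as `max` is here, so that the order in which input items are folded in does not affect the result.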
