Algebraic Optimization of Computations over Scientific Databases

Although scientific data analysis increasingly requires access to and manipulation of large quantities of data, current database technology fails to meet the needs of scientific processing. It lacks adequate data modeling facilities for scientific data types, physical storage structures for these types, and scientific analysis operations on data objects. Database systems for scientific users must address these shortcomings. A database system can offer many functional improvements over the combinations of scientific programs and file systems commonly used in scientific data analysis. Unfortunately, placing a database layer between the application and the file system that holds its data can degrade performance. To gain acceptance among scientists, scientific databases must provide performance comparable to, and functionality superior to, the systems scientists use today. Algebraic query optimization is one of many techniques used within database systems to improve performance, but it has not been explored for scientific data types and operations. I propose expanding the concept of a database query to include numeric computations over scientific databases, so that algebraic query optimization can be applied to the full computation, covering both the scientific analysis and the data access operations. This research introduces an integrated algebra that includes traditional database operators for pattern matching and search as well as numeric operators for scientific analysis. The use of a single integrated algebra enables automatic optimization of computations, realizing all of the benefits provided by optimization in traditional database systems. To experiment with this integrated algebra, a prototype system has been implemented for use at the University of Colorado's Space Grant College. The prototype supports many basic scientific operations, such as interpolation and digital filtering, in addition to standard relational operations. I identify a set of transformation rules for this algebra and show that these transformations can be used to achieve significant performance improvements. The results from the prototype demonstrate that scientific database computations can be effectively optimized, and that the resulting performance gains could not be realized without integrating scientific operators into database systems. These results suggest that future scientific database systems can be expected to be based on integrated retrieval and computational algebras.
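To make the idea of a transformation rule over an integrated algebra concrete, the following is a minimal sketch and not the prototype's actual algebra or rule set. It assumes a toy algebra with three hypothetical operators (`Scan`, `SelectTime`, and a pointwise numeric `Map`) and a single illustrative rule: a selection on the time attribute can be pushed below a pointwise numeric operator that changes only the value attribute. The operator names, the rule, and the sample data are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical toy algebra over (time, value) rows; not the thesis's operators.

@dataclass(frozen=True)
class Scan:
    rows: tuple                      # stored (time, value) pairs

@dataclass(frozen=True)
class Map:
    fn: Callable[[float], float]     # pointwise numeric operator on value only
    child: object

@dataclass(frozen=True)
class SelectTime:
    lo: float                        # keep rows with lo <= time <= hi
    hi: float
    child: object

def optimize(plan):
    """Apply one rewrite rule bottom-up:
    SelectTime(Map(x)) -> Map(SelectTime(x)).
    The rule is valid here because Map never changes the time attribute."""
    if isinstance(plan, SelectTime):
        child = optimize(plan.child)
        if isinstance(child, Map):
            pushed = optimize(SelectTime(plan.lo, plan.hi, child.child))
            return Map(child.fn, pushed)
        return SelectTime(plan.lo, plan.hi, child)
    if isinstance(plan, Map):
        return Map(plan.fn, optimize(plan.child))
    return plan

def evaluate(plan, calls):
    """Evaluate a plan; calls[0] counts applications of numeric functions."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    if isinstance(plan, SelectTime):
        rows = evaluate(plan.child, calls)
        return [r for r in rows if plan.lo <= r[0] <= plan.hi]
    if isinstance(plan, Map):
        out = []
        for t, v in evaluate(plan.child, calls):
            calls[0] += 1
            out.append((t, plan.fn(v)))
        return out
    raise TypeError(f"unknown operator: {plan!r}")

def to_celsius(kelvin):
    return kelvin - 273.15

# Made-up sample data: 1000 readings, one per time step.
readings = tuple((float(t), 280.0 + t) for t in range(1000))

naive = SelectTime(10.0, 19.0, Map(to_celsius, Scan(readings)))
better = optimize(naive)

naive_calls, better_calls = [0], [0]
naive_result = evaluate(naive, naive_calls)
better_result = evaluate(better, better_calls)
```

Both plans return the same rows, but the rewritten plan applies the numeric function only to the rows that survive the selection. The rule would not carry over unchanged to operators such as interpolation or digital filtering, which read neighbouring points; a rule for those would need to widen the selection window, which this sketch does not attempt.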
