R* optimizer validation and performance evaluation for local queries

Few database query optimizer models have been validated against actual performance. This paper presents the methodology and results of a thorough validation of the optimizer and evaluation of the performance of the experimental distributed relational database management system R*, which inherited and extended to a distributed environment the optimization algorithms of System R. Optimizer estimated costs and actual R* resources consumed were written to database tables using new SQL commands, permitting automated control from SQL application programs of test data collection and reduction. A number of tests were run over a wide variety of dynamically-created test databases, SQL queries, and system parameters. The results for single-table access, sorting, and local 2-table joins are reported here. The tests confirmed the accuracy of the majority of the I/O cost model, the significant contribution of CPU cost to total cost, and the need to model CPU cost in more detail than was done in System R. The R* optimizer now retains cost components separately and estimates the number of CPU instructions, including those for applying different kinds of predicates. The sensitivity of I/O cost to buffer space motivated the development of more detailed models of buffer utilization unclustered index scans and nested-loop joins often benefit from pages remaining in the buffers, whereas concurrent scans of the data pages and the index pages for multiple tables during joins compete for buffer share. Without an index on the join column of the inner table, the optimizer correctly avoids the nested-loop join, confirming the need for merge-scan joins. When the join column of the inner is indexed, the optimizer overestimates the cost of the nested-loop join, whose actual performance is very sensitive to three parameters that are extremely difficult to estimate (1) the join (result) cardinality, (2) the outer table's cardinality, and (3) the number of buffer pages available to store the inner table. Suggestions are given for improved database statistics, prefetch and page replacement strategies for the buffer manager, and the use of temporary indexes and Bloom filters (hashed semijoins) to reduce access of unneeded data.

[1]  Sumon YusufPhotoshare Care , 1890, The Hospital.

[2]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[3]  Frank P. Palermo,et al.  A Data Base Search Problem , 1974 .

[4]  Eugene Wong,et al.  Decomposition—a strategy for query processing , 1976, TODS.

[5]  Guy M. Lohman,et al.  Differential files: their application to the maintenance of large databases , 1976, TODS.

[6]  M. W. Blasgen,et al.  Storage and Access in Relational Data Bases , 1977, IBM Syst. J..

[7]  Michael Stonebraker,et al.  Distributed query processing in a relational data base system , 1978, SIGMOD Conference.

[8]  Paolo Tiberio,et al.  Considerations in developing a design tool for a relational DBMS , 1979, COMPSAC.

[9]  Alan R. Hevner,et al.  Query Processing in Distributed Database System , 1979, IEEE Transactions on Software Engineering.

[10]  Patricia G. Selinger,et al.  Access path selection in a relational database management system , 1979, SIGMOD '79.

[11]  S. B. Yao,et al.  Optimization of query evaluation algorithms , 1979, ACM Trans. Database Syst..

[12]  Michael Stonebraker,et al.  Retrospection on a database system , 1980, TODS.

[13]  Patricia G. Selinger,et al.  Access Path Selection in Distributed Database Management Systems , 1980, ICOD.

[14]  Michael Stonebraker,et al.  Analysis of Distributed Data Base Processing Strategies , 1980, VLDB.

[15]  Patricia G. Selinger,et al.  Support for repetitive transactions and ad hoc queries in System R , 1981, TODS.

[16]  Michael Stonebraker,et al.  Operating system support for database management , 1981, CACM.

[17]  Eugene Wong,et al.  Query processing in a system for distributed databases (SDD-1) , 1981, TODS.

[18]  Giovanni Maria Sacco,et al.  A Mechanism for Managing the Buffer Pool in a Relational Database System Using the Hot Set Model , 1982, VLDB.

[19]  Wesley W. Chu,et al.  Optimal Query Processing for Distributed Database Systems , 1982, IEEE Transactions on Computers.

[20]  Larry Kerschberg,et al.  Query optimization in star computer networks , 1982, TODS.

[21]  S. Ceri,et al.  Distributed Database Research at the Politechnico of Milano. , 1982 .

[22]  Clement T. Yu,et al.  On the design of a query processing strategy in a distributed database environment , 1983, SIGMOD '83.

[23]  S. B. Yao,et al.  Optimization Algorithms for Distributed Queries , 1986, IEEE Transactions on Software Engineering.

[24]  Dean Daniels,et al.  Optimization of Nested Queries in a Distributed Relational Database , 1984, VLDB.

[25]  Kjell Bratbergsengen,et al.  Hashing Methods and Relational Algebra Operations , 1984, VLDB.

[26]  Wolfgang Effelsberg,et al.  Principles of database buffer management , 1984, TODS.

[27]  Michael Stonebraker,et al.  Implementation techniques for main memory database systems , 1984, SIGMOD '84.

[28]  Paolo Tiberio,et al.  Estimating the cost of updates in a relational database , 1985, TODS.