Access Cost Estimation for Physical Database Design

Abstract In this work we propose models for access cost estimation that are suitable in the physical design of a relational database when a set of secondary indexes has to be built on some attributes of the relations. The models are tailored to deal with distinct kinds of queries (partial-match, interval, join, etc.) and are based on a measure of association, the clustering factor , which applies between an attribute and the physical location of records in a file as well as between two (sets of) attributes. The use of clustering factors and the value selectivity of a query (i.e. how many distinct values satisfy a query) allow design time models to be derived without previously needing to estimate the record selectivities (i.e. how many records satisfy a query) or the corresponding access costs of all the query instances that can occur at run time. In practice, unlike previous approaches to the problem, run time models are derived by specializing design time models, rather than vice versa. Estimation of access costs with alternative ordering criteria is also considered, and a model is proposed that allows the primary attribute to be chosen wothout the need to sort the tuples. The proposed models achieve a good tradeoff between accuracy and simplicity, without being based on restrictive assumptions as to data, and easily allow the design process to take advantage of semantic information about the application domain even if the data are not yet loaded in the database.

[1]  Dina Bitton,et al.  Estimating Block Accessses when Attributes are Correlated , 1986, VLDB.

[2]  Paolo Ciaccia Block Access Estimation for Clustered Data , 1993, IEEE Trans. Knowl. Data Eng..

[3]  Wen-Chi Hou,et al.  Statistical estimators for aggregate relational algebra queries , 1991, TODS.

[4]  Stavros Christodoulakis,et al.  Estimating Block Selectivities , 1984, Inf. Syst..

[5]  Fabio Grandi,et al.  Block Access Estimation for Clustered Data Using a Finite LRU Buffer , 1993, IEEE Trans. Software Eng..

[6]  Stavros Christodoulakis,et al.  Implications of certain assumptions in database performance evauation , 1984, TODS.

[7]  Won Kim,et al.  A new way to compute the product and join of relations , 1980, SIGMOD '80.

[8]  Pai-Cheng Chu Estimating Block Selectivities for Physical Database Design , 1992, IEEE Trans. Knowl. Data Eng..

[9]  Paolo Tiberio,et al.  An Approximation Algorithm for Secondary Index Selection in Relational Database Physical Design , 1985, Comput. J..

[10]  Kenneth F. Siler A stochastic evaluation model for database organizations in data retrieval systems , 1976, CACM.

[11]  Sonia Bergamaschi,et al.  Choice of the optimal number of blocks for data access by an index , 1986, Inf. Syst..

[12]  S. B. Yao,et al.  Approximating block accesses in database organizations , 1977, CACM.

[13]  Dario Maio,et al.  On the Optimal Ordering of Multiple-Field Tables , 1994, Data Knowl. Eng..

[14]  L. A. Goodman,et al.  Measures of association for cross classifications , 1979 .

[15]  Kyu-Young Whang,et al.  A linear-time probabilistic counting algorithm for database applications , 1990, TODS.

[16]  Alfonso F. Cardenas Analysis and performance of inverted data base structures , 1975, CACM.

[17]  Stavros Christodoulakis,et al.  Estimating record selectivities , 1983, Inf. Syst..

[18]  Dina Bitton,et al.  A general framework for computing block accesses , 1987, Inf. Syst..

[19]  David J. DeWitt,et al.  Duplicate record elimination in large data files , 1983, TODS.

[20]  Douglas Comer,et al.  The difficulty of optimum index selection , 1978, TODS.

[21]  Paolo Tiberio,et al.  Considerations in developing a design tool for a relational DBMS , 1979, COMPSAC.