Domains and Active Domains: What This Distinction Implies for the Estimation of Projection Sizes in Relational Databases

Database optimizers require statistical information about data distributions in order to estimate result sizes and access plan costs for user queries. In this context, we consider the problem of estimating the size of the projections of a database relation when measures of attribute domain cardinalities are maintained in the system. Our main theoretical contribution is a new formal model, the AD (active domain) model, which is valid under the hypotheses of attribute independence and uniform distribution of attribute values. The model is derived by considering the difference between the time-invariant domain (the set of values that an attribute can assume) and the time-dependent ("active") domain (the set of values that are actually assumed at a certain time). Early models developed under the same assumptions are shown to be formally incorrect. Since the AD model is computationally highly demanding, we also introduce an approximate, easy-to-compute model, the A²D (approximate active domain) model, which, unlike previous approximations, yields low errors over the entire parameter space of the active domain cardinalities. Finally, we extend the A²D model to the case of nonuniform distributions and present experimental results confirming the good behavior of the model.
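The distinction between a domain and its active domain can be illustrated for a single attribute with a small sketch. It assumes that each of `num_tuples` tuples draws its value independently and uniformly from a domain of `domain_size` values, and it uses the classical closed form for the expected number of distinct values under that assumption. This is not the AD or A²D model, whose formulas are not given in the abstract; the function names and parameters are hypothetical and chosen only for illustration.

```python
import random


def expected_active_domain_size(domain_size: int, num_tuples: int) -> float:
    """Expected number of distinct values actually present (the active domain)
    when num_tuples values are drawn independently and uniformly from a
    domain of domain_size values. Classical estimate, NOT the paper's AD model.
    """
    # Each domain value is absent from all tuples with probability (1 - 1/d)^n.
    prob_value_absent = (1.0 - 1.0 / domain_size) ** num_tuples
    return domain_size * (1.0 - prob_value_absent)


def simulate_active_domain_size(domain_size: int, num_tuples: int,
                                trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo check: average number of distinct values over many trials."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # The set keeps only the values that actually occur in this trial.
        active = {rng.randrange(domain_size) for _ in range(num_tuples)}
        total += len(active)
    return total / trials


if __name__ == "__main__":
    d, n = 100, 50
    print("domain size:", d)
    print("expected active domain size:", expected_active_domain_size(d, n))
    print("simulated active domain size:", simulate_active_domain_size(d, n))
```

With a domain of 100 values and 50 tuples, the active domain is expected to hold well under 50 distinct values, which is why a size estimate based on the full domain cardinality and one based on the active domain cardinality can differ noticeably.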
