Estimating record selectivities

Abstract. In this paper we examine the problem of modelling data base contents and user requests. This modelling is necessary in analytic data base performance evaluation studies in order to estimate the number of records of a file that must be retrieved in response to user requests. The CPU, I/O, and telecommunication costs of the system are expressed, directly or indirectly, in terms of these quantities. We first show that certain assumptions used for modelling data base contents, data placement on devices, and user requests are often not satisfied in actual data base environments. We then provide more detailed modelling techniques based on a multivariate statistical model, and we demonstrate their use in improving data base performance.
