A linear-time probabilistic counting algorithm for database applications

We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. The algorithm has O(q) time complexity, where q is the number of values including duplicates, and produces an estimate whose accuracy can be prespecified arbitrarily by the user, using only a small amount of space. Traditionally, accurate counts of unique values have been obtained by sorting, which has O(q log q) time complexity. Our technique, called linear counting, is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result: a load factor (number of unique values / hash table size) much larger than 1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% error). We present this technique with two important applications to database problems: (1) obtaining the column cardinality (the number of unique values in a column of a relation) and (2) obtaining the join selectivity (the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to be joined). These two parameters are important statistics used in relational query optimization and physical database design.
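To make the hashing idea concrete, the following is a minimal sketch of a linear-counting style estimator, not the authors' implementation. It assumes the commonly cited form of the estimate, n ≈ -m ln(V), where m is the hash table (bitmap) size and V is the fraction of slots left empty after one pass; the abstract itself does not state this formula. The function name `linear_count`, the use of `hashlib.blake2b`, and the one-byte-per-slot table are illustrative choices only.

```python
import hashlib
import math


def linear_count(values, m):
    """Estimate the number of distinct items in `values` with an m-slot table.

    Assumption (not stated in the abstract): the estimate is -m * ln(V),
    where V is the fraction of slots that remain empty after one pass.
    """
    slots = bytearray(m)  # one byte per slot for clarity; a real bitmap would use bits
    for v in values:
        # Hash the value's text form to a slot; duplicates always hit the same slot.
        digest = hashlib.blake2b(repr(v).encode("utf-8"), digest_size=8).digest()
        slots[int.from_bytes(digest, "big") % m] = 1
    empty = m - sum(slots)
    if empty == 0:
        # ln(0) is undefined: the table is saturated and m must be increased.
        raise ValueError("hash table is full; choose a larger m")
    return -m * math.log(empty / m)


if __name__ == "__main__":
    # 5000 values, of which 1000 are distinct, counted in a single pass.
    data = [i % 1000 for i in range(5000)]
    print(round(linear_count(data, m=2000)))
```

The pass touches each of the q input values once, which is where the O(q) bound comes from, and the table size m is the only space cost. How large a load factor can be tolerated for a given target error depends on the analysis in the paper, which this sketch does not reproduce.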
