Probabilistic counting

We present here a class of probabilistic algorithms with which one can estimate the number of distinct elements in a collection of data (typically a large file stored on disk) in a single pass, using only O(1) auxiliary storage and O(1) operations per element. We precisely quantify the accuracy-storage trade-offs: for instance, a typical accuracy of about 5% can be achieved using only 256 binary words, even for very large files. The algorithms are totally insensitive to the replicative structure of the elements in the file. They are particularly adapted to data base systems in the context of query optimization and can be implemented in a decentralized manner (thus making them also useful for distributed data base applications).
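To make these claims concrete, the following is a minimal sketch of one algorithm of this kind: a bitmap scheme with stochastic averaging. The abstract itself does not spell out the procedure, so every name here (`DistinctCounter`, `_hash64`, `_rho`), the choice of BLAKE2b as the hash function, and the 32-bit bitmap width are assumptions made for illustration. The correction constant 0.77351 is the value commonly quoted for this estimator. With m = 256 bitmaps the expected relative error is roughly 0.78/sqrt(m), or about 5%, which matches the figure given above.

```python
import hashlib

PHI = 0.77351  # correction constant commonly quoted for this estimator


def _hash64(item: str) -> int:
    """Map an item to a pseudo-random 64-bit integer (illustrative hash choice)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def _rho(x: int, width: int) -> int:
    """Index of the least significant 1-bit of x (width - 1 if x is zero)."""
    if x == 0:
        return width - 1
    return (x & -x).bit_length() - 1


class DistinctCounter:
    """Hypothetical single-pass distinct-element estimator using m bitmaps."""

    def __init__(self, m: int = 256, width: int = 32):
        self.m = m
        self.width = width
        self.bitmaps = [0] * m  # m binary words of `width` bits each

    def add(self, item: str) -> None:
        h = _hash64(item)
        bucket = h % self.m                               # which bitmap to update
        rest = (h // self.m) & ((1 << self.width) - 1)    # remaining hash bits
        self.bitmaps[bucket] |= 1 << _rho(rest, self.width)

    def merge(self, other: "DistinctCounter") -> None:
        """Combine with a counter built elsewhere (same m, width, and hash)."""
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate(self) -> float:
        total = 0
        for bitmap in self.bitmaps:
            r = 0  # index of the lowest 0-bit in this bitmap
            while r < self.width and (bitmap >> r) & 1:
                r += 1
            total += r
        return self.m / PHI * 2 ** (total / self.m)
```

Each element costs one hash and one bitwise OR, and the state is a fixed array of m words regardless of file size. Because setting a bit twice has no further effect, duplicates leave the state unchanged, which is the sense in which the estimate is insensitive to replication. Counters built on separate sites can be combined with a bitwise OR (`merge`), which illustrates the decentralized use mentioned above. This sketch makes no correction for very small cardinalities, where the raw estimate is biased upward.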