Random sampling with a reservoir

We introduce fast algorithms for selecting a random sample of <italic>n</italic> records without replacement from a pool of <italic>N</italic> records, where the value of <italic>N</italic> is unknown beforehand. The main result of the paper is the design and analysis of Algorithm Z; it does the sampling in one pass using constant space and in <italic>O</italic>(<italic>n</italic>(1 + log(<italic>N/n</italic>))) expected time, which is optimum, up to a constant factor. Several optimizations are studied that collectively improve the speed of the naive version of the algorithm by an order of magnitude. We give an efficient Pascal-like implementation that incorporates these modifications and that is suitable for general use. Theoretical and empirical results indicate that Algorithm Z outperforms current methods by a significant margin.