Cheaper by the Dozen: Batched Algorithms

While computing power and memory size have been steadily increasing as predicted by Moore's Law, they are still dwarfed by the massive data sets arising in many applications. Problems in astrophysics, computational biology, telecommunications, and the Internet often come with accompanying data in the terabyte range. Analyzing this data with classical algorithms is often prohibitively expensive, so new algorithmic ideas are needed to cope with these massive data sets.

In this paper we develop the idea of batching, that is, processing several queries at a time, to obtain more efficient algorithms for several query problems. Compared with the classical approach of placing the massive data set in a data structure, our algorithms offer three advantages: improved asymptotic performance, significantly smaller data structures, and a number of I/Os that is linear in the size of the massive data set. We use two techniques, query data structures and sampling, in the design of our batched algorithms.

We also believe that batched algorithms have many practical implications. Consider a web page that answers queries on a large data set. Instead of answering these queries one at a time, which can create a substantial bottleneck, we wait for several queries to accumulate and then apply a batched algorithm that answers them significantly faster.

To illustrate the idea of batched algorithms, we consider the dictionary problem. Suppose we begin with n unsorted items. If we have only one query, it does not pay to place the n items in a data structure; the best we can do is the brute-force method of comparing the query with all n items. Now suppose we have b queries. If b is large enough and we have enough space, it makes sense to build a data structure such as a binary search tree or a perfect hash table. However, if 1 < b << n, we can do better. We simply sort the list of the b queries, in O(b log b) time, and then compare each of the n items against the sorted queries by binary search, answering all b queries in O(n log b) total time without ever building a structure over the n items.
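To make the dictionary example concrete, the following Python sketch implements this batched approach. The function name batched_lookup and its interface are ours, chosen for illustration; they are not taken from the paper.

from bisect import bisect_left

def batched_lookup(items, queries):
    """Answer b membership queries against n unsorted items.

    A minimal sketch of the batched dictionary idea: sort the b
    queries once (O(b log b)), then scan the n items and locate each
    one among the sorted queries by binary search (O(log b) per item).
    Total: O((n + b) log b), with no data structure built over the
    n items themselves.
    """
    sorted_queries = sorted(set(queries))        # O(b log b)
    found = {q: False for q in sorted_queries}
    for item in items:                           # one linear scan over the n items
        i = bisect_left(sorted_queries, item)    # O(log b) binary search
        if i < len(sorted_queries) and sorted_queries[i] == item:
            found[item] = True
    return [found[q] for q in queries]           # answers in original query order

For instance, batched_lookup([5, 3, 9, 1], [3, 7]) returns [True, False]. Note that the roles are reversed relative to the classical approach: it is the small query set, not the massive item set, that gets organized into a searchable structure.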