Fast Computation of Sparse Datacubes

Datacube queries compute aggregates over database relations at a variety of granularities, and they constitute an important class of decision support queries. Real-world data is frequently sparse, and hence efficiently computing datacubes over large sparse relations is important. We show that current techniques for computing datacubes over sparse relations do not scale well with the number of CUBE BY attributes, especially when the relation is much larger than main memory. We propose a novel algorithm for the fast computation of datacubes over sparse relations, and demonstrate the efficiency of our algorithm using synthetic, benchmark and real-world data sets. When the relation fits in memory, our technique performs multiple in-memory sorts, and does not incur any I/O beyond the input of the relation and the output of the datacube itself. When the relation does not fit in memory, a divideand-conquer strategy divides the problem of computing the datacube into several simpler computations of sub-datacubes. Often, all but one of the sub-datacubes can be computed in memory and our in-memory solution applies. In that case, the total I/O overhead is linear in the number of CUBE BY attributes. We demonstrate with an implementation that the CPU cost of our algorithm is dominated by the I/O cost for sparse relations. ‘The research of Kenneth A.