Beyond simple aggregates: indexing for summary queries

Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieves a collection of records from the database that match the query's conditions, while the latter returns an aggregate, such as count, sum, average, or max (min), of a particular attribute of these records. Aggregation queries are especially useful in business intelligence and data analysis applications where users are interested not in the actual records, but some statistics of them. They can also be executed much more efficiently than reporting queries, by embedding properly precomputed aggregates into an index. However, reporting and aggregation queries provide only two extremes for exploring the data. Data analysts often need more insight into the data distribution than what those simple aggregates provide, and yet certainly do not want the sheer volume of data returned by reporting queries. In this paper, we design indexing techniques that allow for extracting a statistical summary of all the records in the query. The summaries we support include frequent items, quantiles, various sketches, and wavelets, all of which are of central importance in massive data analysis. Our indexes require linear space and extract a summary with the optimal or near-optimal query cost.

[1]  Hanan Samet,et al.  Foundations of multidimensional and metric data structures , 2006, Morgan Kaufmann series in data management systems.

[2]  Chris Jermaine,et al.  Scalable approximate query processing with the DBO engine , 2007, SIGMOD '07.

[3]  Hamid Pirahesh,et al.  Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals , 1996, Data Mining and Knowledge Discovery.

[4]  Helen J. Wang,et al.  Online aggregation , 1997, SIGMOD '97.

[5]  Allan Grønlund Jørgensen,et al.  Range selection and median: tight cell probe lower bounds and adaptive data structures , 2011, SODA '11.

[6]  Jeffrey Scott Vitter,et al.  Wavelet-based histograms for selectivity estimation , 1998, SIGMOD '98.

[7]  Sanjeev Khanna,et al.  Space-efficient online computation of quantile summaries , 2001, SIGMOD '01.

[8]  Sudipto Guha,et al.  XWAVE: optimal and approximate extended wavelets , 2004, VLDB 2004.

[9]  Pankaj K. Agarwal,et al.  Geometric Range Searching and Its Relatives , 2007 .

[10]  Mark H. Overmars,et al.  The Design of Dynamic Data Structures , 1987, Lecture Notes in Computer Science.

[11]  Noga Alon,et al.  The Space Complexity of Approximating the Frequency Moments , 1999 .

[12]  Divyakant Agrawal,et al.  An integrated efficient solution for computing frequent and top-k elements in data streams , 2006, TODS.

[13]  Vladimir Vapnik,et al.  Chervonenkis: On the uniform convergence of relative frequencies of events to their probabilities , 1971 .

[14]  Peter Sanders,et al.  Towards optimal range medians , 2011, Theor. Comput. Sci..

[15]  S. Muthukrishnan,et al.  Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries , 2001, VLDB.

[16]  Amit Kumar,et al.  Wavelet synopses for general error metrics , 2005, TODS.

[17]  Florin Rusu,et al.  Pseudo-random number generation for sketch-based estimations , 2007, TODS.

[18]  Peter J. Haas,et al.  On synopses for distinct-value estimation under multiset operations , 2007, SIGMOD '07.

[19]  Norbert Zeh,et al.  Ordered and unordered top-K range reporting in large data sets , 2011, SODA '11.

[20]  Jeffrey Scott Vitter,et al.  Approximate computation of multidimensional aggregates of sparse data using wavelets , 1999, SIGMOD '99.

[21]  Xuemin Lin,et al.  Selecting Stars: The k Most Representative Skyline Operator , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[22]  Sudipto Guha,et al.  Fast, small-space algorithms for approximate histogram maintenance , 2002, STOC '02.

[23]  Graham Cormode,et al.  An improved data stream summary: the count-min sketch and its applications , 2004, J. Algorithms.

[24]  Jian Pei,et al.  Distance-Based Representative Skyline , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[25]  Jayadev Misra,et al.  Finding Repeated Elements , 1982, Sci. Comput. Program..

[26]  Noga Alon,et al.  Tracking join and self-join sizes in limited storage , 1999, PODS '99.

[27]  Jeffrey Scott Vitter,et al.  Dynamic Maintenance of Wavelet-Based Histograms , 2000, VLDB.

[28]  Gurmeet Singh Manku,et al.  Approximate counts and quantiles over sliding windows , 2004, PODS.

[29]  Divesh Srivastava,et al.  On computing correlated aggregates over continual data streams , 2001, SIGMOD '01.