Indexing for summary queries

Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieve the collection of records in the database that match the query's conditions, while the latter return an aggregate, such as the count, sum, average, or max (min), of a particular attribute of these records. Aggregation queries are especially useful in business intelligence and data analysis applications, where users are interested not in the actual records but in statistics computed over them. They can also be executed much more efficiently than reporting queries, by embedding properly precomputed aggregates into an index. However, reporting and aggregation queries represent only two extremes for exploring the data. Data analysts often need more insight into the data distribution than simple aggregates provide, yet do not want the sheer volume of data returned by reporting queries. In this article, we design indexing techniques that allow for extracting a statistical summary of all the records that match the query. The summaries we support include frequent items, quantiles, and various sketches, all of which are of central importance in massive data analysis. Our indexes require linear space and extract a summary with optimal or near-optimal query cost. We illustrate the efficiency and usefulness of our designs through extensive experiments and a system demonstration.
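The idea of embedding precomputed summaries into an index can be illustrated with a minimal sketch. It is not the construction from this article: it assumes a plain in-memory binary tree over records sorted by key, stores a Misra-Gries frequent-items summary with at most k counters at every node, and answers a range query by merging the summaries of O(log n) nodes. It makes no attempt at the linear-space or optimal-query-cost bounds claimed above. The names `mg_merge` and `SummaryIndex` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def mg_merge(a, b, k):
    """Merge two Misra-Gries summaries, keeping at most k counters."""
    merged = Counter(a)
    merged.update(b)
    if len(merged) > k:
        # Subtract the (k+1)-th largest count from every counter
        # and keep only the counters that stay positive.
        cut = sorted(merged.values(), reverse=True)[k]
        merged = Counter({x: c - cut for x, c in merged.items() if c > cut})
    return merged


class SummaryIndex:
    """Illustrative index: a binary tree whose nodes hold summaries."""

    def __init__(self, records, k):
        # records: iterable of (key, item) pairs; k: counters per summary.
        self.k = k
        recs = sorted(records, key=lambda r: r[0])
        self.keys = [key for key, _ in recs]
        self.size = 1
        while self.size < max(len(recs), 1):
            self.size *= 2
        self.tree = [Counter() for _ in range(2 * self.size)]
        for i, (_, item) in enumerate(recs):
            self.tree[self.size + i] = Counter({item: 1})
        # Each internal node stores the merged summary of its two children.
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = mg_merge(
                self.tree[2 * node], self.tree[2 * node + 1], k
            )

    def query(self, lo, hi):
        """Return a summary of the items whose key lies in [lo, hi]."""
        left = bisect_left(self.keys, lo) + self.size
        right = bisect_right(self.keys, hi) + self.size
        out = Counter()
        while left < right:
            if left & 1:
                out = mg_merge(out, self.tree[left], self.k)
                left += 1
            if right & 1:
                right -= 1
                out = mg_merge(out, self.tree[right], self.k)
            left //= 2
            right //= 2
        return out
```

For example, `SummaryIndex(records, k=4).query(2, 5)` returns a `Counter` of estimated item frequencies among the records with keys between 2 and 5. Each estimate undercounts the true frequency by at most m/(k+1), where m is the number of records in the range, so any item occurring more than m/(k+1) times is guaranteed to appear in the result.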
