A novel, low-latency algorithm for multiple Group-By query optimization

Data summarization is essential for users to interact with data. Current state-of-the-art algorithms for optimizing its most general form, multiple Group By queries, have limited scalability. In this paper, we propose a novel algorithm, Top-Down Splitting, that scales to hundreds or even thousands of attributes and queries and quickly produces optimized query execution plans. We analyze the complexity of our algorithm and empirically evaluate its scalability and effectiveness through an experimental campaign. Results show that our algorithm is remarkably faster than alternatives from prior work, while generally producing better solutions. Ultimately, our algorithm reduces query execution time by up to 34% compared to un-optimized plans.
