Efficiently processing deterministic approximate aggregation query on massive data

In practical applications, aggregation is an important operation that returns statistical characterizations of a subset of a data set. On massive data, approximate aggregation is often preferable for its better timeliness and responsiveness. This paper focuses on deterministic approximate aggregation, which returns a running aggregate within a progressive deterministic error interval. Existing methods either return approximate results with fixed accuracy, return online approximate aggregates with probabilistic confidence intervals, or incur a high I/O cost on massive data. This paper proposes the LDA algorithm to compute deterministic approximate aggregates on massive data efficiently. LDA uses a hierarchically structured selection attribute lattice to distribute tuples and obtain a horizontal partitioning of the table. In each partition, each selection attribute is kept in a column file and each ranking attribute is transposed into bit-slices. Given a selection condition, only the relevant partitions are involved in computing the running aggregate. A compact storage scheme based on the Z-order space-filling curve is proposed to reduce the management cost of the partitions, and an error reduction method is devised to narrow the error interval while the running aggregate is computed. Extensive experimental results on synthetic and real data sets show that LDA has a significant performance advantage over the existing algorithms.
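
To illustrate how bit-slices can yield a running aggregate with a deterministic error interval, the following is a minimal sketch and not the LDA algorithm itself. It assumes non-negative integer values of a fixed bit width, reads the slices from the most significant bit downward, and bounds SUM over the selected rows after each slice. The names `to_bit_slices` and `progressive_sum` are hypothetical and do not come from the paper.

```python
from typing import Iterator, List, Tuple


def to_bit_slices(values: List[int], width: int) -> List[List[int]]:
    """Transpose non-negative integers into bit-slices, most significant first."""
    return [[(v >> b) & 1 for v in values] for b in range(width - 1, -1, -1)]


def progressive_sum(
    slices: List[List[int]], selected: List[bool]
) -> Iterator[Tuple[int, int]]:
    """Yield (lower, upper) bounds on SUM over the selected rows after each slice."""
    width = len(slices)
    count = sum(selected)  # number of rows that satisfy the selection condition
    lower = 0
    for i, bits in enumerate(slices):
        weight = 1 << (width - 1 - i)
        ones = sum(bit for bit, keep in zip(bits, selected) if keep)
        lower += ones * weight
        # Unread lower-order bits can add at most (weight - 1) per selected row.
        upper = lower + count * (weight - 1)
        yield lower, upper


values = [13, 7, 10, 2]                # 4-bit values (assumed example data)
selected = [True, False, True, True]   # rows 13, 10, 2 are selected; exact SUM is 25
bounds = list(progressive_sum(to_bit_slices(values, 4), selected))
for lo, hi in bounds:
    print(lo, hi)
```

Each interval is guaranteed to contain the exact answer, and it shrinks as more slices are read until it collapses to the exact sum after the last slice. The error reduction method and partition pruning described in the abstract are not modeled here.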
