Efficiently processing (p,ε)-approximate join aggregation on massive data

Abstract: Join aggregation is an important operation in database systems that returns aggregate information on the join of two or more tables. Compared with exact query processing, it is often a better choice to return an approximate result that satisfies a user-specified confidence interval in a much shorter response time. None of the previous works can efficiently process approximate join aggregation on massive data with arbitrary accuracy. This paper proposes a novel algorithm, pε-AJA ((p,ε)-Approximate Join Aggregation), to efficiently obtain an approximate join aggregate result with an arbitrary confidence interval. Two data structures with low space overhead, JRS and JPIPT, are presented. pε-AJA first uses JRS to return a quick response. If the approximate result computed from JRS does not satisfy the given confidence interval, JPIPT is used to obtain enough random join tuples. This paper presents a novel sampling algorithm that acquires a specified number of random JPIPT tuples, and proves its correctness. A tuple fetching method is proposed to retrieve the join tuples identified by the sampled JPIPT tuples in a one-pass sequential scan of the joined tables. The construction and maintenance algorithms for JPIPT and JRS are also provided. The experimental results show that pε-AJA achieves a speedup of 3 times to 2 orders of magnitude over the existing algorithms and runs 1 to 4 orders of magnitude faster than exact query processing.
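The two-stage idea in the abstract (answer from a small sample first, then draw more random join tuples only if the confidence interval is too wide) can be illustrated with a minimal sketch. It is not the paper's algorithm: JRS and JPIPT are not modeled, and an in-memory list `join_values` stands in as a hypothetical source of join-tuple values. The sketch also assumes that (p, ε) means a confidence level p and an absolute interval half-width ε, and it uses a standard central-limit-theorem interval for AVG; all function and parameter names are illustrative.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def ci_half_width(sample, p):
    """CLT-based half-width of a level-p confidence interval for the mean."""
    z = NormalDist().inv_cdf((1 + p) / 2)  # two-sided normal quantile
    return z * stdev(sample) / math.sqrt(len(sample))


def approximate_avg(join_values, p, eps, initial_size=100, growth=2, seed=0):
    """Enlarge a uniform random sample until the half-width is at most eps.

    join_values is a hypothetical stand-in for a source of random join
    tuples and must hold at least two values.
    Returns (estimate, half_width, sample_size).
    """
    rng = random.Random(seed)
    n = min(initial_size, len(join_values))
    while True:
        # Draw a fresh sample of size n (without replacement) each round.
        sample = rng.sample(join_values, n)
        estimate = fmean(sample)
        half = ci_half_width(sample, p)
        # Stop when the interval is tight enough or the data is exhausted.
        if half <= eps or n == len(join_values):
            return estimate, half, n
        n = min(n * growth, len(join_values))
```

A call such as `approximate_avg(values, p=0.95, eps=0.5)` returns the estimate together with the achieved half-width and the sample size that was needed. The first round plays the role of the quick response, and later rounds correspond to fetching additional random join tuples when the requested interval is not yet met.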
