Improving Online Aggregation Performance for Skewed Data Distribution

Online aggregation is a commonly used technique for answering aggregation queries quickly with progressively refined approximate answers, each within an estimated confidence interval. However, we observe that low selectivity and an inappropriate sample proportion significantly degrade online aggregation performance when the data distribution is skewed. To overcome this problem, we propose a Partition-based Online Aggregation System called POAS. POAS uses partition and shuffle strategies to prune unneeded data efficiently, which reduces the adverse effect of low selectivity, and it draws samples (tuples) from the relevant partitions with a dynamic sample size so that the sample proportion is as close to appropriate as possible. Moreover, POAS applies statistical methods to compute estimates from the relevant partitions. We have implemented POAS and conducted an extensive experimental study on the TPC-H benchmark with skewed data distributions. Our results demonstrate the efficiency and effectiveness of POAS.
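The general idea of pruning partitions and estimating an aggregate from per-partition samples can be illustrated with a minimal sketch. This is not the POAS implementation: the function name `estimate_sum`, the in-memory partition layout, the fixed `sample_frac` (standing in for a dynamic sample size), and the use of a textbook stratified estimator with a normal-approximation confidence interval are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, variance


def estimate_sum(partitions, key_lo, key_hi, sample_frac=0.1,
                 confidence=0.95, seed=0):
    """Estimate SUM(value) over tuples with key_lo <= key <= key_hi.

    partitions: list of lists of (key, value) tuples (hypothetical layout).
    Returns (estimate, half_width) of an approximate confidence interval.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    total, var_total = 0.0, 0.0
    for part in partitions:
        if not part:
            continue
        keys = [k for k, _ in part]
        # Prune partitions whose key range cannot satisfy the predicate.
        if max(keys) < key_lo or min(keys) > key_hi:
            continue
        big_n = len(part)
        # Assumption: a fixed fraction per relevant partition, at least 2 tuples.
        n = min(big_n, max(2, math.ceil(sample_frac * big_n)))
        sample = rng.sample(part, n)
        # Tuples that fail the predicate contribute 0 to the sum.
        xs = [v if key_lo <= k <= key_hi else 0.0 for k, v in sample]
        total += big_n * (sum(xs) / n)
        s2 = variance(xs) if n > 1 else 0.0
        # Finite population correction: a fully scanned partition adds no error.
        var_total += big_n ** 2 * (1 - n / big_n) * s2 / n
    return total, z * math.sqrt(var_total)


if __name__ == "__main__":
    data = [
        [(k, float(k)) for k in range(0, 100)],
        [(k, 2.0 * k) for k in range(100, 200)],
        [(k, 1.0) for k in range(200, 300)],  # pruned for the query below
    ]
    est, half = estimate_sum(data, 50, 149, sample_frac=0.2)
    print(f"SUM is approximately {est:.1f} +/- {half:.1f}")
```

Raising `sample_frac` toward 1.0 shrinks the interval, and a full scan of the relevant partitions returns the exact sum with a zero-width interval. A real system would refine the estimate incrementally as more samples arrive rather than in a single pass.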
