Continuous sampling for online aggregation over multiple queries

In this paper, we propose an online aggregation system called COSMOS (Continuous Sampling for Multiple queries in an Online aggregation System) to process multiple aggregate queries efficiently. In COSMOS, a dataset is first scrambled so that a sequential scan of the dataset yields a stream of random samples for all queries. Moreover, COSMOS organizes queries into a dissemination graph to exploit the dependencies across queries. In this way, the aggregates of queries closer to the root (the source of the data flow) can potentially be used to compute the aggregates of descendant (dependent) queries. To generate the online aggregates for a node, COSMOS applies a statistical approach that combines the answers from its ancestor nodes. COSMOS also offers a partitioning strategy to further salvage intermediate answers. We have implemented COSMOS and conducted an extensive experimental study in PostgreSQL. Our results on the TPC-H benchmark show the efficiency and effectiveness of COSMOS.
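The scrambling idea can be illustrated with a minimal Python sketch. It is not the COSMOS implementation: the function names `scramble` and `online_avg`, the in-memory list of rows, and the large-sample (normal approximation) confidence interval are all assumptions made for illustration. The sketch shuffles a dataset once, then scans it sequentially so that every prefix of the scan acts as a random sample, and reports a running AVG estimate with an approximate 95% half-width.

```python
import math
import random


def scramble(rows, seed=0):
    """Return a randomly permuted copy of rows (hypothetical one-time scrambling step)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def online_avg(rows, predicate, value, z=1.96, report_every=1000):
    """Yield (rows_scanned, estimate, half_width) for AVG(value) over rows matching predicate.

    Assumes rows are already in random order, so each prefix is a random sample.
    The half-width uses a normal approximation and ignores the finite-population
    correction, so it is conservative near the end of the scan.
    """
    n = 0        # number of matching rows seen so far
    mean = 0.0   # running mean (Welford's algorithm)
    m2 = 0.0     # running sum of squared deviations from the mean
    for scanned, row in enumerate(rows, start=1):
        if predicate(row):
            n += 1
            x = value(row)
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        if scanned % report_every == 0 and n > 1:
            std_err = math.sqrt(m2 / (n - 1) / n)
            yield scanned, mean, z * std_err


if __name__ == "__main__":
    # Synthetic data, purely for illustration.
    data = [{"price": i % 100, "region": i % 4} for i in range(20000)]
    for scanned, est, hw in online_avg(
        scramble(data), lambda r: r["region"] == 0, lambda r: r["price"]
    ):
        print(f"{scanned:6d} rows scanned: AVG ~ {est:.2f} +/- {hw:.2f}")
```

Because the scan order is shared, several queries with different predicates could consume the same scrambled stream, which is the property the abstract relies on. How COSMOS combines ancestor answers in the dissemination graph is not specified here and is not modeled by this sketch.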
