Online aggregation provides continuous estimates of the final result of a computation while the processing is still running. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution, or can let the processing run to completion and obtain the exact result. In this demonstration, we introduce a general framework for parallel online aggregation in which estimation does not incur overhead on top of the actual processing. We define a generic interface for expressing any estimation model that completely abstracts the execution details. We design multiple sampling-based estimators suited for parallel online aggregation and implement them inside the framework. Demonstration participants are shown how estimates for general SQL aggregation queries over terabytes of TPC-H data are generated throughout the entire processing. Due to parallel execution, the estimate converges to the correct result in a matter of seconds, even for the most difficult queries. The behavior of the estimators is evaluated under different operating regimes of the distributed cluster used in the demonstration.
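To make the idea of a sampling-based estimator concrete, the following is a minimal sketch of a textbook estimator for a SUM aggregate, not the estimators or the interface implemented in the framework. It assumes the data are randomly shuffled across partitions, so that the tuples processed so far form a simple random sample without replacement, and it uses a normal (central limit theorem) approximation for the confidence bounds. The names `SumState` and `estimate_sum` are hypothetical and introduced only for illustration.

```python
import math
import random
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class SumState:
    """Per-partition sufficient statistics: count, sum, and sum of squares."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x: float) -> None:
        # x is the value a tuple contributes to the SUM
        # (0.0 if the tuple fails the query predicate).
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other: "SumState") -> "SumState":
        # States from different partitions combine by simple addition.
        return SumState(self.n + other.n,
                        self.total + other.total,
                        self.total_sq + other.total_sq)


def estimate_sum(state: SumState, population_size: int, confidence: float = 0.95):
    """Return (estimate, lower, upper) for the SUM over all population_size tuples."""
    n = state.n
    if n < 2:
        raise ValueError("need at least two sampled tuples")
    mean = state.total / n
    # Unbiased sample variance; clamp tiny negative values caused by rounding.
    sample_var = max(0.0, (state.total_sq - n * mean * mean) / (n - 1))
    estimate = population_size * mean
    # Finite population correction: the bounds collapse once all tuples are seen.
    fpc = 1.0 - n / population_size
    std_err = population_size * math.sqrt(sample_var / n * fpc)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return estimate, estimate - z * std_err, estimate + z * std_err


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.uniform(0.0, 100.0) for _ in range(20_000)]  # synthetic values
    partitions = [data[i::4] for i in range(4)]              # 4 simulated workers
    for seen in (100, 1_000, 5_000):                         # tuples seen per worker
        merged = SumState()
        for part in partitions:
            local = SumState()
            for x in part[:seen]:
                local.add(x)
            merged = merged.merge(local)
        est, lo, hi = estimate_sum(merged, len(data))
        print(f"seen={merged.n:6d} estimate={est:.1f} bounds=[{lo:.1f}, {hi:.1f}]")
    print(f"exact={sum(data):.1f}")
```

Because the state is just three running totals that merge by addition, each worker can maintain it alongside the normal aggregate computation, and the bounds shrink to the exact result once every tuple has been processed.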