Composite subset measures

Measures are numeric summaries of a collection of data records produced by applying aggregation functions. Summarizing a collection of subsets of a large dataset by computing a measure for each subset in the (typically user-specified) collection is a fundamental problem. The multidimensional data model, which treats records as points in a space defined by dimension attributes, offers a natural space of data subsets to be considered as summarization candidates, and traditional SQL and OLAP constructs, such as GROUP BY and CUBE, allow us to compute measures for subsets drawn from this space. However, GROUP BY only allows us to summarize a limited collection of subsets, and CUBE summarizes all subsets in this space. Further, they restrict the measure used to summarize a data subset to be a one-step aggregation, using functions such as SUM, of field values in the data records.

In this paper, we introduce composite subset measures, computed by aggregating not only data records but also the measures of other related subsets. We allow summarization of naturally related regions in the multidimensional space, offering more flexibility than either GROUP BY or CUBE in the choice of what data subsets to summarize. Thus, our framework allows more meaningful summaries to be computed for a targeted collection of data subsets.

We propose an algebra called AW-RA and an equivalent pictorial language called aggregation workflows. Aggregation workflows allow for intuitive expression of composite measure queries, and the underlying algebra is designed to facilitate efficient multiscan execution. We describe an evaluation framework based on multiple passes of sorting and scanning over the original dataset. In each pass, several measures are evaluated simultaneously, and dependencies between these measures and containment relationships between the underlying subsets of data are orchestrated to reduce the memory footprint of the computation. We present a performance evaluation that demonstrates the benefits of our approach.
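
To make the idea of a composite subset measure concrete, the following is a minimal Python sketch. It is not AW-RA, an aggregation workflow, or the sort-and-scan evaluation framework described above; the record layout (source, hour, bytes), the sample data, and the function names are all hypothetical and chosen only for illustration. The first step is a one-step aggregation over data records, of the kind GROUP BY supports. The second step aggregates the measures of related, finer-grained subsets rather than the records themselves.

```python
from collections import defaultdict

# Hypothetical records: (source, hour, bytes). Invented for illustration only.
records = [
    ("a", 0, 10), ("a", 0, 30), ("b", 0, 20),
    ("a", 1, 5), ("b", 1, 15), ("b", 1, 25), ("c", 1, 60),
]

def basic_measure(records):
    """One-step aggregation: SUM(bytes) for each (source, hour) subset."""
    totals = defaultdict(int)
    for source, hour, nbytes in records:
        totals[(source, hour)] += nbytes
    return dict(totals)

def composite_measure(per_source_hour):
    """Composite step: for each hour, average the per-source totals.

    The inputs are measures of the finer (source, hour) subsets,
    not the original data records.
    """
    grouped = defaultdict(list)
    for (_source, hour), total in per_source_hour.items():
        grouped[hour].append(total)
    return {hour: sum(vals) / len(vals) for hour, vals in grouped.items()}

per_source_hour = basic_measure(records)
avg_per_hour = composite_measure(per_source_hour)
```

Here each hour-level region is summarized from the measures of the (source, hour) regions it contains, which is the kind of containment relationship between subsets that the abstract refers to.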
