Answering ad hoc aggregate queries from data streams using prefix aggregate trees

AbstractIn some business applications such as trading management in financial institutions, it is required to accurately answer ad hoc aggregate queries over data streams. Materializing and incrementally maintaining a full data cube or even its compression or approximation over a data stream is often computationally prohibitive. On the other hand, although previous studies proposed approximate methods for continuous aggregate queries, they cannot provide accurate answers. In this paper, we develop a novel prefix aggregate tree (PAT) structure for online warehousing data streams and answering ad hoc aggregate queries. Often, a data stream can be partitioned into the historical segment, which is stored in a traditional data warehouse, and the transient segment, which can be stored in a PAT to answer ad hoc aggregate queries. The size of a PAT is linear in the size of the transient segment, and only one scan of the data stream is needed to create and incrementally maintain a PAT. Although the query answering using PAT costs more than the case of a fully materialized data cube, the query answering time is still kept linear in the size of the transient segment. Our extensive experimental results on both synthetic and real data sets illustrate the efficiency and the scalability of our design.

[1]  Paul S. Bradley,et al.  Compressed data cubes for OLAP aggregate query approximation on continuous dimensions , 1999, KDD '99.

[2]  Hongjun Lu,et al.  Condensed cube: an effective approach to reducing data cube size , 2002, Proceedings 18th International Conference on Data Engineering.

[3]  Graham Cormode,et al.  What's hot and what's not: tracking most frequent items dynamically , 2003, TODS.

[4]  Kenneth A. Ross,et al.  Fast Computation of Sparse Datacubes , 1997, VLDB.

[5]  Won Suk Lee,et al.  Finding recent frequent itemsets adaptively over online data streams , 2003, KDD '03.

[6]  Philip S. Yu,et al.  A Regression-Based Temporal Pattern Mining Scheme for Data Streams , 2003, VLDB.

[7]  Divesh Srivastava,et al.  On computing correlated aggregates over continual data streams , 2001, SIGMOD '01.

[8]  Stephen G. Warren,et al.  Edited synoptic cloud reports from ships and land stations over the globe , 1996 .

[9]  Alberto O. Mendelzon,et al.  Temporal Queries in OLAP , 2000, VLDB.

[10]  Yannis Sismanis,et al.  Dwarf: shrinking the PetaCube , 2002, SIGMOD '02.

[11]  Richard M. Karp,et al.  A simple algorithm for finding frequent elements in streams and bags , 2003, TODS.

[12]  Mark Sullivan,et al.  Quasi-cubes: exploiting approximations in multidimensional databases , 1997, SGMD.

[13]  Inderpal Singh Mumick,et al.  Maintenance of data cubes and summary tables in a warehouse , 1997, SIGMOD '97.

[14]  Jennifer Widom,et al.  On-line warehouse view maintenance , 1997, SIGMOD '97.

[15]  Nick Roussopoulos,et al.  Cubetree: organization of and bulk incremental updates on the data cube , 1997, SIGMOD '97.

[16]  Piotr Indyk,et al.  Maintaining stream statistics over sliding windows: (extended abstract) , 2002, SODA '02.

[17]  Piotr Indyk,et al.  Maintaining Stream Statistics over Sliding Windows , 2002, SIAM J. Comput..

[18]  Philip S. Yu,et al.  Mining Frequent Patterns in Data Streams at Multiple Time Granularities , 2002 .

[19]  Yixin Chen,et al.  Multi-Dimensional Regression Analysis of Time-Series Data Streams , 2002, VLDB.

[20]  Surajit Chaudhuri,et al.  An overview of data warehousing and OLAP technology , 1997, SGMD.

[21]  Sunita Sarawagi Indexing OLAP Data , 1997, IEEE Data Eng. Bull..

[22]  Rajeev Motwani,et al.  Approximate Frequency Counts over Data Streams , 2012, VLDB.

[23]  Rajeev Rastogi,et al.  Processing complex aggregate queries over data streams , 2002, SIGMOD '02.

[24]  Divesh Srivastava,et al.  Answering Queries with Aggregation Using Views , 1996, VLDB.

[25]  Jian Pei,et al.  Efficient computation of Iceberg cubes with complex measures , 2001, SIGMOD '01.

[26]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[27]  Hamid Pirahesh,et al.  Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals , 1996, Data Mining and Knowledge Discovery.

[28]  Jennifer Widom,et al.  Models and issues in data stream systems , 2002, PODS.

[29]  Jennifer Widom,et al.  Research problems in data warehousing , 1995, CIKM '95.

[30]  Raghu Ramakrishnan,et al.  Bottom-up computation of sparse and Iceberg CUBE , 1999, SIGMOD '99.

[31]  Jennifer Widom,et al.  Making views self-maintainable for data warehousing , 1996, Fourth International Conference on Parallel and Distributed Information Systems.

[32]  Werner Nutt,et al.  Rewriting aggregate queries using views , 1999, PODS.

[33]  Laks V. S. Lakshmanan,et al.  QC-trees: an efficient summary structure for semantic OLAP , 2003, SIGMOD '03.

[34]  Jennifer Widom,et al.  Continuous queries over data streams , 2001, SGMD.

[35]  Laks V. S. Lakshmanan,et al.  Quotient Cube: How to Summarize the Semantics of a Data Cube , 2002, VLDB.

[36]  Divesh Srivastava,et al.  Answering Queries Using Views. , 1999, PODS 1995.

[37]  Nick Roussopoulos,et al.  Cubetree: Organization of and Bulk Updates on the Data Cube , 1997, SIGMOD Conference.

[38]  Jeffrey D. Ullman,et al.  Implementing data cubes efficiently , 1996, SIGMOD '96.

[39]  Divesh Srivastava,et al.  Answering Queries Using Views. , 1999, PODS 1995.

[40]  Gurmeet Singh Manku,et al.  Approximate counts and quantiles over sliding windows , 2004, PODS.

[41]  B. Iyer,et al.  Data Cube Approximation and Histograms via Wavelets Extended Abstract , 1998 .

[42]  Divesh Srivastava,et al.  Finding Hierarchical Heavy Hitters in Data Streams , 2003, VLDB.

[43]  Kenneth A. Ross,et al.  Optimizing selections over datacubes , 2000, Proceedings. 12th International Conference on Scientific and Statistica Database Management.

[44]  V. S. Subrahmanian,et al.  Maintaining views incrementally , 1993, SIGMOD Conference.

[45]  Hongjun Lu,et al.  False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams , 2004, VLDB.

[46]  Xintao Wu,et al.  Using Loglinear Models to Compress Datacube , 2000, Web-Age Information Management.

[47]  Jeffrey F. Naughton,et al.  An array-based algorithm for simultaneous multidimensional aggregates , 1997, SIGMOD '97.