Remembrance of Streams Past: Overload-Sensitive Management of Archived Streams

This paper studies Data Stream Management Systems that combine real-time data streams with historical data, and hence access incoming streams and archived data simultaneously. A significant problem for these systems is the I/O cost of fetching historical data which inhibits processing of the live data streams. Our solution is to reduce the I/O cost for accessing the archive by retrieving only a reduced (summarized or sampled) version of the historical data. This paper does not propose new summarization or sampling techniques, but rather a framework in which multiple resolutions of summarization/sampling can be generated efficiently. The query engine can select the appropriate level of summarization to use depending on the resources currently available. The central research problem studied is whether to generate the multiple representations of archived data eagerly upon data-arrival, lazily at query-time, or in a hybrid fashion. Concrete techniques for each approach are presented, which are tied to a specific data reduction technique (random sampling). The tradeoffs among the three approaches are studied both analytically and experimentally.

[1]  Michael J. Franklin,et al.  Streaming Queries over Streaming Data , 2002, VLDB.

[2]  Surajit Chaudhuri,et al.  An overview of data warehousing and OLAP technology , 1997, SGMD.

[3]  Abraham Silberschatz,et al.  A Multi-Resolution Relational Data Model , 1992, VLDB.

[4]  Frederick Reiss,et al.  TelegraphCQ: Continuous Dataflow Processing for an Uncertain World , 2003, CIDR.

[5]  Sharad Mehrotra,et al.  Progressive approximate aggregate queries with a multi-resolution tree structure , 2001, SIGMOD '01.

[6]  Samuel Madden,et al.  Continuously adaptive continuous queries over streams , 2002, SIGMOD '02.

[7]  Joseph M. Hellerstein,et al.  Using state modules for adaptive query processing , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[8]  David J. DeWitt,et al.  NiagaraCQ: a scalable continuous query system for Internet databases , 2000, SIGMOD '00.

[9]  Michael Stonebraker,et al.  Load Shedding in a Data Stream Manager , 2003, VLDB.

[10]  Joseph M. Hellerstein,et al.  Informix under CONTROL: Online Query Processing , 2000, Data Mining and Knowledge Discovery.

[11]  Gerhard Weikum,et al.  The LHAM log-structured history data access method , 2000, The VLDB Journal.

[12]  Mark H. Overmars,et al.  The Design of Dynamic Data Structures , 1987, Lecture Notes in Computer Science.

[13]  Michael Stonebraker,et al.  Monitoring Streams - A New Class of Data Management Applications , 2002, VLDB.

[14]  Peter J. Haas,et al.  The New Jersey Data Reduction Report , 1997 .

[15]  Lukasz Golab,et al.  Issues in data stream management , 2003, SGMD.

[16]  David J. DeWitt,et al.  NiagaraCQ: a scalable continuous query system for Internet databases , 2000, SIGMOD 2000.