Chain: operator scheduling for memory minimization in data stream systems

In many applications involving continuous data streams, data arrival is bursty and the data rate fluctuates over time. Systems that seek to give rapid or real-time query responses in such an environment must be prepared to deal gracefully with bursts in data arrival without compromising system performance. We discuss one strategy for processing bursty streams: adaptive, load-aware scheduling of query operators to minimize resource consumption during times of peak load. We show that the choice of an operator scheduling strategy can have a significant impact on run-time system memory usage. We then present Chain scheduling, an operator scheduling strategy for data stream systems that is near-optimal in minimizing run-time memory usage for any collection of single-stream queries involving selections, projections, and foreign-key joins with stored relations. Chain scheduling also performs well for queries with sliding-window joins over multiple streams, and for multiple queries of the above types. We provide a thorough experimental evaluation that demonstrates the potential benefits of Chain scheduling, compares it with competing scheduling strategies, and validates our analytical conclusions.
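To make the claim about scheduling and memory concrete, the following is a minimal sketch of a toy simulation. It is not the Chain algorithm itself, and every number in it is an assumption chosen for illustration: a two-operator pipeline in which a hypothetical operator O1 shrinks each unit-size tuple to 0.2 units and a hypothetical operator O2 consumes it, each operator step takes one time unit, and a burst of five tuples arrives at one tuple per time unit. The function name `peak_memory` and the policy labels are likewise invented here.

```python
def peak_memory(policy, burst_len=5, horizon=20):
    """Return the peak queue memory (in tuple units) of a toy two-operator pipeline.

    Assumed model, for illustration only: O1 turns a size-1 tuple into a
    size-0.2 tuple, O2 consumes a size-0.2 tuple, and the scheduler runs
    exactly one operator step per time unit.
    """
    q1 = 0  # tuples waiting at O1, each of size 1
    q2 = 0  # tuples waiting at O2, each of size 0.2
    peak_fifths = 0  # memory tracked in fifths of a tuple to stay in integers

    for t in range(horizon):
        # One tuple arrives per time unit while the burst lasts.
        if t < burst_len:
            q1 += 1

        # Memory is measured after the arrival and before processing.
        peak_fifths = max(peak_fifths, 5 * q1 + q2)

        if policy == "fifo":
            # Finish the oldest tuple first: run O2 whenever it has input.
            run_o1 = q1 > 0 and q2 == 0
        elif policy == "upstream_first":
            # Prefer O1, which frees 0.8 units per step versus 0.2 for O2.
            run_o1 = q1 > 0
        else:
            raise ValueError(f"unknown policy: {policy}")

        if run_o1:
            q1 -= 1
            q2 += 1
        elif q2 > 0:
            q2 -= 1

    return peak_fifths / 5


if __name__ == "__main__":
    for name in ("fifo", "upstream_first"):
        print(name, peak_memory(name))
```

Under these assumptions the FIFO policy peaks at 3.0 tuple units while the policy that favors the memory-reducing operator peaks at 1.8, even though both do the same total work. The sketch only illustrates why operator order matters during a burst; it says nothing about how Chain chooses its priorities.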
