Leveraging run time knowledge about event rates to improve memory utilization in wide area data stream filtering

The dQUOB system's conceptualization of data streams as a database, together with its SQL interface to those streams, gives users an intuitive way to think about their data needs in a large-scale application containing hundreds if not thousands of data streams. Experience with dQUOB has shown that more aggressive memory management is needed to achieve the scalability we desire. This paper addresses the problem with a two-part solution. The first part replaces the existing first-come first-served scheduling algorithm with an earliest job first algorithm, which we demonstrate yields better average service time. The second part is an introspection algorithm that sets and adapts the sizes of join windows in response to knowledge acquired at run time about event rates. In addition to the potential for significant improvements in memory utilization, the algorithm presented here also provides a means by which the user can reason about join window sizes. Wide-area measurements demonstrate the adaptive capability required by the introspection technique.
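The idea of sizing a join window from an observed event rate can be illustrated with a minimal sketch. This is not the dQUOB algorithm, whose details are not given here. It assumes that a user states a join window as a time interval, that the stream's rate is tracked with an exponentially weighted moving average, and that the window size in events is the rate multiplied by the interval. The names `RateEstimator`, `window_size`, `alpha`, `min_events`, and `max_events` are hypothetical.

```python
import math


class RateEstimator:
    """Hypothetical smoothed estimate of one stream's event rate (events/second)."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha      # weight given to the newest observation (assumed value)
        self.rate = None        # no estimate until two events have been seen
        self.last_ts = None     # timestamp of the previous event, in seconds

    def observe(self, ts):
        """Record an event arrival at time ts and return the updated rate."""
        if self.last_ts is not None:
            gap = ts - self.last_ts
            if gap > 0:
                instantaneous = 1.0 / gap
                if self.rate is None:
                    self.rate = instantaneous
                else:
                    self.rate = (self.alpha * instantaneous
                                 + (1 - self.alpha) * self.rate)
        self.last_ts = ts
        return self.rate


def window_size(rate, interval_s, min_events=1, max_events=10000):
    """Translate a time interval into a join window size in events.

    Assumption: the window should hold roughly the events that arrive in
    interval_s seconds, clamped to [min_events, max_events] to bound memory.
    """
    if rate is None:
        return min_events
    return max(min_events, min(max_events, math.ceil(rate * interval_s)))


est = RateEstimator(alpha=0.5)
for ts in (0.0, 0.5, 1.0):          # steady stream at 2 events/second
    est.observe(ts)
print(window_size(est.rate, 10))    # 20

est.observe(1.25)                   # a burst: the smoothed rate rises to 3.0
print(window_size(est.rate, 10))    # 30
```

Under these assumptions a slow stream is given a small buffer and a fast stream a larger one, and expressing the window as a time interval is one way a user could reason about its size.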
