Query-Aware Partitioning for Monitoring Massive Network Data Streams

Data stream management systems (DSMSs) are gaining acceptance for applications that need to process very large volumes of data in real time. The load generated by such applications frequently far exceeds the computational capabilities of a single centralized server. In particular, a single-server instance of our DSMS, Gigascope, cannot keep up with the processing demands of the new OC-768 networks, which can generate more than 100 million packets per second. In this paper, we explore a mechanism for the distributed processing of very high speed data streams. Existing distributed DSMSs employ two mechanisms for distributing the load across the participating machines: partitioning of the query execution plans and partitioning of the input data stream in a query-independent fashion. However, for a large class of queries, both approaches fail to reduce the load compared to a centralized system, and can even increase it. We present an alternative approach, query-aware data stream partitioning, that allows for more efficient scaling. We have developed methods to analyze any given query node to determine a partitioning strategy, to reconcile the potentially conflicting requirements that different queries in a query set place on partitioning, and to choose an optimal partitioning that minimizes overall communication costs.
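The contrast between query-independent and query-aware partitioning can be illustrated with a minimal sketch. It is not Gigascope code: the packet fields (`src`, `dst`), the helper names, and the rule that a partitioning is compatible with a grouping query when it hashes on a subset of the group-by fields are all assumptions made for illustration. Under that assumption, hashing on `src` keeps every (`src`, `dst`) group on a single host, so per-host counts are already final, whereas round-robin scatters groups across hosts and forces partial results to be shipped and merged.

```python
from collections import Counter

def partition_by_key(packets, key_fields, n_hosts):
    """Query-aware routing (hypothetical helper): hash the chosen key fields."""
    hosts = [[] for _ in range(n_hosts)]
    for pkt in packets:
        key = tuple(pkt[f] for f in key_fields)
        hosts[hash(key) % n_hosts].append(pkt)
    return hosts

def round_robin(packets, n_hosts):
    """Query-independent routing: ignore packet contents entirely."""
    hosts = [[] for _ in range(n_hosts)]
    for i, pkt in enumerate(packets):
        hosts[i % n_hosts].append(pkt)
    return hosts

def count_per_group(packets, group_fields):
    """A simple grouping query: number of packets per group."""
    return Counter(tuple(p[f] for f in group_fields) for p in packets)

def groups_split_across_hosts(hosts, group_fields):
    """Number of groups whose packets landed on more than one host."""
    seen = Counter()
    for host_packets in hosts:
        for group in count_per_group(host_packets, group_fields):
            seen[group] += 1
    return sum(1 for n in seen.values() if n > 1)

# Synthetic packets with integer-coded addresses (illustrative data only).
packets = [{"src": i % 5, "dst": i % 3} for i in range(60)]
group_by = ("src", "dst")

aware = partition_by_key(packets, ("src",), n_hosts=4)
blind = round_robin(packets, n_hosts=4)

split_aware = groups_split_across_hosts(aware, group_by)
split_blind = groups_split_across_hosts(blind, group_by)

# Either way the merged answer matches the centralized one; the difference
# is how many groups need cross-host merging to get there.
merged_aware = sum((count_per_group(h, group_by) for h in aware), Counter())
merged_blind = sum((count_per_group(h, group_by) for h in blind), Counter())
central = count_per_group(packets, group_by)
```

Here `split_aware` is zero because all packets of a group share the same `src` and therefore the same host, while `split_blind` is positive. The number of split groups is only a rough stand-in for the communication cost that a partitioning choice would aim to minimize.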
