Scalable Splitting of Massive Data Streams

Scalable execution of continuous queries over massive data streams often requires splitting input streams into parallel sub-streams over which query operators are executed in parallel. Automatic stream splitting is in general very difficult, as the optimal parallelization may depend on application semantics. To enable application-specific stream splitting, we introduce splitstream functions, in which the user specifies stream partitioning and replication non-procedurally. For high-volume streams, the stream splitting itself becomes a performance bottleneck. We introduce a cost model that estimates the performance of splitstream functions with respect to throughput and CPU usage. We implement parallel splitstream functions and relate experimental results to the cost model estimates. Based on the results, we propose a splitstream function called autosplit, which scales well for high degrees of parallelism and is robust to varying proportions of stream partitioning and replication. We show how user-defined parallelization using autosplit provides substantially improved scalability (L = 64) over previously published results for the Linear Road Benchmark.
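
To make the idea of partitioning and replication concrete, the following is a minimal single-process sketch in Python. It is not the paper's implementation or syntax: the function name `splitstream`, the parameters `route_fn` and `broadcast_fn`, and the tuple fields `kind`, `key`, and `v` are all assumptions chosen for illustration. Each tuple is either replicated to every sub-stream or routed to exactly one.

```python
from typing import Callable, Dict, Iterable, List

Tup = Dict[str, object]  # hypothetical tuple representation


def splitstream(
    stream: Iterable[Tup],
    n: int,
    route_fn: Callable[[Tup], int],
    broadcast_fn: Callable[[Tup], bool],
) -> List[List[Tup]]:
    """Split `stream` into `n` sub-streams (illustrative sketch only).

    Tuples for which broadcast_fn is true are replicated to all
    sub-streams; all other tuples are partitioned by route_fn modulo n.
    """
    outputs: List[List[Tup]] = [[] for _ in range(n)]
    for t in stream:
        if broadcast_fn(t):
            # Replication: every sub-stream receives the tuple.
            for out in outputs:
                out.append(t)
        else:
            # Partitioning: exactly one sub-stream receives the tuple.
            outputs[route_fn(t) % n].append(t)
    return outputs


# Hypothetical input: "control" tuples are replicated, "data" tuples
# are partitioned on their integer key.
stream = [
    {"kind": "data", "key": 0, "v": 1},
    {"kind": "data", "key": 1, "v": 2},
    {"kind": "control", "key": 0, "v": 3},
    {"kind": "data", "key": 2, "v": 4},
]
subs = splitstream(
    stream,
    2,
    route_fn=lambda t: t["key"],
    broadcast_fn=lambda t: t["kind"] == "control",
)
```

In this sketch a single loop performs all routing, which illustrates why the splitting step itself can become the bottleneck for high-volume streams: every input tuple passes through one node, and each replicated tuple costs `n` sends.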
