The Design of the Borealis Stream Processing Engine

Borealis is a second-generation distributed stream processing engine that is being developed at Brandeis University, Brown University, and MIT. Borealis inherits core stream processing functionality from Aurora [14] and distribution functionality from Medusa [51]. Borealis modifies and extends both systems in non-trivial and critical ways to provide advanced capabilities that are commonly required by newly emerging stream processing applications. In this paper, we outline the basic design and functionality of Borealis. Through sample real-world applications, we motivate the need for dynamically revising query results and modifying query specifications. We then describe how Borealis addresses these challenges through an innovative set of features, including revision records, time travel, and control lines. Finally, we present a highly flexible and scalable QoS-based optimization model that operates across server and sensor networks, and a new fault-tolerance model with flexible consistency-availability trade-offs.

[1] Robert G. Gallager, et al. A Minimum Delay Routing Algorithm Using Distributed Computation, 1977, IEEE Trans. Commun.

[2] Peter M G Apers, et al. Data allocation in distributed database systems, 1988, TODS.

[3] Jennifer Widom, et al. A Syntax and Semantics for Set-Oriented Production Rules in Relational Database Systems (Extended Abstract), 1989, ACM SIGMOD Conference.

[4] Tad Hogg, et al. Spawn: A Distributed Computational Economy, 1992, IEEE Trans. Software Eng.

[5] Mukesh Singhal, et al. Load distributing for locally distributed systems, 1992, Computer.

[6] Jeff Magee, et al. Scalable, adaptive load sharing for distributed systems, 1993, IEEE Parallel & Distributed Technology: Systems & Applications.

[7] Thierry Coupaye, et al. Active rules for the software engineering platform GOODSTEP, 1993.

[8] Anthony P. Reeves, et al. Strategies for Dynamic Load Balancing on Highly Parallel Computers, 1993, IEEE Trans. Parallel Distributed Syst.

[9] Inderpal Singh Mumick, et al. The Stanford Data Warehousing Project, 1995.

[10] Raymond Reiter, et al. On Specifying Database Updates, 1995, J. Log. Program.

[11] Dennis Shasha, et al. The dangers of replication and a solution, 1996, SIGMOD '96.

[12] Michael Stonebraker, et al. Mariposa: a wide-area distributed database system, 1996, The VLDB Journal.

[13] Eric Simon, et al. The A-RDL System, 1996, Active Database Systems: Triggers and Rules For Advanced Database Processing.

[14] Shahram Ghandeharizadeh, et al. Heraclitus: elevating deltas to be first-class citizens in a database programming language, 1996, TODS.

[15] Minos N. Garofalakis, et al. Multi-dimensional resource scheduling for parallel queries, 1996, SIGMOD '96.

[16] Donald F. Ferguson, et al. Economic models for allocating resources in computer systems, 1996.

[17] David R. Karger, et al. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web, 1997, STOC '97.

[18] Sushil Jajodia, et al. An adaptive data replication algorithm, 1997, TODS.

[19] Helen J. Wang, et al. Online aggregation, 1997, SIGMOD '97.

[20] Peter J. Haas, et al. Interactive data Analysis: The Control Project, 1999, Computer.

[21] Norman C. Hutchinson, et al. Deciding when to forget in the Elephant file system, 1999, SOSP.

[22] Joseph M. Hellerstein, et al. Eddies: continuously adaptive query processing, 2000, SIGMOD '00.

[23] Joseph M. Hellerstein, et al. Online dynamic reordering, 2000, The VLDB Journal.

[24] Eric A. Brewer, et al. Lessons from Giant-Scale Services, 2001, IEEE Internet Comput.

[25] Michael Stonebraker, et al. Monitoring Streams - A New Class of Data Management Applications, 2002, VLDB.

[26] Joseph M. Hellerstein, et al. Partial results for online query processing, 2002, SIGMOD '02.

[27] Pavlin Radoslavov, et al. Topology-informed Internet replica placement, 2002, Comput. Commun.

[28] Michael Stonebraker, et al. Aurora: a data stream management system, 2003, SIGMOD '03.

[29] Michael Stonebraker, et al. A Comparison of Stream-Oriented High-Availability Algorithms, 2003.

[30] R. Motwani, et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System, 2003, CIDR.

[31] Rajeev Motwani, et al. Chain: operator scheduling for memory minimization in data stream systems, 2003, SIGMOD '03.

[32] Michael Stonebraker, et al. Load Shedding in a Data Stream Manager, 2003, VLDB.

[33] Michael Stonebraker, et al. The Aurora and Medusa Projects, 2003, IEEE Data Eng. Bull.

[34] Frederick Reiss, et al. TelegraphCQ: continuous dataflow processing, 2003, SIGMOD '03.

[35] Wei Hong, et al. The design of an acquisitional query processor for sensor networks, 2003, SIGMOD '03.

[36] Jennifer Widom, et al. CQL: A Language for Continuous Queries over Streams and Relations, 2003, DBPL.

[37] Michael Stonebraker, et al. Operator Scheduling in a Data Stream Manager, 2003, VLDB.

[38] Ying Xing, et al. Scalable Distributed Stream Processing, 2003, CIDR.

[39] Vipin Kumar, et al. Graph partitioning for high-performance scientific simulations, 2003.

[40] David Maier, et al. Exploiting Punctuation Semantics in Continuous Data Streams, 2003, IEEE Trans. Knowl. Data Eng.

[41] Jennifer Widom, et al. STREAM: The Stanford Stream Data Manager, 2003, IEEE Data Eng. Bull.

[42] David Maier, et al. Applying Punctuation Schemes to Queries Over Continuous Data Streams, 2003, IEEE Data Eng. Bull.

[43] Abhinandan Das, et al. Approximate join processing over data streams, 2003, SIGMOD '03.

[44] Frederick Reiss, et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World, 2003, CIDR.

[45] Michael Stonebraker, et al. Linear Road: A Stream Data Management Benchmark, 2004, VLDB.

[46] Michael Stonebraker, et al. Contract-Based Load Management in Federated Distributed Systems, 2004, NSDI.

[47] Michael Stonebraker, et al. Retrospective on Aurora, 2004, The VLDB Journal.

[48] Rajeev Motwani, et al. Load shedding for aggregation queries over data streams, 2004, Proceedings. 20th International Conference on Data Engineering.

[49] Eric A. Brewer, et al. Highly available, fault-tolerant, parallel dataflows, 2004, SIGMOD '04.

[50] Jennifer Widom, et al. Flexible time management in data stream systems, 2004, PODS.

[51] Michael Stonebraker, et al. Availability-Consistency Trade-Offs in a Fault-Tolerant Stream Processing System, 2004.

[52] Michael Stonebraker, et al. High-availability algorithms for distributed stream processing, 2005, 21st International Conference on Data Engineering (ICDE'05).

[53] Ying Xing, et al. Dynamic load distribution in the Borealis stream processor, 2005, 21st International Conference on Data Engineering (ICDE'05).

[54] Qiang Chen, et al. Aurora: a new model and architecture for data stream management, 2006.