Scalable Distributed Stream Processing

Stream processing fits a large class of new applications for which conventional DBMSs fall short. Because many stream-oriented systems are inherently geographically distributed and because distribution offers scalable load management and higher availability, future stream processing systems will operate in a distributed fashion. They will run across the Internet on computers typically owned by multiple cooperating administrative domains. This paper describes the architectural challenges facing the design of large-scale distributed stream processing systems, and discusses novel approaches for addressing load management, high availability, and federated operation issues. We describe two stream processing systems, Aurora* and Medusa, which are being designed to explore complementary solutions to these challenges. This paper discusses the architectural issues facing the design of large-scale distributed stream processing systems. We begin in Section 2 with a brief description of our centralized stream processing system, Aurora [4]. We then discuss two complementary efforts to extend Aurora to a distributed environment: Aurora* and Medusa. Aurora* assumes an environment in which all nodes fall under a single administrative domain. Medusa provides the infrastructure to support federated operation of nodes across administrative boundaries. After describing the architectures of these two systems in Section 3, we consider three design challenges common to both: infrastructures and protocols supporting communication amongst nodes (Section 4), load sharing in response to variable network conditions (Section 5), and high availability in the presence of failures (Section 6). We also discuss high-level policy specifications employed by the two systems in Section 7. For all of these issues, we believe that the push-based nature of stream-based applications not only raises new challenges but also offers the possibility of new domain-specific solutions.

[1]  Edward D. Lazowska,et al.  Adaptive load sharing in homogeneous distributed systems , 1986, IEEE Transactions on Software Engineering.

[2]  Fred Douglis,et al.  Process Migration in the Sprite Operating System , 1987, ICDCS.

[3]  K. Eric Drexler,et al.  Markets and computation: agoric open systems , 1988 .

[4]  Andreas Reuter,et al.  Transaction Processing: Concepts and Techniques , 1992 .

[5]  David J. DeWitt,et al.  Parallel database systems: the future of high performance database systems , 1992, CACM.

[6]  W. Zhu,et al.  Load Balancing and Workstation Autonomy on Amoeba , 1995 .

[7]  Witold Litwin,et al.  LH*—a scalable, distributed data structure , 1996, TODS.

[8]  Gustavo Alonso,et al.  Providing High Availability in Very Large Worklflow Management Systems , 1996, EDBT.

[9]  Luc Bouganim,et al.  Dynamic Load Balancing in Hierarchical Parallel Database Systems , 1996, VLDB.

[10]  Michael Stonebraker,et al.  Mariposa: a wide-area distributed database system , 1996, The VLDB Journal.

[11]  David R. Karger,et al.  Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web , 1997, STOC '97.

[12]  Srinivasan Seshan,et al.  An integrated congestion management architecture for Internet hosts , 1999, SIGCOMM '99.

[13]  Mehul A. Shah,et al.  Adaptive Query Processing: Technology in Evolution , 2000, IEEE Data Eng. Bull..

[14]  David J. DeWitt,et al.  NiagaraCQ: a scalable continuous query system for Internet databases , 2000, SIGMOD '00.

[15]  Dejan S. Milojicic,et al.  Process migration , 1999, ACM Comput. Surv..

[16]  Srinivasan Seshan,et al.  The Congestion Manager , 2001, RFC.

[17]  David R. Karger,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM '01.

[18]  Michael Stonebraker,et al.  Monitoring Streams - A New Class of Data Management Applications , 2002, VLDB.

[19]  Michael Stonebraker,et al.  Aurora: a new model and architecture for data stream management , 2003, The VLDB Journal.