Stormy: an elastic and highly available streaming service in the cloud

In recent years, new highly scalable storage systems have significantly contributed to the success of Cloud Computing. Systems like Dynamo or Bigtable have demonstrated their ability to handle tremendous amounts of data and to scale to a very large number of nodes. Although these systems are designed to store data, their fundamental architectural properties and the techniques they use (e.g., request routing, replication, and load balancing) can also be applied to data streaming systems. In this paper, we present Stormy, a distributed stream processing service for continuous data processing. Stormy is based on proven techniques from existing Cloud storage systems that are adapted to efficiently execute streaming workloads. The primary design focus lies in providing a scalable, elastic, and fault-tolerant framework for continuous data processing, while at the same time optimizing resource utilization and increasing cost efficiency. Stormy is able to process any kind of streaming workload, thus covering a wide range of use cases, from real-time data analytics to long-term data aggregation jobs.
