A Cooperative, Self-Configuring High-Availability Solution for Stream Processing

We present a collaborative, self-configuring high-availability (HA) approach for stream processing that enables low-latency failure recovery while incurring little run-time overhead. Our approach relies on a novel fine-grained checkpointing model in which the query fragments at each server are backed up at multiple other servers and, when a failure occurs, recovered collectively (in parallel). In this paper, we first address the problem of determining the appropriate query fragments at each server. We then discuss how to choose, for each fragment, its backup server and its checkpoint schedule. We also introduce and analyze operator-specific delta-checkpointing techniques that reduce the overall HA cost. Finally, we quantify the benefits of our approach using results from our prototype implementation and a detailed simulator.
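
As an illustration of the general idea behind delta checkpointing, the following minimal Python sketch ships only the state entries that changed since the previous checkpoint to a backup. It is not the paper's algorithm: the class names `OperatorState` and `Backup`, the keyed-dictionary state model, and the dirty-key tracking are all assumptions made for this example.

```python
from typing import Any, Dict, Set, Tuple


class OperatorState:
    """Hypothetical keyed state of one stream operator (e.g. per-group aggregates)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._dirty: Set[str] = set()  # keys touched since the last checkpoint

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value
        self._dirty.add(key)

    def remove(self, key: str) -> None:
        if key in self._data:
            del self._data[key]
            self._dirty.add(key)

    def snapshot(self) -> Dict[str, Any]:
        """Full copy of the state, i.e. what a non-delta checkpoint would send."""
        return dict(self._data)

    def delta_checkpoint(self) -> Tuple[Dict[str, Any], Set[str]]:
        """Return (updates, deletes) covering only keys changed since the last call."""
        updates = {k: self._data[k] for k in self._dirty if k in self._data}
        deletes = {k for k in self._dirty if k not in self._data}
        self._dirty.clear()
        return updates, deletes


class Backup:
    """Hypothetical backup server holding a replica of one fragment's state."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}

    def apply(self, updates: Dict[str, Any], deletes: Set[str]) -> None:
        self.data.update(updates)
        for key in deletes:
            self.data.pop(key, None)  # tolerate keys the backup never saw


if __name__ == "__main__":
    state, backup = OperatorState(), Backup()
    state.put("a", 1)
    state.put("b", 2)
    backup.apply(*state.delta_checkpoint())  # first checkpoint carries everything
    state.put("a", 5)
    state.remove("b")
    backup.apply(*state.delta_checkpoint())  # second carries only the changes
    print(backup.data == state.snapshot())
```

In this sketch the cost of a checkpoint is proportional to the number of keys modified in the interval rather than to the total state size, which is the kind of saving a delta scheme is meant to provide.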
