Paper Information - Progressive Recovery of Correlated Failures in Distributed Stream Processing Engines

Progressive Recovery of Correlated Failures in Distributed Stream Processing Engines

Correlated failures in large-scale clusters have significant effects on systems’ availability, especially for streaming data applications that run continuously and require low processing latency. Most state-of-the-art distributed stream processing engines (DSPEs) adopt a blocking recovery paradigm, which, upon a correlated failure, blocks the progress of recovery until sufficient new resources for recovery are available. As the arrival of new resources is usually progressive, a blocking paradigm fails to minimize the recovery latency. To address this problem, we propose a progressive and query-centric recovery paradigm in which the recovery of the failed operators is carefully scheduled to progressively recover the outputs of queries as early as possible, based on the current availability of resources. In this work, we propose and implement a fault-tolerance framework that supports progressive recovery after correlated failures with minimum overhead during the system’s normal execution. We also formulate the new problem of recovery scheduling under correlated failures and design effective algorithms to optimize the recovery latency. The proposed methods are implemented on Apache Storm, and preliminary experiments are conducted to verify their validity.
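The contrast between blocking and progressive recovery can be illustrated with a small sketch. The abstract does not specify the paper's scheduling algorithms, so the code below is not the authors' method: it is a simple greedy heuristic under stated assumptions (each failed operator needs exactly one resource slot, slots arrive in discrete steps, and the query with the fewest still-failed operators is served first). The function name `progressive_recovery` and its parameters are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def progressive_recovery(
    queries: Dict[str, Set[str]],
    arrivals: List[int],
) -> Tuple[List[Tuple[int, str]], Dict[str, int]]:
    """Greedy illustration of query-centric progressive recovery.

    queries:  query name -> set of failed operators its output depends on
    arrivals: arrivals[t] = number of new resource slots at step t
              (assumption: one slot recovers exactly one operator)

    Returns (schedule, done_at), where schedule lists (step, operator) in
    recovery order and done_at maps each query to the step at which all
    of its failed operators were recovered.
    """
    recovered: Set[str] = set()
    schedule: List[Tuple[int, str]] = []
    done_at: Dict[str, int] = {}

    for step, slots in enumerate(arrivals):
        while True:
            # Mark every query whose failed operators are all recovered.
            for q, ops in queries.items():
                if q not in done_at and ops <= recovered:
                    done_at[q] = step
            # Operators still missing for each unfinished query.
            pending = {
                q: ops - recovered
                for q, ops in queries.items()
                if q not in done_at
            }
            if slots == 0 or not pending:
                break
            # Heuristic (assumption): serve the query closest to completion;
            # ties are broken by name to keep the result deterministic.
            target = min(pending, key=lambda q: (len(pending[q]), q))
            op = min(pending[target])
            recovered.add(op)
            schedule.append((step, op))
            slots -= 1

    return schedule, done_at


if __name__ == "__main__":
    queries = {"q1": {"a", "b", "c"}, "q2": {"c"}, "q3": {"c", "d"}}
    arrivals = [1, 0, 2, 1]  # slots becoming available at steps 0..3
    schedule, done_at = progressive_recovery(queries, arrivals)
    print(schedule)
    print(done_at)
```

In this toy input, four operators have failed and four slots arrive over four steps. A blocking scheme would wait until step 3, when all four slots have arrived, before any query resumes; the greedy sketch resumes `q2` at step 0 and `q3` at step 2, and only `q1` waits until step 3.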
