Tolerating Transient Late-Timing Faults in Cloud-Based Real-Time Stream Processing

Real-time stream processing is a frequently deployed application within Cloud datacenters and is required to deliver high levels of performance and reliability. Numerous fault-tolerant approaches have been proposed to achieve this objective in the presence of crash failures. However, such systems struggle with transient late-timing faults, a fault class that is difficult to tolerate effectively and that manifests increasingly within large-scale distributed systems. Such faults pose a significant threat to the soft real-time execution of streaming applications in the presence of failures. This work proposes a fault-tolerant approach that uses QoS-aware data prediction to tolerate transient late-timing faults. At run-time, the approach determines the most effective data prediction algorithm for the QoS constraints imposed on a failed stream processor. We integrated our approach into Apache Storm; experimental results show that it reduces stream processor end-to-end execution time by 61% compared to other fault-tolerant approaches. The approach incurs 12% additional CPU utilization while reducing network usage by 44%.
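The run-time selection step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the three candidate predictors (last value, moving average, stride), their hypothetical per-prediction costs in milliseconds, the walk-forward error metric, and the names `PREDICTORS` and `select_predictor`. The idea shown is only that, given recent output values of a stalled processor and a latency budget derived from a QoS constraint, one picks the most accurate predictor that still fits the budget.

```python
from statistics import mean


def last_value(history):
    """Predict that the next value repeats the most recent one."""
    return history[-1]


def moving_average(history, window=3):
    """Predict the mean of the last `window` values."""
    return mean(history[-window:])


def stride(history):
    """Predict by extending the most recent difference (linear step)."""
    if len(history) < 2:
        return history[-1]
    return history[-1] + (history[-1] - history[-2])


# Hypothetical candidates: name -> (function, assumed cost per prediction in ms).
PREDICTORS = {
    "last_value": (last_value, 0.1),
    "moving_average": (moving_average, 0.3),
    "stride": (stride, 0.2),
}


def walk_forward_error(predict, history, warmup=2):
    """Mean absolute error when predicting each value from its prefix."""
    errors = [
        abs(predict(history[:i]) - history[i])
        for i in range(warmup, len(history))
    ]
    return mean(errors)


def select_predictor(history, latency_budget_ms):
    """Return the name of the lowest-error predictor within the budget.

    Returns None when no candidate is cheap enough for the budget.
    Ties on error are broken alphabetically by predictor name.
    """
    candidates = [
        (walk_forward_error(fn, history), name)
        for name, (fn, cost_ms) in PREDICTORS.items()
        if cost_ms <= latency_budget_ms
    ]
    if not candidates:
        return None
    return min(candidates)[1]


if __name__ == "__main__":
    recent_outputs = [1, 2, 3, 4, 5, 6]
    print(select_predictor(recent_outputs, latency_budget_ms=1.0))   # stride
    print(select_predictor(recent_outputs, latency_budget_ms=0.15))  # last_value
```

In this sketch a tighter budget forces a cheaper and possibly less accurate predictor, which is the trade-off a QoS-aware selection has to manage; a real system would measure predictor cost and accuracy rather than assume them.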
