Seer: A Lightweight Online Failure Prediction Approach

In [1], we present Seer, a lightweight online failure prediction approach that predicts the manifestation of failures at runtime, i.e., while the system is running and before the failures occur, so that preventive and/or protective measures can be taken proactively to improve software reliability. Seer differs from related approaches in that it collects information from inside program executions, which existing approaches generally avoid because of the typically excessive runtime overhead. Seer overcomes this issue by pushing substantial parts of the data collection task onto the hardware with the help of hardware performance counters (HPCs), CPU-resident counters that record various low-level events occurring on a CPU, such as the number of instructions executed and the number of branches taken. At a very high level, Seer operates in three steps. First, functions that can reliably distinguish failing executions from passing executions, called seer functions, are determined. Second, these functions are instrumented such that after every invocation of a seer function, a binary prediction (i.e., passing or failing) about the future of the execution is made. Third, the instrumented system is deployed, and the sequence of predictions made by the seer functions is analyzed at runtime using fixed-length sliding windows to predict the manifestation of failures.

We have evaluated Seer by conducting a series of experiments on three software systems in the presence of both single and multiple defects. At the lowest level of runtime overhead, Seer predicted the failures about 54% of the way through the executions (where the duration of an execution is measured as the number of function calls made in the execution) with an F-measure of 0.77 (computed by giving equal importance to precision and recall) and a runtime overhead of 1.98%, on average.
At the highest level of prediction accuracy, Seer predicted the failures about 56% of the way through the executions with an F-measure of 0.88 and a runtime overhead of 2.67%, on average. Furthermore, Seer performed significantly better than the other online failure prediction approaches used in the empirical studies.

One way we have been extending this line of work is by combining the low-level internal execution data collected by HPCs with high-level external data collected from outside executions, such as the number of processes and the CPU, memory, and network utilization, to further improve the quality of predictions. Another avenue we have been extensively investigating is using HPC-collected data in a related domain: detecting, at runtime, ongoing side-channel attacks [2], [3], [4], [5] against software implementations of cryptographic applications. One type of attack we are currently interested in is cache-based attacks, in which a spy process discovers a secret key processed by a cryptographic application by intentionally creating contention with the victim in a cache memory [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21]. One approach that we have had great success with monitors the contention in shared resources by using HPCs and issues warnings whenever the extent to which the victim process suffers from this contention reaches a suspicious level.
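The runtime analysis step of Seer described above, in which the sequence of binary predictions emitted by the seer functions is examined with fixed-length sliding windows, can be illustrated with a minimal sketch. The decision rule used here (raise an alarm once a full window contains at least a given number of "failing" predictions) is an assumption made for illustration, as are the function name `first_failure_alarm`, the parameters `window_size` and `min_failing`, and the sample prediction sequences; none of them are taken from [1].

```python
from collections import deque
from typing import Iterable, Optional


def first_failure_alarm(predictions: Iterable[bool],
                        window_size: int = 5,
                        min_failing: int = 3) -> Optional[int]:
    """Scan binary predictions (True = "failing") with a fixed-length
    sliding window and return the index of the prediction at which a
    failure is first predicted, or None if no alarm is raised.

    Assumed rule (hypothetical, for illustration only): an alarm is
    raised when a full window holds at least `min_failing` failing
    predictions.
    """
    # A deque with maxlen drops the oldest prediction automatically,
    # so it always holds the most recent `window_size` predictions.
    window = deque(maxlen=window_size)
    for index, predicted_failing in enumerate(predictions):
        window.append(predicted_failing)
        window_is_full = len(window) == window_size
        if window_is_full and sum(window) >= min_failing:
            return index
    return None


# Hypothetical prediction sequences, one entry per seer function invocation.
passing_run = [False, False, True, False, False, False, True, False]
failing_run = [False, False, True, False, True, True, False, True, True]

print(first_failure_alarm(passing_run))  # None
print(first_failure_alarm(failing_run))  # 5
```

Requiring several failing predictions within one window, rather than reacting to a single one, is one plausible way to tolerate occasional mispredictions by individual seer functions; the window length and the count then trade off how early a failure is predicted against how many false alarms are raised.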

[1]  Gary M. Weiss.  Timeweaver: a genetic algorithm for identifying predictive patterns in sequences of events , 1999 .

[2]  Onur Aciiçmez,et al.  A Vulnerability in RSA Implementations Due to Instruction Cache Analysis and Its Demonstration on OpenSSL , 2008, CT-RSA.

[3]  Bruno Cernuschi-Frías,et al.  A nonparametric nonstationary procedure for failure prediction , 2002, IEEE Trans. Reliab..

[4]  Daniel P. Siewiorek,et al.  Error log analysis: statistical modeling and heuristic trend analysis , 1990 .

[5]  Miroslaw Malek,et al.  Using Hidden Semi-Markov Models for Effective Online Failure Prediction , 2007, 2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007).

[6]  Daniel J. Bernstein,et al.  Cache-timing attacks on AES , 2005 .

[7]  Cemal Yilmaz.  Using Hardware Performance Counters for Fault Localization , 2010, 2010 Second International Conference on Advances in System Testing and Validation Lifecycle.

[8]  Onur Aciiçmez,et al.  Trace-Driven Cache Attacks on AES (Short Paper) , 2006, ICICS.

[9]  Jean-Pierre Seifert,et al.  A refined look at Bernstein's AES side-channel analysis , 2006, ASIACCS '06.

[10]  Kishor S. Trivedi,et al.  A methodology for detection and estimation of software aging , 1998, Proceedings Ninth International Symposium on Software Reliability Engineering (Cat. No.98TB100257).

[11]  Alessandro Orso,et al.  Applying classification techniques to remotely-collected program execution data , 2005, ESEC/FSE-13.

[12]  Gary McGraw,et al.  Exploiting Software: How to Break Code , 2004 .

[13]  K. C. Gross,et al.  Proactive detection of software aging mechanisms in performance critical computers , 2002, 27th Annual NASA Goddard/IEEE Software Engineering Workshop, 2002. Proceedings..

[14]  George Candea,et al.  Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization , 2005, Second International Conference on Autonomic Computing (ICAC'05).

[15]  Onur Aciiçmez,et al.  Cache Based Remote Timing Attack on the AES , 2007, CT-RSA.

[16]  Naomi Benger,et al.  Recovering OpenSSL ECDSA Nonces Using the FLUSH+RELOAD Cache Side-channel Attack , 2014, IACR Cryptol. ePrint Arch..

[17]  Miroslaw Malek,et al.  A survey of online failure prediction methods , 2010, CSUR.

[18]  Cemal Yilmaz,et al.  An Approach for Isolating the Sources of Information Leakage Exploited in Cache-Based Side-Channel Attacks , 2013, 2013 IEEE Seventh International Conference on Software Security and Reliability Companion.

[19]  Cemal Yilmaz,et al.  An Approach for Classifying Program Failures , 2010, 2010 Second International Conference on Advances in System Testing and Validation Lifecycle.

[20]  Joseph L. Hellerstein,et al.  Predictive algorithms in the management of computer systems , 2002, IBM Syst. J..

[21]  Marco Chiappetta,et al.  Real time detection of cache-based side-channel attacks using hardware performance counters , 2016, Appl. Soft Comput..

[22]  Gwan S. Choi,et al.  Error and failure analysis of a UNIX server , 1998, Proceedings Third IEEE International High-Assurance Systems Engineering Symposium (Cat. No.98EX231).

[23]  Hiroshi Miyauchi,et al.  Cryptanalysis of DES Implemented on Computers with Cache , 2003, CHES.

[24]  James M. Rehg,et al.  Active learning for automatic classification of software behavior , 2004, ISSTA '04.

[25]  Amit M. Paradkar,et al.  Time will tell , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[26]  Kishor S. Trivedi,et al.  A measurement-based model for estimation of resource exhaustion in operational software systems , 1999, Proceedings 10th International Symposium on Software Reliability Engineering (Cat. No.PR00443).

[27]  Luís Moura Silva,et al.  Deterministic Models of Software Aging and Optimal Rejuvenation Schedules , 2007, 2007 10th IFIP/IEEE International Symposium on Integrated Network Management.

[28]  Sebastian G. Elbaum,et al.  Anomalies as precursors of field failures , 2003, 14th International Symposium on Software Reliability Engineering, 2003. ISSRE 2003..

[29]  Joseph F. Murray,et al.  Improved disk-drive failure warnings , 2002, IEEE Trans. Reliab..

[30]  Adam A. Porter,et al.  Combining hardware and software instrumentation to classify program executions , 2010, FSE '10.

[31]  Onur Aciiçmez,et al.  New Results on Instruction Cache Attacks , 2010, CHES.

[32]  Miroslaw Malek,et al.  Predicting failures of computer systems: a case study for a telecommunication system , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[33]  Bruce Schneier,et al.  Side channel cryptanalysis of product ciphers , 2000 .

[34]  Dan Page,et al.  Theoretical Use of Cache Memory as a Cryptanalytic Side-Channel , 2002, IACR Cryptol. ePrint Arch..

[35]  Joseph Bonneau,et al.  Cache-Collision Timing Attacks Against AES , 2006, CHES.

[36]  Paul C. Kocher,et al.  Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems , 1996, CRYPTO.

[37]  Eric A. Brewer,et al.  Pinpoint: problem determination in large, dynamic Internet services , 2002, Proceedings International Conference on Dependable Systems and Networks.

[38]  Andreas Zeller,et al.  Why Programs Fail: A Guide to Systematic Debugging , 2005 .

[39]  Yuval Yarom,et al.  FLUSH+RELOAD: A High Resolution, Low Noise, L3 Cache Side-Channel Attack , 2014, USENIX Security Symposium.

[40]  Rui Abreu,et al.  Lightweight Automatic Error Detection by Monitoring Collar Variables , 2012, ICTSS.

[41]  Nils Smeds OpenMP Application Tuning Using Hardware Performance Counters , 2002, WOMPAT.

[42]  Ram Chillarege,et al.  Early warning of failures through alarm analysis a case study in telecom voice mail systems , 2003, 14th International Symposium on Software Reliability Engineering, 2003. ISSRE 2003..

[43]  John T. Stasko,et al.  Visualization of test information to assist fault localization , 2002, ICSE '02.

[44]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[45]  Gregg Rothermel,et al.  Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and its Potential Impact , 2005, Empirical Software Engineering.

[46]  Heikki Mannila,et al.  TASA: Telecommunication Alarm Sequence Analyzer or how to enjoy faults in your network , 1996, Proceedings of NOMS '96 - IEEE Network Operations and Management Symposium.

[47]  Peter Zoeteweij,et al.  On the Performance of Fault Screeners in Software Development and Deployment , 2008, ENASE.

[48]  Cheng-Zhong Xu,et al.  Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management , 2007, 2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007).

[49]  Bradley R. Schmerl,et al.  Architecture-Based Run-Time Fault Diagnosis , 2011, ECSA.

[50]  Greg Hamerly,et al.  Bayesian approaches to failure prediction for disk drives , 2001, ICML.

[51]  Ricardo Vilalta,et al.  Predicting rare events in temporal domains , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[52]  Shubhendu S. Mukherjee,et al.  Perturbation-based Fault Screening , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[53]  Joseph F. Murray,et al.  Hard drive failure prediction using non-parametric statistical methods , 2003 .

[54]  Bradley R. Schmerl,et al.  Diagnosing architectural run-time failures , 2013, 2013 8th International Symposium on Software Engineering for Adaptive and Self-Managing Systems (SEAMS).

[55]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[56]  David Leon,et al.  Finding failures by cluster analysis of execution profiles , 2001, Proceedings of the 23rd International Conference on Software Engineering. ICSE 2001.

[57]  Haw Ching Yang,et al.  Application Cluster Service Scheme for Near-Zero-Downtime Services , 2005, Proceedings of the 2005 IEEE International Conference on Robotics and Automation.

[58]  Dorothy M. Andrews,et al.  A Methodology for Analysis of Failure Prediction Data , 1985, RTSS.

[59]  Kenny C. Gross,et al.  Advanced pattern recognition for detection of complex software aging phenomena in online transaction processing servers , 2002, Proceedings International Conference on Dependable Systems and Networks.

[60]  Ying Chen,et al.  A Rough Wavelet Network Model with Genetic Algorithm and its Application to Aging Forecasting of Application Server , 2007, 2007 International Conference on Machine Learning and Cybernetics.

[61]  Gorka Irazoqui Apecechea,et al.  Wait a Minute! A fast, Cross-VM Attack on AES , 2014, RAID.

[62]  Joseph L. Hellerstein,et al.  An approach to predictive detection for service management , 1999, Integrated Network Management VI. Distributed Management for the Networked Millennium. Proceedings of the Sixth IFIP/IEEE International Symposium on Integrated Network Management. (Cat. No.99EX302).

[63]  Michael I. Jordan,et al.  Bug isolation via remote program sampling , 2003, PLDI.

[64]  Vittorio Zaccaria,et al.  AES power attack based on induced cache miss and countermeasure , 2005, International Conference on Information Technology: Coding and Computing (ITCC'05) - Volume II.

[65]  Paul C. Kocher,et al.  Differential Power Analysis , 1999, CRYPTO.

[66]  Andreas Zeller,et al.  Lightweight Defect Localization for Java , 2005, ECOOP.

[67]  Miroslaw Malek,et al.  Prediction-Based Software Availability Enhancement , 2005, Self-star Properties in Complex Information Systems.

[68]  Rogério de Lemos,et al.  Architecture-based resilience evaluation for self-adaptive systems , 2013, Computing.

[69]  Kishor S. Trivedi,et al.  An approach for estimation of software aging in a Web server , 2002, Proceedings International Symposium on Empirical Software Engineering.

[70]  Bojan Cukic,et al.  Software aging and multifractality of memory resources , 2003, 2003 International Conference on Dependable Systems and Networks, 2003. Proceedings..

[71]  Bin Wang,et al.  Automated support for classifying software failure reports , 2003, 25th International Conference on Software Engineering, 2003. Proceedings..

[72]  Bradley R. Schmerl,et al.  Rainbow: Architecture-Based Self-Adaptation with Reusable Infrastructure , 2004, Computer.

[73]  Onur Aciiçmez,et al.  Yet another MicroArchitectural Attack: exploiting I-Cache , 2007, CSAW '07.

[74]  David Garlan,et al.  Rainbow: architecture-based self-adaptation with reusable infrastructure , 2004 .

[75]  Adi Shamir,et al.  Efficient Cache Attacks on AES, and Countermeasures , 2010, Journal of Cryptology.

[76]  Carla E. Brodley,et al.  Predictive application-performance modeling in a computational grid environment , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).

[77]  Anand Sivasubramaniam,et al.  BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[78]  Cemal Yilmaz,et al.  Seer: A Lightweight Online Failure Prediction Approach , 2017, IEEE Transactions on Software Engineering.

[79]  Peter Zoeteweij,et al.  Automatic software fault localization using generic program invariants , 2008, SAC '08.

[80]  David Leon,et al.  Pursuing failure: the distribution of program failures in a profile space , 2001, ESEC/FSE-9.