A runtime fault survival method for deployed software during production runs

Runtime memory faults during production runs deserve closer attention because they severely affect system availability. This paper proposes a method, RemOte runtime Protection for High‐risk Error (ROPHE), for mitigating memory faults during production runs of deployed software, so that the system keeps operating normally until patches that fix the faults are delivered. The method also improves debugging efficiency by providing accurate on‐site fault information that developers can use to release timely patches. Its core consists of information tagging, which identifies runtime faults, and a fault survival algorithm, which applies differentiated fault mitigation according to the runtime state. We implemented ROPHE on a Linux 2.6 platform and conducted an empirical study of representative Linux applications. The results show that the applications' own fault‐handling rate averages 35.75%, whereas ROPHE raises the fault‐survival rate to an average of 91.94%. Specifically, the fault‐handling rates of the applications ranged widely from 7.32% to 62.96%, while ROPHE provided fault‐survival rates in the relatively narrow range of 82.35–97.44%. These results indicate that the proposed method provides a similar level of reliability for all applications regardless of their individual fault‐handling capacity. Copyright © 2016 John Wiley & Sons, Ltd.
