High performance robust computer systems

Although our society increasingly relies on computing systems for smooth, efficient operation, computer “errors” that interrupt our lives are commonplace. Better error and exception handling seems to be correlated with more reliable software systems [shelton00] [koopman99]. Unfortunately, robust handling of exceptional conditions is a rarity in modern software systems, and there are no signs that the situation is improving. This dissertation examines the central reasons why software systems are, in general, not robust, and presents methods of resolving each issue. Although it is commonly held that building robust code is impractical, we present methods of addressing common robustness failures in a simple, generic fashion. We develop uncomplicated checking mechanisms that detect and handle exceptional conditions before they can affect process or system state (preemptive detection). This gives a software system the information it needs to recover gracefully from the exceptional condition without the need for task restarts.
The perception that computing systems can be either robust or fast (but not both) is a myth perpetuated not only by a dearth of quantitative data, but also by an abundance of conventional wisdom whose truth is rooted in an era before modern superscalar processors. The advanced microarchitectural features of such processors are the key to building and understanding systems that are both fast and robust. This research provides an objective, quantitative analysis of the performance cost associated with making a software system highly robust. It develops methods by which the systems studied can be made robust for less than 5% performance overhead in nearly every case, and often much less.
Studies indicate that most programmers have an incomplete understanding of how to build software systems with robust exception handling, or even of the importance of good design with respect to handling errors and exceptional conditions [maxion98]. Those studies, while large in scope and thorough in analysis, contain data from students with little professional programming experience. This work presents data collected from professional programming teams that measured their expected exception handling performance against their achieved performance. The data indicates that, despite industry experience or specifications mandating robustness, some teams could not predict the robustness response of their software and did not build robust systems.

[1] Daniel P. Siewiorek, et al. Development of a benchmark to measure system robustness, 1993, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.

[2] Roy A. Maxion, et al. Improving software robustness with dependability cases, 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).

[3] Todd M. Austin, et al. Efficient detection of all pointer and array access errors, 1994, PLDI '94.

[4] Nancy G. Leveson, et al. An investigation of the Therac-25 accidents, 1993, Computer.

[5] Trevor N. Mudge, et al. Instruction fetching: Coping with code bloat, 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[6] Yi-Min Wang, et al. Xept: a software instrumentation method for exception handling, 1997, Proceedings The Eighth International Symposium on Software Reliability Engineering.

[7] Cristina V. Lopes, et al. A study on exception detection and handling using aspect-oriented programming, 2000, Proceedings of the 2000 International Conference on Software Engineering. ICSE 2000 the New Millennium.

[8] John B. Goodenough, et al. Exception handling: issues and a proposed notation, 1975, CACM.

[9] Frank Feather, et al. Fault-free performance validation of avionic multiprocessors, 1986.

[10] Daniel P. Siewiorek, et al. Measuring Software Dependability by Robustness Benchmarking, 1997, IEEE Trans. Software Eng.

[11] Hanspeter Mössenböck, et al. Zero-Overhead Exeption Handling Using Metaprogramming, 1997, SOFSEM.

[12] Narain H. Gehani, et al. Exceptional C or C with exceptions, 1992, Softw. Pract. Exp.

[13] Gurindar S. Sohi, et al. The use of multithreading for exception handling, 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[14] Philip Koopman, et al. The Exception Handling Effectiveness of POSIX Operating Systems, 2000, IEEE Trans. Software Eng.

[15] Ravishankar K. Iyer, et al. Measuring Fault Tolerance with the FTAPE Fault Injection Tool, 1995, MMB.

[16] Henry M. Levy, et al. Hardware and software support for efficient exception handling, 1994, ASPLOS VI.

[17] Cecília M. F. Rubira, et al. An exception handling mechanism for developing dependable object-oriented software based on a meta-level approach, 1999, Proceedings 10th International Symposium on Software Reliability Engineering (Cat. No.PR00443).

[18] John Paul Shen, et al. Completion time multiple branch prediction for enhancing trace cache performance, 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[19] Alan Eustace, et al. ATOM - A System for Building Customized Program Analysis Tools, 1994, PLDI.

[20] Robert Sedgewick, et al. Algorithms in C, 1990.

[21] Cecília M. F. Rubira, et al. An exception handling software architecture for developing fault-tolerant software, 2000, Proceedings. Fifth IEEE International Symposium on High Assurance Systems Engineering (HASE 2000).

[22] Brian Marick, et al. The craft of software testing, 1994.

[23] Michael R. Lyu. Software Fault Tolerance, 1995.

[24] Gregory R. Ganger, et al. Modeling and performance of MEMS-based storage devices, 2000, SIGMETRICS '00.

[25] Anup K. Ghosh, et al. An approach to testing COTS software for robustness to operating system exceptions and errors, 1999, Proceedings 10th International Symposium on Software Reliability Engineering (Cat. No.PR00443).

[26] Timothy Kong, et al. Efficient memory access checking, 1993, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.

[27] Calton Pu, et al. Optimistic incremental specialization: streamlining a commercial operating system, 1995, SOSP.

[28] Richard M. Fujimoto, et al. Proceedings of the 1997 Winter Simulation Conference, 1997.

[29] Alexander Romanovsky. An exception handling framework for N-version programming in object-oriented systems, 2000, Proceedings Third IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC 2000) (Cat. No. PR00607).

[30] Eric Rotenberg, et al. Trace cache: a low latency approach to high bandwidth instruction fetching, 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[31] Algirdas Avizienis, et al. The N-Version Approach to Fault-Tolerant Software, 1985, IEEE Transactions on Software Engineering.

[32] Thomas E. Hull, et al. Exception handling in scientific computing, 1988, TOMS.

[33] Timothy Kong, et al. Concurrent Detection of Software and Hardware Data-Access Faults, 1997, IEEE Trans. Computers.

[34] Henrique Madeira, et al. Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers, 1998, IEEE Trans. Software Eng.

[35] Daniel P. Siewiorek, et al. Fault Injection Experiments Using FIAT, 1990, IEEE Trans. Computers.

[36] Peter A. Buhr, et al. Advanced Exception Handling Mechanisms, 2000, IEEE Trans. Software Eng.

[37] Marc J. Balcer, et al. The category-partition method for specifying and generating fuctional tests, 1988, CACM.

[38] A. D. Swain, et al. Handbook of human-reliability analysis with emphasis on nuclear power plant applications. Final report, 1983.

[39] Barton P. Miller, et al. An Empirical Study of the Reliability of Operating System Utilities, 1989.

[40] Nina Edelweiss, et al. Workflow modeling: exception and failure handling representation, 1998, Proceedings SCCC'98. 18th International Conference of the Chilean Society of Computer Science (Cat. No.98EX212).

[41] Pattie Maes. Concepts and experiments in computational reflection, 1987, OOPSLA 1987.