Certification of Computational Results

We describe a conceptually novel and powerful technique to achieve fault detection and fault tolerance in hardware and software systems. When used for software fault detection, this technique uses time and software redundancy and can be outlined as follows. In the initial phase, a program is run to solve a problem and store the result. In addition, this program leaves behind a trail of data, which we call a certification trail. In the second phase, another program is run which solves the original problem again. This program, however, has access to the certification trail left by the first program. Because the certification trail is available, the second phase can be performed by a less complex program and can execute more quickly. In the final phase, the two results are compared; if they agree, the results are accepted as correct, and otherwise an error is indicated. An essential aspect of this approach is that the second program must always generate either an error indication or a correct output, even when the certification trail it receives from the first program is incorrect. We formalize the certification trail approach to fault tolerance and illustrate realizations of it by considering algorithms for the following problems: convex hull, sorting, and shortest path. We compare the certification trail approach to other approaches to fault tolerance.
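The three phases outlined above can be sketched for the sorting problem. This is a minimal illustration under stated assumptions, not the paper's own construction: the abstract does not say what the trail for sorting contains, so the sketch assumes the trail is the permutation of input indices produced by the first phase. The names `first_phase`, `second_phase`, `certified_sort`, and `CertificationError` are hypothetical. The second phase runs in linear time and either returns a correctly sorted output or raises an error, even if the trail it is given is wrong.

```python
from typing import List, Sequence, Tuple


class CertificationError(Exception):
    """Raised when the second phase detects a faulty trail or result."""


def first_phase(data: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Solve the problem and leave a trail.

    Assumption: the trail is the list of input indices in sorted order.
    """
    trail = sorted(range(len(data)), key=lambda i: data[i])
    result = [data[i] for i in trail]
    return result, trail


def second_phase(data: Sequence[int], trail: Sequence[int]) -> List[int]:
    """Solve the problem again in linear time, guided by the trail.

    The trail is never trusted: it must be a permutation of the input
    indices, and the output it induces must be in nondecreasing order.
    """
    n = len(data)
    if len(trail) != n:
        raise CertificationError("trail has the wrong length")
    seen = [False] * n
    output: List[int] = []
    for idx in trail:
        if not isinstance(idx, int) or not 0 <= idx < n or seen[idx]:
            raise CertificationError("trail is not a permutation")
        seen[idx] = True
        output.append(data[idx])
    for a, b in zip(output, output[1:]):
        if a > b:
            raise CertificationError("trail does not yield sorted output")
    return output


def certified_sort(data: Sequence[int]) -> List[int]:
    """Final phase: accept the result only if both phases agree."""
    result, trail = first_phase(data)
    checked = second_phase(data, trail)
    if result != checked:
        raise CertificationError("first and second phase results disagree")
    return result
```

For example, `certified_sort([3, 1, 2])` returns `[1, 2, 3]`, while calling `second_phase([3, 1, 2], [0, 1, 2])` with a corrupted trail raises `CertificationError` rather than returning a wrong answer.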
