ACEDR: Automatic Compiler Error Detection and Recovery for COTS CPU and Caches

Demand is growing for more powerful processors in next-generation space missions such as communication and Earth observation. The challenge is to improve processor reliability under single event effects (SEEs) in orbit. We previously proposed a way of implementing any traditional software error detection and correction technique at the instruction level, covering both the CPU and the caches of commercial off-the-shelf (COTS) processors. This paper presents a novel method for evaluating that software protection, based on a theoretical model and software fault injection experiments that predict the reliability of the whole processing architecture. The fault injection experiments evaluate both the ability of the protection code to detect and recover from errors and the accuracy of the reliability models, by comparing the reliability predicted by the theoretical model with the reliability measured in the injection experiments. Automatic compiler error detection and recovery (ACEDR) improves system reliability by reducing the error rate caused by single event upsets (SEUs); in some benchmarks, the error rate was reduced to less than 1%. The approach was tested on two machines: an Intel Core i5-3470 and a Raspberry Pi 3. The overhead was less than 15% on the first processor and less than 17% on the second. The approach can also be ported to multiple high-level languages and can cover multiple instructions and data types.
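To make the idea of instruction-level detection and recovery, and of evaluating it by fault injection, more concrete, the following is a minimal sketch in Python. It is not the ACEDR implementation: it assumes one traditional technique (triplicated computation with a bitwise majority vote), models a single event upset as one bit flip in one copy of a 32-bit result, and uses hypothetical names (`flip_bit`, `vote`, `protected_add`, `campaign`) chosen only for illustration.

```python
import random

MASK32 = 0xFFFFFFFF  # assumption: results are modelled as 32-bit words


def flip_bit(value, bit):
    """Model a single event upset by flipping one bit of a 32-bit value."""
    return (value ^ (1 << bit)) & MASK32


def vote(a, b, c):
    """Bitwise majority vote over three redundant copies."""
    return (a & b) | (a & c) | (b & c)


def protected_add(x, y, fault=None):
    """Compute x + y three times and vote on the result.

    `fault` is None or a (copy_index, bit) pair naming the copy to corrupt.
    Returns (voted_result, detected), where `detected` is True when the
    copies disagree.
    """
    copies = [(x + y) & MASK32 for _ in range(3)]
    if fault is not None:
        index, bit = fault
        copies[index] = flip_bit(copies[index], bit)
    detected = not (copies[0] == copies[1] == copies[2])
    return vote(*copies), detected


def campaign(trials, seed=0):
    """Inject one bit flip per trial; return (unprotected, protected) error rates."""
    rng = random.Random(seed)
    unprotected_errors = 0
    protected_errors = 0
    for _ in range(trials):
        x = rng.getrandbits(32)
        y = rng.getrandbits(32)
        golden = (x + y) & MASK32
        bit = rng.randrange(32)
        # Unprotected case: the fault hits the only copy of the result.
        if flip_bit(golden, bit) != golden:
            unprotected_errors += 1
        # Protected case: the fault hits one of three redundant copies.
        result, _ = protected_add(x, y, fault=(rng.randrange(3), bit))
        if result != golden:
            protected_errors += 1
    return unprotected_errors / trials, protected_errors / trials
```

Under this deliberately simple fault model (exactly one flipped bit per operation, landing in the result), the vote masks every injected fault. A realistic campaign would also inject faults into operands, addresses, control flow, and the voting code itself, which is where measured error rates would depart from such an idealised prediction.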
