The current trend in commercial processors toward multi-core architectures poses both an opportunity and a challenge for future space-based processing. The opportunity is to leverage multi-core processors for compute-intensive applications and thus provide an order-of-magnitude increase in onboard processing capability with less size, mass, and power. The challenge is to provide the requisite safety and reliability in an extremely harsh radiation environment. The objective is to advance from the multiple single-processor systems typically flown today to a fault-tolerant multi-core system. We investigate software-based methods for making multi-core processors tolerant of single event effects (SEEs) that cause interrupts or ‘bit-flips’, and we propose to use additional cores and memory resources together with newly developed software protection techniques. This work also assesses the optimal trade-off between reliability and performance. Our work is based on the modern compiler framework LLVM, chosen because it has been ported to many architectures. Within LLVM we implement optimization passes that automatically add protection techniques, including N-modular redundancy (NMR) and error detection and correction (EDAC), at the assembly/instruction level for the languages LLVM supports. The passes modify the intermediate representation of the source code, so they can be applied to any high-level language and any processor architecture supported by the LLVM framework. In our initial experiments, we implement triple modular redundancy (TMR) and error detection and correction codes (Hamming and BCH) separately at the instruction level. For critical applications we combine the two methods: we first apply TMR to the instructions and then use EDAC as a further measure when TMR is unable to correct the errors originating from an SEE.
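To make the TMR step concrete, the following is a minimal Python sketch of bitwise majority voting over three redundant copies of a value. It is an illustration only and not the LLVM pass described here; the function names `tmr_vote` and `tmr_read` and the 8-bit example value are assumptions chosen for the example.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies of a value."""
    # A result bit is 1 only if at least two of the three copies have it set.
    return (a & b) | (a & c) | (b & c)


def tmr_read(a: int, b: int, c: int) -> tuple:
    """Return the voted value and a flag telling whether any copy disagreed."""
    voted = tmr_vote(a, b, c)
    return voted, not (a == b == c)


value = 0b1011_0010      # hypothetical protected value
flip = 0b0001_0000       # hypothetical single bit-flip

# One corrupted copy: the two intact copies outvote it.
single_fault = tmr_read(value, value ^ flip, value)

# The same bit corrupted in two copies: the vote settles on the wrong value.
double_fault = tmr_vote(value ^ flip, value ^ flip, value)
```

The second case is the kind of situation voting alone cannot repair, which is why a further EDAC layer is useful for critical applications.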
Our initial experiments show good performance (about 10% overhead) when protecting the memory of the code with a single-error-correcting, double-error-detecting Hamming code and TMR. Further work is needed to improve performance when protecting the memory of the code with the BCH code. This work would be highly valuable not only for satellites and other space systems but also in general computing, such as aircraft, automotive systems, server farms, and medical equipment (or anywhere that requires safety-critical performance), as hardware gets smaller and more susceptible to such effects.
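As an illustration of single-error-correcting, double-error-detecting behaviour, the sketch below implements an extended Hamming (8,4) code in Python. The 4-bit data width, the bit layout, and the names `secded_encode` and `secded_decode` are assumptions made for this example; the word size and code parameters used in the experiments are not stated here.

```python
def secded_encode(nibble: int) -> list:
    """Encode 4 data bits as 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    bits = [0] * 8
    # Data bits go to the non-power-of-two positions 3, 5, 6, 7.
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # covers positions with bit 0 set
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # covers positions with bit 1 set
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # covers positions with bit 2 set
    for pos in range(1, 8):
        bits[0] ^= bits[pos]                # overall parity over positions 1..7
    return bits


def secded_decode(codeword: list) -> tuple:
    """Return (nibble, status); nibble is None when a double error is detected."""
    bits = list(codeword)                   # work on a copy
    syndrome = 0
    overall = 0
    for pos in range(8):
        overall ^= bits[pos]
        if bits[pos]:
            syndrome ^= pos                 # XOR of the positions of set bits
    if overall == 1:
        # Odd number of flips: treat as a single error located by the syndrome
        # (syndrome 0 means the overall parity bit itself was hit).
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        return None, "double error detected"
    else:
        status = "ok"
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, status


word = secded_encode(0b1010)
word[6] ^= 1                                # inject a single bit-flip
recovered = secded_decode(word)
```

A single flipped bit in any of the eight positions is repaired, while two flipped bits are reported rather than silently miscorrected.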