Improving Multi-Core System Dependability with Asymmetrically Reliable Cores

An emerging problem facing future high-performance multi-core processors is transient faults caused by radiation, noise, and other factors. These faults will likely make future multi-core processors less reliable as chip features shrink and the number of cores increases. To address this problem, we propose a new and practical systems approach that manages and allocates reliability according to the requirements of software processes. The approach relies on an asymmetric multi-core architecture built from cores with differing reliabilities. Software components are identified as critical or non-critical, and the critical components are matched with the higher-reliability cores. We show that, when critical processes can be isolated and executed on higher-reliability cores, asymmetrically reliable cores can reduce the overall system failure rate by a factor of several, while offering overall performance, power utilization, and chip area equal to or better than those of symmetric cores.
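The matching of critical processes to higher-reliability cores can be illustrated with a minimal sketch. It is a toy model and not the evaluation method used in this work. It assumes one process per core, assumes that only faults striking a critical process cause a system failure, and uses made-up fault rates. The names `Core`, `Process`, `assign`, and `system_failure_rate` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    name: str
    fault_rate: float  # hypothetical transient-fault rate, arbitrary units


@dataclass(frozen=True)
class Process:
    name: str
    critical: bool


def assign(processes, cores):
    """Pair processes with cores, giving critical processes the most reliable cores.

    Assumes one process per core and no more processes than cores.
    """
    cores_by_reliability = sorted(cores, key=lambda c: c.fault_rate)
    # sorted() is stable, so critical processes move to the front in their original order.
    critical_first = sorted(processes, key=lambda p: not p.critical)
    return list(zip(critical_first, cores_by_reliability))


def system_failure_rate(placement):
    """Toy model: the system fails only when a fault hits a critical process."""
    return sum(core.fault_rate for proc, core in placement if proc.critical)


if __name__ == "__main__":
    processes = [
        Process("logger", critical=False),
        Process("kernel", critical=True),
        Process("ui", critical=False),
        Process("codec", critical=False),
    ]
    # Assumed rates: four identical cores versus one hardened core plus three weaker ones.
    symmetric = [Core(f"s{i}", 4.0) for i in range(4)]
    asymmetric = [Core("r0", 1.0)] + [Core(f"u{i}", 5.0) for i in range(3)]

    print("symmetric :", system_failure_rate(assign(processes, symmetric)))
    print("asymmetric:", system_failure_rate(assign(processes, asymmetric)))
```

Under these assumptions the benefit comes entirely from isolation: the critical process sees only the hardened core's fault rate, while the weaker cores run work whose faults are treated as tolerable. The sketch says nothing about performance, power, or area, which would need a separate model.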
