On-line fault classification and handling in IEEE1687 based fault management system for complex SoCs

Semiconductor products manufactured with the latest and emerging processes are increasingly prone to wear-out and aging. As the fault occurrence rate in such systems increases, fault tolerance techniques are becoming ever more expensive, and one cannot rely on them alone. In addition to mitigating or correcting faults, the system may systematically monitor, detect, localize, diagnose and classify them (that is, manage faults). With such a fault management approach, the system may continue operating and degrade gracefully even if some of its resources become unusable due to intolerable faults. This work proposes a fault classification and handling methodology that fits an event-driven on-line fault monitoring, signaling and management architecture based on IEEE 1687 IJTAG and suitable for a modern complex SoC with many heterogeneous cores.
