Fast Online Diagnosis and Recovery of Reconfigurable Logic Fabrics Using Design Disjunction

Design disjunction is developed to offer a broad-coverage, high-resolution, and low-overhead approach to online diagnosis and recovery of reconfigurable fabrics. Design disjunction leverages the condensed diagnosability of $T$ logic resources to achieve self-recovery using partial reconfiguration in $O(\log T)$ steps. Reconfiguration is guided by the constructive property of $f$-disjunctness, which forms $O(\log T)$ resource groups at design time. Resolution of $f$ simultaneous resource faults is shown to be guaranteed when the resource groups are mutually $f$-disjunct. This extends run-time fault resilience to a large resource space with certainty for up to $f$ faults using a decision-free resolution process that also provides a high likelihood of identifying the fault's location to a fine granularity. Finally, design disjunction is parameterized to accommodate the low coverage issue of functional testing, for which inarticulate tests can otherwise impair fault isolation. Experimental results for MCNC and ISCAS benchmarks on a Xilinx 7-series field programmable gate array (FPGA) demonstrate $f$-diagnosability at the individual slice level with a minimum average isolation accuracy of 96.4 percent (94.4 percent) for $f=1$ ($f=2$). Results also demonstrate millisecond-order recovery with a minimum increase of 83.6 percent in fault coverage compared to $N$-modular redundancy (NMR) schemes. Recovery is achieved while incurring an average critical path delay impact of only 1.49 percent and an energy cost roughly comparable to conventional two-MR approaches.
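To make the role of disjunct resource groups concrete, the following minimal sketch illustrates non-adaptive group testing for the single-fault case ($f=1$). It is not the construction used in the paper: the bit-and-complement grouping, the function names `build_groups`, `run_tests`, and `diagnose`, and the assumption that every test is perfectly articulate (a group fails whenever it contains a faulty resource) are all illustrative choices. The sketch uses $2\lceil \log_2 T \rceil$ groups, which is $O(\log T)$, and isolates the fault without any adaptive decisions.

```python
from math import ceil, log2


def build_groups(T):
    """Build 2*ceil(log2 T) resource groups (hypothetical 1-disjunct design).

    For each bit position b, one group holds the resources whose index has
    bit b set and a second group holds those with bit b clear. No resource's
    group membership is contained in another's, so one fault is resolvable.
    """
    bits = max(1, ceil(log2(T)))
    groups = []
    for b in range(bits):
        groups.append({r for r in range(T) if (r >> b) & 1})
        groups.append({r for r in range(T) if not (r >> b) & 1})
    return groups


def run_tests(groups, faulty):
    """Simulate test outcomes: True means the group's configuration failed.

    Assumes fully articulate tests, i.e. any faulty resource in a group
    always makes that group fail.
    """
    return [bool(g & faulty) for g in groups]


def diagnose(groups, outcomes, T):
    """Clear every resource that appears in a passing group; return the rest."""
    cleared = set()
    for group, failed in zip(groups, outcomes):
        if not failed:
            cleared |= group
    return set(range(T)) - cleared


if __name__ == "__main__":
    T = 16
    groups = build_groups(T)
    outcomes = run_tests(groups, faulty={11})
    print(len(groups), diagnose(groups, outcomes, T))
```

With inarticulate tests, a faulty group may pass and wrongly clear the faulty resource, which is the coverage issue the abstract says the parameterization is meant to accommodate. Handling $f > 1$ requires a genuinely $f$-disjunct grouping, which this sketch does not provide.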
