Design Disjunction for Resilient Reconfigurable Hardware

Contemporary reconfigurable hardware devices can achieve the high performance, power efficiency, and adaptability required to meet a wide range of design goals. With scaling challenges facing current complementary metal-oxide-semiconductor (CMOS) technology, new concepts and methodologies that support efficient adaptation to handle reliability issues are becoming increasingly prominent. Reconfigurable hardware and its ability to realize self-organization features are expected to play a key role in designing future dependable hardware architectures. However, the exponential increase in density and complexity of current commercial SRAM-based field-programmable gate arrays (FPGAs) has escalated the overhead associated with dynamic runtime design adaptation. Traditionally, static modular redundancy techniques are considered to surmount this limitation; however, they can incur substantial overheads in both area and power requirements. To achieve a better trade-off among performance, area, power, and reliability, this research presents design-time approaches that enable fine selection of redundancy level based on target reliability goals and autonomous adaptation to runtime demands. Three studies were conducted toward this goal. First, a graph- and set-theoretic approach, named Hypergraph-Cover Diversity (HCD), is introduced as a preemptive design technique to shift the dominant costs of resiliency to design-time. In particular, union-free hypergraphs are exploited to partition the reconfigurable resource pool into highly separable subsets of resources, each of which can be utilized by the same synthesized application netlist. The diverse implementations provide reconfiguration-based resilience throughout the system lifetime while avoiding the significant overheads associated with runtime placement and routing phases.
Evaluation on a MotionJPEG image compression core using a Xilinx 7-series-based FPGA hardware platform has demonstrated the potential of the proposed fault-tolerance (FT) method to achieve 37.5% area saving and up to 66% reduction in power consumption compared to the frequently used triple modular redundancy (TMR) scheme while providing superior fault tolerance. Second, Design Disjunction based on non-adaptive group testing is developed to realize a low-overhead fault-tolerant system capable of handling self-testing and self-recovery using runtime partial reconfiguration. Reconfiguration is guided by resource grouping procedures which employ non-linear measurements given by the constructive property of f-disjunctness to extend runtime resilience to a large fault space and realize a favorable range of trade-offs. Disjunct designs are created using the mosaic convergence algorithm, developed such that at least one configuration in the library evades any occurrence of up to d resource faults, where d is lower-bounded by f. Experimental results for a set of MCNC and ISCAS benchmarks have demonstrated f-diagnosability at the individual slice level with average isolation resolution of 96.4% for f = 1 and 94.4% for f = 2, while incurring an average critical path delay impact of only 1.49% and area cost roughly comparable to conventional 2-MR approaches. Finally, the proposed Design Disjunction method is evaluated as a design-time method to improve timing yield in the presence of large random within-die (WID) process variations for applications with moderately high production capacity. Results for a set of benchmarks show an average gain in timing yield of up to 39.42%, 36.91%, and 57.45% for total variations of 25%, 15%, and 5%, respectively. The enhanced timing yield is attained while achieving reductions in mean delay of 9.96%, 6.85%, and 3.58% for the same variability levels.
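
The fault-evasion property described above (at least one configuration in the library avoids any set of up to d faulty resources) can be illustrated with a small sketch. This is not the mosaic convergence algorithm; it is only a brute-force check over a hand-built toy library, followed by a simple non-adaptive decoding step in which every resource used by a passing configuration is cleared. All names (`library_tolerates`, `isolate_suspects`), the modeling of configurations as sets of resource indices, and the pass/fail outcomes are assumptions made for illustration.

```python
from itertools import combinations

def library_tolerates(library, resources, d):
    """Return True if every fault set of size 1..d is evaded by
    at least one configuration (one that uses no faulty resource)."""
    for k in range(1, d + 1):
        for faults in combinations(resources, k):
            if not any(config.isdisjoint(faults) for config in library):
                return False
    return True

def isolate_suspects(library, outcomes):
    """Non-adaptive decoding: a resource used by any passing
    configuration is cleared; the remaining used resources are suspects."""
    used = set().union(*library)
    cleared = set()
    for config, passed in zip(library, outcomes):
        if passed:
            cleared |= config
    return used - cleared

# Hypothetical toy library: 6 resources, 4 configurations,
# each resource used by exactly two configurations.
resources = range(6)
library = [{0, 1, 2}, {0, 3, 4}, {1, 3, 5}, {2, 4, 5}]

tolerates_one = library_tolerates(library, resources, 1)
tolerates_two = library_tolerates(library, resources, 2)

# Assume resource 0 is faulty: the configurations using it fail their test.
outcomes = [0 not in config for config in library]
suspects = isolate_suspects(library, outcomes)
```

In this toy library any single fault leaves two configurations usable, while a pair of faults such as resources 0 and 5 touches all four, so tolerating two faults would need a larger or differently arranged library. The exhaustive check grows combinatorially with the number of resources and is meant only to make the property concrete.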
