FATAL+: A Self-Stabilizing Byzantine Fault-tolerant Clocking Scheme for SoCs

We present the concept and implementation of a self-stabilizing Byzantine fault-tolerant distributed clock generation scheme for multi-synchronous globally asynchronous, locally synchronous (GALS) architectures in critical applications. It combines a variant of a recently introduced self-stabilizing algorithm for generating low-frequency, low-accuracy synchronized pulses with a simple non-stabilizing high-frequency, high-accuracy clock synchronization algorithm. We provide thorough correctness proofs and a performance analysis, which use methods from fault-tolerant distributed computing research but also address hardware-related issues such as metastability. The algorithm, which consists of several concurrent, communicating asynchronous state machines, has been implemented in VHDL using Petrify in conjunction with some extensions, and synthesized for an Altera Cyclone FPGA. An experimental validation of this prototype has been carried out to confirm the skew and clock frequency bounds predicted by the theoretical analysis, as well as the very short stabilization times achievable in practice (stabilization is needed to recover after excessively many transient failures).
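The abstract does not spell out the high-frequency synchronization algorithm. As an illustration only, and not the FATAL+ algorithm itself, the Python sketch below shows the threshold-voting idea behind classic Byzantine fault-tolerant tick generation in the spirit of Srikanth and Toueg: a node catches up once f+1 nodes report a higher tick (so at least one correct node really reached it), and advances once n−f nodes have reached its current tick, assuming the usual resilience bound n ≥ 3f+1. The lockstep one-step message delay, the `NOTHING` sentinel, and the names `update` and `simulate` are assumptions made for this sketch.

```python
from typing import Callable, List

NOTHING = -1  # hypothetical sentinel: no tick received from this sender yet


def kth_largest(values: List[int], k: int) -> int:
    """Return the k-th largest entry (1-based) of values."""
    return sorted(values, reverse=True)[k - 1]


def update(i: int, heard: List[int], f: int) -> int:
    """One threshold-voting step for correct node i (illustrative sketch).

    heard[j] is the highest tick counter received from node j; heard[i] is
    node i's own counter. Requires n >= 3f + 1 nodes in total.
    """
    n = len(heard)
    assert n >= 3 * f + 1, "Byzantine clock sync needs n >= 3f + 1"
    view = list(heard)
    # Catch-up (relay): f+1 senders claim >= t, so at least one correct
    # node has really reached t; f Byzantine nodes alone cannot trigger it.
    view[i] = max(view[i], kth_largest(view, f + 1))
    # Advance: n-f senders (at least f+1 of them correct) report >= our
    # counter, so generate the next tick.
    if kth_largest(view, n - f) >= view[i]:
        view[i] += 1
    return view[i]


def simulate(initial: List[int], f: int, steps: int,
             byzantine: Callable[[int, int, int], int]) -> List[List[int]]:
    """Lockstep simulation: every message is delivered exactly one step later.

    Nodes 0..len(initial)-1 are correct; the f Byzantine nodes come after them
    and send byzantine(step, receiver, byzantine_index) to each correct node.
    """
    ticks = list(initial)
    history = [list(ticks)]
    for step in range(steps):
        snapshot = list(ticks)  # counters broadcast in the previous step
        ticks = [
            update(i, snapshot + [byzantine(step, i, b) for b in range(f)], f)
            for i in range(len(snapshot))
        ]
        history.append(list(ticks))
    return history


def liar(step: int, receiver: int, sender: int) -> int:
    """Hypothetical Byzantine node: claims tick 1000 to node 0, silent to others."""
    return 1000 if receiver == 0 else NOTHING


if __name__ == "__main__":
    # Three correct nodes (node 2 lags by two ticks) and one Byzantine node.
    hist = simulate([5, 5, 3], f=1, steps=20, byzantine=liar)
    print(hist[:4])  # [[5, 5, 3], [6, 5, 6], [7, 7, 6], [8, 7, 8]]
```

In this run the lagging node catches up within one step, the bogus value 1000 is never adopted, and the correct counters stay within one tick of each other while continuing to advance. A hardware realization built from asynchronous state machines would additionally have to bound message delays and cope with metastability, which this lockstep model ignores.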
