Fault-tolerant Distributed Systems in Hardware

Very large-scale integrated (VLSI) hardware designs can be seen as distributed systems at several levels of abstraction: from the cores in a multicore architecture down to the Boolean gates in its circuit implementation, hardware designs consist of interacting computing nodes with non-negligible communication delays. The resulting similarities to classic large-scale distributed systems become even more pronounced in mission-critical hardware designs that are required to operate correctly in the presence of component failures. We advocate acting on this observation and treating fault-tolerant hardware design as the task of devising suitable distributed algorithms. By means of problems related to clock generation and distribution, we show that (i) design and analysis techniques from distributed computing can provide new and provably correct mission-critical hardware solutions and (ii) studying such systems reveals many interesting and challenging open problems for distributed computing.
