Mapping a Fault-Tolerant Distributed Algorithm to Systems on Chip

Systems on chip (SoCs) have much in common with traditional (networked) distributed systems: both consist of largely independent components with dedicated communication interfaces. Adopting classic distributed algorithms for SoCs therefore suggests itself. The implementation complexity of these algorithms, however, depends significantly on the underlying failure model. In traditional software-based solutions this is normally not an issue, so the least constrained model, namely the Byzantine failure model, is often applied. Our case study of a hardware-implemented tick synchronization algorithm shows, however, that in an SoC implementation substantial hardware savings can result from restricting the failure model to benign failures (omissions, crashes). On the downside, it turns out that such restricted failure models have fairly poor coverage of the hardware faults that occur in practice, and that the additional measures needed to enforce these restrictions may entail an implementation overhead that outweighs the gain from implementing a simpler algorithm. As a remedy, we investigate the potential of failure transformation in this context and show that this technique may indeed yield an optimized overall solution.
