Dynamic Fault-Tolerance and Metrics for Battery Powered, Failure-Prone Systems

Emerging VLSI technologies and platforms are giving rise to systems with inherently high potential for runtime failure. Such failures range from intermittent electrical and mechanical failures at the system level, to device failures at the chip level. Techniques to provide reliable computation in the presence of failures must do so while maintaining high performance, with an eye toward energy efficiency. When possible, they should maximize battery lifetime in the face of battery discharge non-linearities. This paper introduces the concept of adaptive fault-tolerance management for failure-prone systems, and a classification of local algorithms for achieving system-wide reliability. In order to judge the efficacy of the proposed algorithms for dynamic fault-tolerance management, a set of metrics, for characterizing system behavior in terms of energy efficiency, reliability, computation performance and battery lifetime, is presented. For an example platform employed in a realistic evaluation scenario, it is shown that system configurations with the best performance and lifetime are not necessarily those with the best combination of performance, reliability, battery lifetime and average power consumption.
