Engineering Cross-Layer Fault Tolerance in Many-Core Systems

Engineering modern many-core systems is challenging because of their scale and complexity. Their dependability cannot be ensured without understanding how it interacts with performance and energy consumption. This calls for new structuring mechanisms that step away from traditional development practices such as strict layering, strong encapsulation, abstraction, and information hiding. The paper reports on the initial steps of PhD work on development methods and tools for architecting cross-layer fault tolerance in many-core systems, in which error detection and error recovery are applied at several system layers in a coordinated fashion to ensure overall system efficiency.
