Aging-aware hardware-software task partitioning for reliable reconfigurable multiprocessor systems

Homogeneous multiprocessor systems with a reconfigurable area (also known as Reconfigurable Multiprocessor Systems) are emerging as a popular design choice in current and future technology nodes to meet the heterogeneous computing demands of the many applications enabled on these platforms. Application-specific mapping decisions on such a platform involve partitioning a given application into software tasks (executed on one or more of the general-purpose processors, GPPs) and hardware tasks (realized as dedicated hardware on the reconfigurable area) to optimize and/or satisfy design constraints such as reliability, performance and design cost. Increasing the number of checkpoints improves reliability against transient faults but degrades reliability against permanent faults. This trade-off is ignored in all prior studies on task mapping and scheduling. This paper proposes an optimization technique that determines the optimal number of checkpoints for the software tasks, minimizing aging of the GPPs while maximizing the transient fault-tolerance of the overall platform (GPPs and the reconfigurable area) and satisfying design cost and performance constraints. Experiments conducted with synthetic and real-life application task graphs (cyclic and acyclic) demonstrate that the proposed technique minimizes aging and improves the platform lifetime by an average of 60% compared to existing transient fault-aware techniques. Further, a gradient-based heuristic is proposed that reduces the design space exploration time by up to 500× with less than 5% deviation from the optimal solution.
