Facing up to the Inevitable: Intelligent Error Recovery in Massively Parallel Processing in Memory Architectures

Massively parallel “Processing-In-Memory” (PIM) architectures have been shown to improve performance because of their “memory-centric” nature. However, because PIM is still a developing technology, advanced issues such as error detection and failure recovery have not yet been addressed. We describe how concepts from our multi-agent system, ADE, can be applied to PIM: its mechanisms for automatic and intelligent error detection, failure recovery, and dynamic system reconfiguration are incorporated into the PIM architecture to enhance its robustness.
