Hardware Fault Containment in Scalable Shared-Memory Multiprocessors

Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size.

The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine.

Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems up to 128 processors.
