Concurrent and fail-safe replicated simulations on heterogeneous networks: An introduction to EcliPSe

This paper presents an overview of the ACES parallel software system and, in particular, introduces its EcliPSe layer. ACES is a fault-tolerant, layered software system for cluster computing on heterogeneous networks. The EcliPSe toolkit, which resides on an upper layer, was constructed specifically for replication-based and domain-decomposition-based simulation applications. It is not, however, restricted to simulations and supports any message-passing form of parallel processing. By taking advantage of networks of heterogeneous machines, generally “idle” workstations, EcliPSe programs can achieve supercomputer-level performance with little programming effort; this was a motivating factor in EcliPSe's design. We present an overview of key application-level features in EcliPSe, a new user interface, support for fault-tolerant simulation, and performance results for three simple but large-scale and representative experiments.
