Paper information - The co-replication methodology and its application to structured parallel programs

We introduce Co-Replication, a technique that exploits abstract properties of a computation to let parallel replicas of a software module cooperate, enhancing both the reliability and the availability of the resulting component and providing a flexible trade-off between the two properties. In Co-Replication, a complete partial ordering is defined on the computation state. The formal expression of the state combination operation among replicas allows them to compute independently as a co-algorithm, and to exploit low-overhead, opportunistic strategies for spreading results and surviving faults. Co-Replication suits structured parallel and component-based programming, as it needs a high-level description of the computation properties, and can thus ease the exploitation of parallel platforms that are not fault-free, such as large clusters and Grids. We describe the theoretical foundations of Co-Replication and investigate the use of random gossiping strategies for the state combination. To show the applicability of the technique, we discuss the modeling of Master-Slave and task farm computations, and report test results for two applications.
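
As an illustration of the idea, the following minimal Python sketch models a task farm in which each replica holds a partial map from tasks to results. It is not the paper's implementation. It assumes that states are ordered by inclusion, that the state combination operation is the union of the maps (a least upper bound), and that replicas push their state to one randomly chosen peer per round. All names (`join`, `leq`, `run_farm`, `crash_round`, `crash_id`) are hypothetical, and the task function is assumed to be deterministic so that two replicas never disagree on a result.

```python
import random


def join(a, b):
    """Least upper bound of two partial states (dict: task -> result)."""
    merged = dict(a)
    merged.update(b)  # safe only because work() is assumed deterministic
    return merged


def leq(a, b):
    """Partial order: a <= b when b contains every result recorded in a."""
    return all(task in b and b[task] == res for task, res in a.items())


def run_farm(tasks, work, n_replicas=3, crash_round=None, crash_id=0, seed=0):
    """Run replicas until one live replica holds every result.

    Returns (final_state, rounds). Hypothetical driver, for illustration only.
    """
    rng = random.Random(seed)
    tasks = list(tasks)
    states = [{} for _ in range(n_replicas)]
    alive = [True] * n_replicas
    rounds = 0
    while True:
        if not any(alive):
            raise RuntimeError("all replicas have failed")
        for i in range(n_replicas):
            if alive[i] and len(states[i]) == len(tasks):
                return states[i], rounds
        rounds += 1
        if crash_round == rounds:
            alive[crash_id] = False  # simulated fault: this replica stops
        for i in range(n_replicas):
            if not alive[i]:
                continue
            # Each replica independently computes one task it has not seen.
            pending = [t for t in tasks if t not in states[i]]
            if pending:
                t = rng.choice(pending)
                states[i][t] = work(t)
            # Random gossip: push the local state to one live peer,
            # which combines it with its own state through join().
            peers = [j for j in range(n_replicas) if j != i and alive[j]]
            if peers:
                j = rng.choice(peers)
                states[j] = join(states[j], states[i])
```

Calling `run_farm(range(10), lambda t: t * t, crash_round=2)` returns the complete result map even though one replica stops in the second round, because the surviving replicas keep merging whatever state reached them. Since `join` is commutative and idempotent, merges can arrive in any order or be repeated without corrupting the state, which is what makes opportunistic gossip safe in this sketch.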
