The co-replication methodology and its application to structured parallel programs

We introduce Co-Replication, a technique that exploits abstract properties of a computation to let parallel replicas of a software module cooperate, enhancing both the reliability and the availability of the resulting component and providing a flexible trade-off between the two properties. In Co-Replication, a complete partial ordering is defined on the computation state. The formal expression of the state combination operation among replicas allows them to compute independently as a co-algorithm, and to exploit low-overhead, opportunistic strategies for spreading results and surviving faults. Co-Replication suits structured parallel and component-based programming, as it needs a high-level description of the computation properties, and can thus ease the exploitation of parallel platforms that are not fault-free, such as large clusters and Grids. We describe the theoretical foundations of Co-Replication and investigate the use of random gossiping strategies for the state combination. To show the applicability of the technique, we discuss the modeling of Master-Slave and task farm computations, and report test results for two applications.
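As an illustration only, the idea of replicas that combine partially ordered states through random gossip can be sketched for a task farm. This sketch is not the authors' implementation. It assumes that a replica's state is a map from task to result, that states are ordered by inclusion, and that the worker function is deterministic; the names `combine`, `leq`, and `run_replicas` are hypothetical.

```python
import random

def combine(a, b):
    """Assumed state combination: the union of two partial result maps."""
    merged = dict(a)
    merged.update(b)
    return merged

def leq(a, b):
    """Assumed ordering: a <= b when b holds every result that a holds."""
    return all(k in b and b[k] == v for k, v in a.items())

def run_replicas(tasks, worker, n_replicas=3, seed=0):
    """Simulate replicas that each compute tasks and gossip their state."""
    rng = random.Random(seed)
    states = [dict() for _ in range(n_replicas)]
    # Loop until every replica knows the result of every task.
    while any(len(s) < len(tasks) for s in states):
        for i, state in enumerate(states):
            pending = [t for t in tasks if t not in state]
            if pending:
                # Each replica works independently on a task it lacks.
                t = rng.choice(pending)
                state[t] = worker(t)
            if n_replicas > 1:
                # Random gossip: push the local state to one random peer.
                peer = rng.choice([j for j in range(n_replicas) if j != i])
                states[peer] = combine(states[peer], state)
    return states

final_states = run_replicas(list(range(6)), lambda t: t * t)
```

Because combining states only ever moves a replica upward in the assumed ordering, a replica that is lost in this toy model costs at most the results it had not yet gossiped, and the surviving replicas can still complete the computation.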
