Reparallelization techniques for migrating OpenMP codes in computational grids

Typical computational grid users target only a single cluster and have to estimate the runtime of their jobs. Job schedulers prefer short-running jobs to maintain high system utilization. If the user underestimates the runtime, premature termination causes loss of computation; overestimation is penalized by long queue times. As a solution, we present automatic reparallelization and migration of OpenMP applications. A reparallelization is dynamically computed for an OpenMP work distribution when the number of CPUs changes. The application can be migrated between clusters when an allocated time slice is exceeded. Migration is based on a coordinated, heterogeneous checkpointing algorithm. Together, reparallelization and migration enable the user to freely use computing time at more than a single point of the grid. Our demo applications successfully adapt to the changed CPU setting and smoothly migrate between, for example, clusters in Erlangen, Germany, and Amsterdam, the Netherlands, that use different kinds and numbers of processors. Benchmarks show that reparallelization and migration impose average overheads of about 4% and 2%, respectively. Copyright © 2008 John Wiley & Sons, Ltd.
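The idea of recomputing a work distribution when the number of CPUs changes can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not describe in detail. It assumes a static block schedule over a loop range and a per-worker progress counter recorded at a checkpoint; the names `block_partition`, `reparallelize`, and `next_index` are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open iteration range [lo, hi)


def block_partition(lo: int, hi: int, workers: int) -> List[Interval]:
    """Split [lo, hi) into contiguous blocks whose sizes differ by at most one."""
    if workers < 1:
        raise ValueError("workers must be positive")
    base, extra = divmod(hi - lo, workers)
    blocks, start = [], lo
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        blocks.append((start, start + size))
        start += size
    return blocks


def reparallelize(blocks: List[Interval], next_index: List[int],
                  new_workers: int) -> List[List[Interval]]:
    """Redistribute the unfinished iterations over a new number of workers.

    next_index[i] is the first iteration of blocks[i] that has not run yet
    (an assumed progress counter, saved when the checkpoint was taken).
    """
    if new_workers < 1:
        raise ValueError("new_workers must be positive")
    # Collect what is left of every old block, skipping finished blocks.
    remaining = [(max(nxt, lo), hi)
                 for (lo, hi), nxt in zip(blocks, next_index)
                 if max(nxt, lo) < hi]
    total = sum(hi - lo for lo, hi in remaining)
    base, extra = divmod(total, new_workers)

    pending = iter(remaining)
    cur: Optional[Interval] = next(pending, None)
    plan: List[List[Interval]] = []
    for w in range(new_workers):
        need = base + (1 if w < extra else 0)
        share: List[Interval] = []
        while need > 0 and cur is not None:
            lo, hi = cur
            take = min(need, hi - lo)
            share.append((lo, lo + take))
            need -= take
            # Keep the unused tail of this interval, or move to the next one.
            cur = (lo + take, hi) if lo + take < hi else next(pending, None)
        plan.append(share)
    return plan


if __name__ == "__main__":
    old = block_partition(0, 100, 4)       # four workers before the change
    progress = [10, 40, 75, 90]            # first unprocessed index per worker
    print(reparallelize(old, progress, 2)) # two workers after the change
```

In this sketch the new plan preserves iteration order and balances only the iteration count; a real runtime would also have to account for per-iteration cost, data placement, and differences in processor speed between clusters.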
