Process Introspection: A Heterogeneous Checkpoint/Restart Mechanism Based on Automatic Code Modification

Process Introspection is a fundamentally new solution to the process checkpoint/restart problem, suitable for use in high-performance heterogeneous distributed systems. The primary requirement on a checkpoint/restart mechanism in such an environment is platform independence: a process checkpoint produced on a computer system of one architecture or operating system must be restartable on a computer system of a different architecture or operating system. The central feature of the Process Introspection approach is automatic augmentation of program code to incorporate checkpoint and restart functionality. This modification is performed at a platform-independent intermediate level of code representation and preserves the original program semantics. The approach has attractive properties, including portability, ease of use, customizability to application-specific requirements, and flexibility with respect to basic performance trade-offs. Our solution is novel in its true independence from both the platform and the run-time system: the core mechanisms require no system support and no non-portable code. Recent experimental results obtained with a prototype implementation of the Process Introspection system indicate that the overheads introduced by the mechanisms are acceptable for computationally demanding applications.
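
To make the idea of code augmentation concrete, the following is a minimal sketch of what a routine might look like before and after checkpoint/restart logic is woven into it. It is an illustration only, not the system's actual transformation or output: the names `sum_squares_augmented`, `CheckpointRequested`, `restore`, and `checkpoint_at` are hypothetical, Python stands in for the intermediate code representation, and JSON text stands in for a platform-independent checkpoint format.

```python
import json


class CheckpointRequested(Exception):
    """Hypothetical signal that carries a checkpoint out of the computation."""

    def __init__(self, blob):
        super().__init__("checkpoint taken")
        self.blob = blob  # platform-independent text, safe to move between hosts


def sum_squares(n):
    """The original routine, as the programmer wrote it."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_augmented(n, restore=None, checkpoint_at=None):
    """Same semantics as sum_squares, plus save/restore of its live state.

    restore:       a checkpoint blob to resume from (assumed parameter).
    checkpoint_at: iteration at which a checkpoint is requested; a stand-in
                   for an externally raised request (assumed parameter).
    """
    if restore is not None:
        # Restart path: rebuild the live variables from the checkpoint.
        state = json.loads(restore)
        i, total = state["i"], state["total"]
    else:
        i, total = 0, 0

    while i < n:
        # Poll point inserted by the (hypothetical) transformation.
        if checkpoint_at is not None and i == checkpoint_at:
            blob = json.dumps({"i": i, "total": total})
            raise CheckpointRequested(blob)
        total += i * i
        i += 1
    return total
```

In this sketch, a caller would catch `CheckpointRequested`, ship `blob` to another machine, and call `sum_squares_augmented(n, restore=blob)` there; the result matches `sum_squares(n)`. Because the saved state is described in terms of program variables rather than raw memory or registers, nothing in the checkpoint depends on the host's architecture.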
