Checkpointing in CosMiC: a user-level process migration environment

CosMiC is a user-level process migration environment. Process migration is the mechanism that checkpoints the state of an unfinished process, transfers that state from one machine to another, and resumes execution of the process on the new machine. It serves two main purposes: (1) to use available CPU power and balance load across all machines in an environment; and (2) to provide fault tolerance by migrating a process from a failed machine to another machine.
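
The three steps in that definition (checkpoint, transfer, resume) can be illustrated with a toy, application-level sketch. This is not CosMiC's mechanism, which the abstract does not detail. The helper names `checkpoint`, `resume`, and `run` are hypothetical, the "process" is a simple loop whose whole state fits in a dictionary, and the transfer step is stood in for by passing bytes within a single program.

```python
import pickle


def checkpoint(state):
    """Serialize the state of an unfinished computation to bytes (hypothetical helper)."""
    return pickle.dumps(state)


def resume(blob):
    """Rebuild the state from checkpoint bytes received on the new machine (hypothetical helper)."""
    return pickle.loads(blob)


def run(state, steps):
    """Advance a toy job for at most `steps` iterations: sum the integers below state['limit']."""
    while steps > 0 and state["i"] < state["limit"]:
        state["total"] += state["i"]
        state["i"] += 1
        steps -= 1
    return state


# "Machine A" runs part of the job, then checkpoints it before finishing.
state = run({"i": 0, "total": 0, "limit": 10}, steps=4)
blob = checkpoint(state)

# The bytes in `blob` stand in for the transfer step (a socket or shared file in practice).
# "Machine B" resumes from the checkpoint and completes the remaining iterations.
finished = run(resume(blob), steps=10)
print(finished["total"])
```

A real user-level migration system has to capture far more than this sketch does, for example the stack, heap, and open file descriptors of the process; the example only shows the shape of the checkpoint, transfer, resume cycle.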
