Error scope on a computational grid: theory and practice

Error propagation is a central problem in grid computing. We re-learned this while adding a Java feature to the Condor computational grid. Our initial experience with the system was negative, due to the large number of new ways in which the system could fail. To reason about this problem, we developed a theory of error propagation. Central to our theory is the concept of an error's scope, defined as the portion of a system that it invalidates. With this theory in hand, we recognized that the expanded system did not properly consider the scope of errors it discovered. We modified the system according to our theory, and succeeded in making it a more robust platform for distributed computing.
