Symmetric Active/Active High Availability for High-Performance Computing System Services
[1] Idit Keidar,et al. Group communication specifications: a comprehensive study , 2001, CSUR.
[2] Jack J. Dongarra,et al. Fault Tolerant MPI for the HARNESS Meta-computing System , 2001, International Conference on Computational Science.
[3] David P. Anderson,et al. SETI@home-massively distributed computing for SETI , 2001, Comput. Sci. Eng..
[4] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..
[5] James Arthur Kohl,et al. Harness: Adaptable Virtual Machine Environment for Heterogeneous Clusters , 1999, Parallel Process. Lett..
[6] Luís E. T. Rodrigues,et al. An indulgent uniform total order algorithm with optimistic delivery , 2002, 21st IEEE Symposium on Reliable Distributed Systems, 2002. Proceedings..
[7] Xubin He,et al. Design of a high performance and high availability distributed storage system , 2006 .
[8] Michael T. Heath,et al. Scientific Computing , 2018 .
[9] Stephen L. Scott,et al. Evaluation of fault-tolerant policies using simulation , 2007, 2007 IEEE International Conference on Cluster Computing.
[10] Mario Lauria,et al. CSAR: cluster storage with adaptive redundancy , 2003, 2003 International Conference on Parallel Processing, 2003. Proceedings..
[11] Christian Engelmann,et al. A diskless checkpointing algorithm for super-scale architectures applied to the fast fourier transform , 2003, Proceedings of the International Workshop on Challenges of Large Applications in Distributed Environments, 2003..
[12] Leslie Lamport,et al. The Byzantine Generals Problem , 1982, TOPL.
[13] G. A. Geist,et al. High Availability through Distributed Control , 2004 .
[14] Jason Duell,et al. Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters , 2006 .
[15] John A. Gunnels,et al. Simulating solidification in metals at high pressure: The drive to petascale computing , 2006 .
[16] David E. Bernholdt,et al. MOLAR: adaptive runtime support for high-end computing operating and runtime systems , 2006, OPSR.
[17] Andrew Lumsdaine,et al. A Component Architecture for LAM/MPI , 2003, PVM/MPI.
[18] Andy B. Yoo,et al. X-ray Pulse Compression Using Strained Crystals , 2002 .
[19] Louise E. Moser,et al. The Totem single-ring ordering and membership protocol , 1995, TOCS.
[20] Danny Dolev,et al. Early delivery totally ordered multicast in asynchronous environments , 1993, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.
[21] Christian Engelmann,et al. JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management , 2006, 2006 IEEE International Conference on Cluster Computing.
[22] Leslie Lamport,et al. The part-time parliament , 1998, TOCS.
[23] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[24] Fred B. Schneider,et al. Implementing fault-tolerant services using the state machine approach: a tutorial , 1990, CSUR.
[25] Xubin He,et al. A Fast Delivery Protocol for Total Order Broadcasting , 2007, 2007 16th International Conference on Computer Communications and Networks.
[26] Leslie Lamport,et al. Time, clocks, and the ordering of events in a distributed system , 1978, CACM.
[27] Pedro Pla. Drbd in a heartbeat , 2006 .
[28] Kenneth P. Birman,et al. Performance of the ISIS Distributed Computing Toolkit , 1994 .
[29] Daniel J. Palermo,et al. Enhancing an Open Source Resource Manager with Multi-core/Multi-threaded Support , 2007, JSSPP.
[30] Xubin He,et al. Symmetric Active/Active Replication for Dependent Services , 2008, 2008 Third International Conference on Availability, Reliability and Security.
[31] Danny Dolev,et al. The Design of the Transis System , 1994, Dagstuhl Seminar on Distributed Systems.
[32] Sam Toueg,et al. A Modular Approach to Fault-Tolerant Broadcasts and Related Problems , 1994 .
[33] Christian Engelmann,et al. Concepts for High Availability in Scientific High-End Computing , 2005 .
[34] Thomas Sterling,et al. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters 2nd Printing , 1999 .
[35] Israel Koren,et al. Fault-Tolerant Systems , 2007 .
[36] James Arthur Kohl,et al. HARNESS: a next generation distributed virtual machine , 1999, Future Gener. Comput. Syst..
[37] Louise E. Moser,et al. A reliable ordered delivery protocol for interconnected local area networks , 1995, Proceedings of International Conference on Network Protocols.
[38] George Bosilca,et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation , 2004, PVM/MPI.
[39] Courtenay T. Vaughan,et al. Extending Catamount for multi-core processors , 2007 .
[40] Robbert van Renesse,et al. Reliable Distributed Computing with the Isis Toolkit , 1994 .
[41] William Gropp,et al. Beowulf Cluster Computing with Linux , 2003 .
[42] Nancy A. Lynch,et al. Early-Delivery Dynamic Atomic Broadcast , 2002, DISC.
[43] Sean Landis,et al. Building Reliable Distributed Systems with CORBA , 1997, Theory Pract. Object Syst..
[44] Jack J. Dongarra,et al. HARNESS and fault tolerant MPI , 2001, Parallel Comput..
[45] Priya Narasimhan,et al. Thema: Byzantine-fault-tolerant middleware for Web-service applications , 2005, 24th IEEE Symposium on Reliable Distributed Systems (SRDS'05).
[46] Douglas Thain,et al. Distributed computing in practice: the Condor experience , 2005, Concurr. Pract. Exp..
[47] Christian Engelmann,et al. Distributed Peer-to-Peer Control in Harness , 2002, International Conference on Computational Science.
[48] Salim Hariri,et al. Tools and Environments for Parallel and Distributed Computing , 2004 .
[49] Miguel Castro,et al. BASE: using abstraction to improve fault tolerance , 2001, SOSP.
[50] Robert P. Goldberg,et al. Architecture of virtual machines , 1973, Workshop on Virtual Computer Systems.
[51] Danny Dolev,et al. The Transis approach to high availability cluster communication , 1996, CACM.
[52] Silvano Maffeis,et al. The Object Group Design Pattern , 1996, COOTS.
[53] David P. Anderson,et al. BOINC: a system for public-resource computing and storage , 2004, Fifth IEEE/ACM International Workshop on Grid Computing.
[54] Louise E. Moser,et al. Extended virtual synchrony , 1994, 14th International Conference on Distributed Computing Systems.
[55] Philipp Reisner,et al. Replicated Storage with Shared Disk Semantics , 2007 .
[56] Christian Engelmann,et al. Super-Scalable Algorithms for Computing on 100,000 Processors , 2005, International Conference on Computational Science.
[57] Christian Engelmann,et al. Proactive fault tolerance for HPC with Xen virtualization , 2007, ICS '07.
[58] Robbert van Renesse,et al. The Amoeba distributed operating system - A status report , 1991, Comput. Commun..
[59] Ira Pramanick,et al. High Availability , 2001, Int. J. High Perform. Comput. Appl..
[60] Suzanne M. Kelly,et al. Software Architecture of the Light Weight Kernel, Catamount , 2005 .
[61] Hari Balakrishnan,et al. Tolerating byzantine faults in transaction processing systems using commit barrier scheduling , 2007, SOSP.
[62] Christian Engelmann,et al. Job-Site Level Fault Tolerance for Cluster and Grid environments , 2005, 2005 IEEE International Conference on Cluster Computing.
[63] Leslie Lamport,et al. Using Time Instead of Timeout for Fault-Tolerant Distributed Systems , 1984, TOPL.
[64] Wu-chun Feng,et al. A Power-Aware Run-Time System for High-Performance Computing , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[65] Robbert van Renesse,et al. Design and Performance of Horus: A Lightweight Group Communications System , 1994 .
[66] Laurent Lefèvre,et al. T2CP-AR: A system for Transparent TCP Active Replication , 2007, 21st International Conference on Advanced Information Networking and Applications (AINA '07).
[67] Andrew S. Tanenbaum,et al. An evaluation of the Amoeba group communication system , 1996, Proceedings of 16th International Conference on Distributed Computing Systems.
[68] D. B. Davis,et al. Sun Microsystems Inc. , 1993 .
[69] Christian Engelmann,et al. Active/active replication for highly available HPC system services , 2006, First International Conference on Availability, Reliability and Security (ARES'06).
[70] Xubin He,et al. Transparent Symmetric Active/Active Replication for Service-Level High Availability , 2007, Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07).