Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations
Xubin He | Christian Engelmann | Stephen L. Scott | Chokchai Leangsuksun
[1] Christian Engelmann,et al. JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management , 2006, 2006 IEEE International Conference on Cluster Computing.
[2] Rafael M. Gasca,et al. Towards a Dependable Architecture for Highly Available Internet Services , 2008, 2008 Third International Conference on Availability, Reliability and Security.
[3] Danny Dolev,et al. Early delivery totally ordered multicast in asynchronous environments , 1993, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.
[4] Xubin He,et al. Symmetric Active/Active Replication for Dependent Services , 2008, 2008 Third International Conference on Availability, Reliability and Security.
[5] Terry Jones,et al. HPC System Call Usage Trends , 2007 .
[6] Anoop Gupta,et al. Parallel computer architecture - a hardware / software approach , 1998 .
[7] Jack J. Dongarra,et al. Fault Tolerant MPI for the HARNESS Meta-computing System , 2001, International Conference on Computational Science.
[8] Xin Chen,et al. Symmetric active/active metadata service for high availability parallel file systems , 2009, J. Parallel Distributed Comput..
[9] Thomas Hérault,et al. MPI tools and performance studies - Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI , 2006, SC.
[10] Ramakrishna Kotla,et al. Zyzzyva , 2007, SOSP.
[11] T. Inglett,et al. Designing a Highly-Scalable Operating System: The Blue Gene/L Story , 2006, ACM/IEEE SC 2006 Conference (SC'06).
[12] Leslie Lamport,et al. The Byzantine Generals Problem , 1982, TOPL.
[13] Luís E. T. Rodrigues,et al. An indulgent uniform total order algorithm with optimistic delivery , 2002, 21st IEEE Symposium on Reliable Distributed Systems, 2002. Proceedings..
[14] Luís Moura Silva,et al. An experimental study about diskless checkpointing , 1998, Proceedings. 24th EUROMICRO Conference (Cat. No.98EX204).
[15] John A. Gunnels,et al. Simulating solidification in metals at high pressure: The drive to petascale computing , 2006 .
[16] Xubin He,et al. A Fast Delivery Protocol for Total Order Broadcasting , 2007, 2007 16th International Conference on Computer Communications and Networks.
[17] Jack J. Dongarra,et al. Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing , 1997, J. Parallel Distributed Comput..
[18] James Arthur Kohl,et al. HARNESS: a next generation distributed virtual machine , 1999, Future Gener. Comput. Syst..
[19] Jason Duell,et al. Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters , 2006 .
[20] Song Jiang,et al. Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[21] Andrew Lumsdaine,et al. A Component Architecture for LAM/MPI , 2003, PVM/MPI.
[22] Andy B. Yoo,et al. X-ray Pulse Compression Using Strained Crystals , 2002 .
[23] Louise E. Moser,et al. The Totem single-ring ordering and membership protocol , 1995, TOCS.
[24] Leslie Lamport,et al. Using Time Instead of Timeout for Fault-Tolerant Distributed Systems , 1984, TOPL.
[25] Wu-chun Feng,et al. A Power-Aware Run-Time System for High-Performance Computing , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[26] Robbert van Renesse,et al. Design and Performance of Horus: A Lightweight Group Communications System , 1994 .
[27] Laurent Lefèvre,et al. T2CP-AR: A system for Transparent TCP Active Replication , 2007, 21st International Conference on Advanced Information Networking and Applications (AINA '07).
[28] Andrew Robertson. The Evolution of the Linux-HA Project , 2004 .
[29] Bianca Schroeder,et al. Understanding failures in petascale computers , 2007 .
[30] Hong Ong,et al. Middleware in Modern High Performance Computing System Architectures , 2007, International Conference on Computational Science.
[31] Silvano Maffeis,et al. The Object Group Design Pattern , 1996, COOTS.
[32] David P. Anderson,et al. BOINC: a system for public-resource computing and storage , 2004, Fifth IEEE/ACM International Workshop on Grid Computing.
[33] Louise E. Moser,et al. Extended virtual synchrony , 1994, 14th International Conference on Distributed Computing Systems.
[34] Leslie Lamport,et al. Time, clocks, and the ordering of events in a distributed system , 1978, CACM.
[35] William Gropp,et al. Beowulf Cluster Computing with Linux , 2003 .
[36] Christoph Kreitz,et al. Building reliable, high-performance communication systems from components , 2000, OPSR.
[37] Nancy A. Lynch,et al. Early-Delivery Dynamic Atomic Broadcast , 2002, DISC.
[38] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[39] Heather M. Quinn,et al. Terrestrial-based radiation upsets: a cautionary tale , 2005, 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'05).
[40] Sean Landis,et al. Building Reliable Distributed Systems with CORBA , 1997, Theory Pract. Object Syst..
[41] C. Leangsuksun,et al. Asymmetric Active-Active High Availability for High-end Computing , 2005 .
[42] Claudiu Danilov,et al. The Spread Toolkit: Architecture and Performance , 2004 .
[43] Srinidhi Varadarajan,et al. DejaVu: transparent user-level checkpointing, migration and recovery for distributed systems , 2006, SC.
[44] James E. Smith,et al. The architecture of virtual machines , 2005, Computer.
[45] Robbert van Renesse,et al. The Amoeba distributed operating system - A status report , 1991, Comput. Commun..
[46] Hagit Attiya,et al. Distributed computing - fundamentals, simulations, and advanced topics (2. ed.) , 2004, Wiley series on parallel and distributed computing.
[47] George Bosilca,et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation , 2004, PVM/MPI.
[48] Louise E. Moser,et al. A reliable ordered delivery protocol for interconnected local area networks , 1995, Proceedings of International Conference on Network Protocols.
[49] Philipp Reisner,et al. Replicated Storage with Shared Disk Semantics , 2007 .
[50] Xubin He,et al. On Programming Models for Service-Level High Availability , 2007, The Second International Conference on Availability, Reliability and Security (ARES'07).
[51] Christian Engelmann,et al. Super-Scalable Algorithms for Computing on 100,000 Processors , 2005, International Conference on Computational Science.
[52] Christian Engelmann,et al. Proactive fault tolerance for HPC with Xen virtualization , 2007, ICS '07.
[53] Dhabaleswar K. Panda,et al. Benefits of high speed interconnects to cluster file systems: a case study with Lustre , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[54] Priya Narasimhan,et al. Thema: Byzantine-fault-tolerant middleware for Web-service applications , 2005, 24th IEEE Symposium on Reliable Distributed Systems (SRDS'05).
[55] Kenneth P. Birman,et al. Performance of the ISIS Distributed Computing Toolkit , 1994 .
[56] Daniel J. Palermo,et al. Enhancing an Open Source Resource Manager with Multi-core/Multi-threaded Support , 2007, JSSPP.
[57] G. A. Geist,et al. High Availability through Distributed Control , 2004 .
[58] Michael K. Reiter,et al. Low-overhead byzantine fault-tolerant storage , 2007, SOSP.
[59] A. Singh,et al. Fault-tolerant systems , 1990, Computer.
[60] Fred B. Schneider,et al. Implementing fault-tolerant services using the state machine approach: a tutorial , 1990, CSUR.
[61] Zizhong Chen,et al. Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[62] Gustavo Alonso,et al. Using Optimistic Atomic Broadcast in Transaction Processing Systems , 2003, IEEE Trans. Knowl. Data Eng..
[63] Courtenay T. Vaughan,et al. Extending Catamount for multi-core processors , 2007 .
[64] Sheng-Kai Hung,et al. Modularized Redundant Parallel Virtual File System , 2005, Asia-Pacific Computer Systems Architecture Conference.
[65] Christian Engelmann,et al. Symmetric Active/Active High Availability for High-Performance Computing System Services , 2006, J. Comput..
[66] Robbert van Renesse,et al. Reliable Distributed Computing with the Isis Toolkit , 1994 .
[67] Andrew S. Tanenbaum,et al. An evaluation of the Amoeba group communication system , 1996, Proceedings of 16th International Conference on Distributed Computing Systems.
[68] Christian Engelmann,et al. Active/active replication for highly available HPC system services , 2006, First International Conference on Availability, Reliability and Security (ARES'06).
[69] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..
[70] James Arthur Kohl,et al. Harness: Adaptable Virtual Machine Environment for Heterogeneous Clusters , 1999, Parallel Process. Lett..
[71] Jason Duell,et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..
[72] Christian Engelmann,et al. Symmetric active/active metadata service for highly available cluster storage systems , 2007 .
[73] Suzanne M. Kelly,et al. Catamount Software Architecture with Dual Core Extensions , 2005 .
[74] Christian Engelmann,et al. High Availability for Ultra-Scale High-End Scientific Computing , 2008 .
[75] Xubin He,et al. Transparent Symmetric Active/Active Replication for Service-Level High Availability , 2007, Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07).
[76] Pedro Pla. DRBD in a heartbeat , 2006 .
[77] Douglas Thain,et al. Distributed computing in practice: the Condor experience , 2005, Concurr. Pract. Exp..
[78] Robert B. Ross,et al. PVFS: A Parallel File System for Linux Clusters , 2000, Annual Linux Showcase & Conference.
[79] Christian Engelmann,et al. Distributed Peer-to-Peer Control in Harness , 2002, International Conference on Computational Science.
[80] Leslie Lamport,et al. The part-time parliament , 1998, TOCS.
[81] Jack J. Dongarra,et al. HARNESS and fault tolerant MPI , 2001, Parallel Comput..
[82] Richard E. Harper,et al. A Case for High Availability in a Virtualized Environment (HAVEN) , 2008, 2008 Third International Conference on Availability, Reliability and Security.
[83] Miguel Castro,et al. Practical byzantine fault tolerance and proactive recovery , 2002, TOCS.
[84] Christian Engelmann,et al. Job-Site Level Fault Tolerance for Cluster and Grid environments , 2005, 2005 IEEE International Conference on Cluster Computing.
[85] Sam Toueg,et al. A Modular Approach to Fault-Tolerant Broadcasts and Related Problems , 1994 .
[86] Christian Engelmann,et al. Concepts for High Availability in Scientific High-End Computing , 2005 .
[87] David E. Bernholdt,et al. MOLAR: adaptive runtime support for high-end computing operating and runtime systems , 2006, OPSR.
[88] Salim Hariri,et al. Tools and Environments for Parallel and Distributed Computing , 2004 .
[89] Miguel Castro,et al. Using abstraction to improve fault tolerance , 2001, Proceedings Eighth Workshop on Hot Topics in Operating Systems.
[90] Miguel Castro,et al. BASE: using abstraction to improve fault tolerance , 2001, SOSP.
[91] Danny Dolev,et al. The Transis approach to high availability cluster communication , 1996, CACM.
[92] Suzanne M. Kelly,et al. Software Architecture of the Light Weight Kernel, Catamount , 2005 .
[93] W. Vogels,et al. The Horus and Ensemble projects: accomplishments and limitations , 2000, Proceedings DARPA Information Survivability Conference and Exposition. DISCEX'00.
[94] Hari Balakrishnan,et al. Tolerating byzantine faults in transaction processing systems using commit barrier scheduling , 2007, SOSP.
[95] Idit Keidar,et al. Group communication specifications: a comprehensive study , 2001, CSUR.
[96] Roberto Baldoni,et al. Total Order Communications: A Practical Analysis , 2005, EDCC.
[97] Kees Verstoep,et al. Group communication in Amoeba and its applications , 1993, Distributed Syst. Eng..
[98] David P. Anderson,et al. SETI@home-massively distributed computing for SETI , 2001, Comput. Sci. Eng..
[99] Sheng-Kai Hung,et al. DPCT: Distributed Parity Cache Table for Redundant Parallel File System , 2006, HPCC.
[100] Stephen L. Scott,et al. Evaluation of fault-tolerant policies using simulation , 2007, 2007 IEEE International Conference on Cluster Computing.
[101] Mario Lauria,et al. CSAR: cluster storage with adaptive redundancy , 2003, 2003 International Conference on Parallel Processing, 2003. Proceedings..
[102] Christian Engelmann,et al. A diskless checkpointing algorithm for super-scale architectures applied to the fast fourier transform , 2003, Proceedings of the International Workshop on Challenges of Large Applications in Distributed Environments, 2003..
[103] Sarah Ellen Michalak,et al. Application MTTFE vs. Platform MTBF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[104] Danny Dolev,et al. The Design of the Transis System , 1994, Dagstuhl Seminar on Distributed Systems.
[105] Benjamin Ray Seyfarth,et al. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters , 2000, Scalable Comput. Pract. Exp..
[106] Kenneth P. Birman,et al. Reliable Distributed Systems: Technologies, Web Services, and Applications , 2005 .
[107] Xubin He,et al. Design of a high performance and high availability distributed storage system , 2006 .