Using lightweight checkpoint/recovery to improve the availability and designability of shared memory multiprocessors
[1] Milo M. K. Martin,et al. Bandwidth adaptive snooping , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.
[2] Dhiraj K. Pradhan,et al. Fault-tolerant computer system design , 1996 .
[3] Marc Tremblay,et al. High-Performance Fault-Tolerant VLSI Systems Using Micro Rollback , 1990, IEEE Trans. Computers.
[4] Paul Barford,et al. Generating representative Web workloads for network and server performance evaluation , 1998, SIGMETRICS '98/PERFORMANCE '98.
[5] Flaviu Cristian,et al. A timestamp-based checkpointing protocol for long-lived distributed computations , 1991, [1991] Proceedings Tenth Symposium on Reliable Distributed Systems.
[6] J. T. Robinson,et al. On optimistic methods for concurrency control , 1979, TODS.
[7] Kenneth C. Yeager. The Mips R10000 superscalar microprocessor , 1996, IEEE Micro.
[8] Phillip B. Gibbons,et al. Testing Shared Memories , 1997, SIAM J. Comput.
[9] Leslie Lamport,et al. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 1979, IEEE Transactions on Computers.
[10] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1969 .
[11] Eric Rotenberg,et al. AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).
[12] Todd C. Mowry,et al. The potential for using thread-level data speculation to facilitate automatic parallelization , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.
[13] W. Daniel Hillis,et al. The Network Architecture of the Connection Machine CM-5 , 1996, J. Parallel Distributed Comput..
[14] Omri Serlin. Fault-Tolerant Systems in Commercial Applications , 1984, Computer.
[15] Sarita V. Adve,et al. Using speculative retirement and larger instruction windows to narrow the performance gap between memory consistency models , 1997, SPAA '97.
[16] Milo M. K. Martin,et al. Fast Checkpoint/Recovery to Support Kilo-Instruction Speculation and Hardware Fault Tolerance , 2000 .
[17] Haitham Akkary,et al. A dynamic multithreading processor , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.
[18] Eric Rotenberg,et al. Slipstream processors: improving both performance and fault tolerance , 2000, SIGP.
[19] Hugh Garraway. Parallel Computer Architecture: A Hardware/Software Approach , 1999, IEEE Concurrency.
[20] Lorenzo Alvisi,et al. Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.
[21] Kourosh Gharachorloo,et al. Architecture and design of AlphaServer GS320 , 2000, SIGP.
[22] T. N. Vijaykumar,et al. Is SC+ILP=RC? , 1999, Proceedings of the 26th International Symposium on Computer Architecture (Cat. No.99CB36367).
[23] Miron Livny,et al. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System , 1997 .
[24] Josep Torrellas,et al. Architectural support for scalable speculative parallelization in shared-memory multiprocessors , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[25] R. N. Gustafson,et al. IBM 3081 Processor Unit: Design Considerations and Design Process , 1982, IBM J. Res. Dev..
[26] Mikko H. Lipasti,et al. Dynamic Verification of Cache Coherence , 2022 .
[27] Jeffrey S. Chase,et al. Integrating coherency and recoverability in distributed systems , 1994, OSDI '94.
[28] Edmund M. Clarke,et al. Formal Methods: State of the Art and Future Directions Working Group Members , 1996 .
[29] James R. Larus,et al. The Wisconsin Wind Tunnel: virtual prototyping of parallel computers , 1993, SIGMETRICS '93.
[30] Shubhendu S. Mukherjee,et al. Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.
[31] Lisa Spainhower,et al. IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective , 1999, IBM J. Res. Dev..
[32] Dave Johnson,et al. The Intel 432: A VLSI Architecture for Fault-Tolerant Computer Systems , 1984, Computer.
[33] Kewal K. Saluja,et al. A study of time-redundant fault tolerance techniques for high-performance pipelined computers , 1989, [1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.
[34] Rana Ejaz Ahmed,et al. Cache-aided rollback error recovery (CARER) algorithm for shared-memory multiprocessor systems , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.
[35] Mike Galles. Spider: a high-speed network interconnect , 1997, IEEE Micro.
[36] Milo M. K. Martin,et al. Specifying and Verifying a Broadcast and a Multicast Snooping Cache Coherence Protocol , 2002, IEEE Trans. Parallel Distributed Syst..
[37] R. M. Fujimoto,et al. Parallel discrete event simulation , 1989, WSC '89.
[38] Randy H. Katz,et al. A case for redundant arrays of inexpensive disks (RAID) , 1988, SIGMOD '88.
[39] D. Jewett,et al. Integrity S2: A Fault-Tolerant Unix Platform , 1991, Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, 'Highlights from Twenty-Five Years'.
[40] Milo M. K. Martin,et al. Timestamp snooping: an approach for extending SMPs , 2000, ASPLOS.
[41] Rajiv Gupta. The fuzzy barrier: a mechanism for high speed synchronization of processors , 1989, ASPLOS III.
[42] Bronis R. de Supinski,et al. Logical time coherence maintenance , 1998 .
[43] Bob Bentley. Validating the Intel® Pentium® 4 microprocessor , 2001, 2001 International Conference on Dependable Systems and Networks.
[44] Anne-Marie Kermarrec,et al. COMA: An Opportunity for Building Fault-Tolerant Scalable Shared Memory Multiprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[45] Kun-Lung Wu,et al. Recoverable Distributed Shared Virtual Memory , 1990, IEEE Trans. Computers.
[46] Richard Koo,et al. Checkpointing and Rollback-Recovery for Distributed Systems , 1986, IEEE Transactions on Software Engineering.
[47] Gurindar S. Sohi,et al. Speculative versioning cache , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.
[48] Liviu Iftode,et al. Scalable Fault-Tolerant Distributed Shared Memory , 2000, ACM/IEEE SC 2000 Conference (SC'00).
[49] Christos H. Papadimitriou,et al. The Theory of Database Concurrency Control , 1986 .
[50] Yi-Min Wang,et al. Integrating checkpointing with transaction processing , 1997, Proceedings of IEEE 27th International Symposium on Fault Tolerant Computing.
[51] James S. Plank,et al. An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance , 1997 .
[52] Janak H. Patel,et al. Error Recovery in Shared Memory Multiprocessors Using Private Caches , 1990, IEEE Trans. Parallel Distributed Syst..
[53] Todd M. Austin,et al. DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.
[54] Mark D. Hill,et al. Using Lamport clocks to reason about relaxed memory models , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.
[55] Mark D. Hill,et al. Lamport clocks: verifying a directory cache-coherence protocol , 1998, SPAA '98.
[56] R. H. Katz,et al. Using cache mechanisms to exploit nonrefreshing DRAMs for on-chip memories , 1991 .
[57] Andrew R. Pleszkun,et al. Implementing Precise Interrupts in Pipelined Processors , 1988, IEEE Trans. Computers.
[58] M. Bohr. Interconnect scaling-the real limiter to high performance ULSI , 1995, Proceedings of International Electron Devices Meeting.
[59] Kunle Olukotun,et al. Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor , 1997 .
[60] Kamran Eshraghian,et al. Principles of CMOS VLSI Design: A Systems Perspective , 1985 .
[61] M. Hill,et al. Optimistic Simulation of Parallel Architectures Using Program Executables , 1996, Proceedings of Symposium on Parallel and Distributed Tools.
[62] W. W. Peterson,et al. Error-Correcting Codes , 1962 .
[63] Philip A. Bernstein,et al. Sequoia: a fault-tolerant tightly coupled multiprocessor for transaction processing , 1988, Computer.
[64] Erik Hagersten,et al. WildFire: a scalable path for SMPs , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.
[65] Timothy J. Maloney,et al. The Quality and Reliability of Intel's Quarter Micron Process , 2000 .
[66] Willy Zwaenepoel,et al. Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit , 1992, IEEE Trans. Computers.
[67] Fredrik Larsson,et al. Simics: A Full System Simulation Platform , 2002, Computer.
[68] Anoop Gupta,et al. The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.
[69] Milo M. K. Martin,et al. SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.
[70] William J. Dally. Virtual-channel flow control , 1990, ISCA '90.
[71] Randy H. Katz,et al. Verifying a multiprocessor cache controller using random test generation , 1990, IEEE Design & Test of Computers.
[72] R. H. Havemann,et al. High-performance interconnects: an integration overview , 2001, Proc. IEEE.
[73] Randy H. Katz,et al. A case for redundant arrays of inexpensive disks (RAID) , 1988 .
[74] Manuel Blum,et al. Reflections on the Pentium Bug , 1996, IEEE Trans. Computers.
[75] Paul F. Reynolds,et al. Isotach Networks , 1997, IEEE Trans. Parallel Distributed Syst..
[76] Manuel Blum,et al. Designing programs that check their work , 1989, STOC '89.
[77] Shubhendu S. Mukherjee,et al. Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[78] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..
[79] Antonio Rubio,et al. An approach to crosstalk effect analysis and avoidance techniques in digital CMOS VLSI circuits , 1988 .
[80] Shubhendu S. Mukherjee,et al. The Alpha 21364 network architecture , 2001, HOT 9 Interconnects. Symposium on High Performance Interconnects.
[81] Alan E. Charlesworth,et al. Starfire: extending the SMP envelope , 1998, IEEE Micro.
[82] Yi-Min Wang,et al. Checkpointing and its applications , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.
[83] William J. Dally,et al. Architecture and implementation of the reliable router , 1994, Symposium Record Hot Interconnects II.
[84] Rajiv Gupta. The fuzzy barrier: a mechanism for high speed synchronization of processors , 1989, ASPLOS 1989.
[85] Solomon W. Golomb,et al. Shift Register Sequences , 1981 .
[86] José Duato,et al. A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks , 1993, IEEE Trans. Parallel Distributed Syst..
[87] Carolyn Craig Williams,et al. Concurrency control in asynchronous computations , 1993 .
[88] D. B. Davis,et al. Sun Microsystems Inc. , 1993 .
[89] M. Hill,et al. Multicast snooping: a new coherence method using a multicast address network , 1999, Proceedings of the 26th International Symposium on Computer Architecture (Cat. No.99CB36367).
[90] Min Xu,et al. Evaluating Non-deterministic Multi-threaded Commercial Workloads , 2001 .
[91] D. Wilson. The STRATUS computer system , 1986 .
[92] James Laudon,et al. The SGI Origin: A ccNUMA Highly Scalable Server , 1997, ISCA.
[93] S. J. Frank,et al. Tightly coupled multiprocessor system speeds memory-access times , 1984 .
[94] Bruce S. Davie,et al. Computer Networks: A Systems Approach , 1996 .
[95] Robert P. Colwell. Maintaining a Leading Position , 1998 .
[96] Christine Morin,et al. An Architecture for Tolerating Processor Failures in Shared Memory Multiprocessors , 1996, IEEE Trans. Computers.
[97] Marc Tremblay,et al. Increasing Work, Pushing the Clock , 1998 .
[98] David A. Patterson,et al. Recovery Oriented Computing: A New Research Agenda for a New Century , 2002, HPCA.
[99] Richard L. Sites,et al. Alpha Architecture Reference Manual , 1995 .
[100] Leslie Lamport,et al. Time, clocks, and the ordering of events in a distributed system , 1978, CACM.
[101] Timothy J. Slegel,et al. IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.
[102] Steve Harrison,et al. Boosting system performance with optimistic distributed protocols , 2001 .
[103] Carl Ramey,et al. Functional verification of a multiple-issue, out-of-order, superscalar Alpha processor-the DEC Alpha 21264 microprocessor , 1998, Proceedings 1998 Design and Automation Conference. 35th DAC. (Cat. No.98CH36175).
[104] James L. Walsh,et al. IBM experiments in soft fails in computer electronics (1978-1994) , 1996, IBM J. Res. Dev..
[105] Melvin A. Breuer,et al. Digital systems testing and testable design , 1990 .
[106] Parameswaran Ramanathan,et al. Checkpointing and rollback recovery in a distributed system using common time base , 1988, Proceedings [1988] Seventh Symposium on Reliable Distributed Systems.
[107] Neil Weste,et al. Principles of CMOS VLSI Design , 1985 .
[108] Timothy J. Dell,et al. A white paper on the benefits of chipkill-correct ecc for pc server main memory , 1997 .
[109] Josep Torrellas,et al. ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors , 2002, ISCA.
[110] Josep Torrellas,et al. Removing architectural bottlenecks to the scalability of speculative parallelization , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.