Achieving Target MTTF by Duplicating Reliability-Critical Components in High Performance Computing Systems
Nithin Nakka | Alok N. Choudhary | John Bent | Gary Grider | James Nunez | Satsangat Khalsa
[1] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[2] Daniel P. Siewiorek,et al. Workload, Performance, and Reliability of Digital Computing Systems , 1980.
[3] Anand Sivasubramaniam,et al. SlicK: slice-based locality exploitation for efficient redundant multithreading , 2006, ASPLOS XII.
[4] Onur Mutlu,et al. Microarchitecture-based introspection: a technique for transient-fault tolerance in microprocessors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[5] Todd M. Austin,et al. DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.
[6] Dean M. Tullsen,et al. Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.
[7] Mark S. Squillante,et al. Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.
[8] Zhiling Lan,et al. Adaptive Fault Management of Parallel Applications for High-Performance Computing , 2008, IEEE Transactions on Computers.
[9] Irith Pomeranz,et al. Transient-fault recovery using simultaneous multithreading , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.
[10] K. Sundaramoorthy,et al. Slipstream processors: improving both performance and fault tolerance , 2000, SIGP.
[11] Edward J. McCluskey,et al. Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..
[12] Ravishankar K. Iyer,et al. Measurement and modeling of computer reliability as affected by system activity , 1986, TOCS.
[13] Timothy J. Slegel,et al. IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.
[14] Ravishankar K. Iyer,et al. Networked Windows NT system field failure data analysis , 1999, Proceedings 1999 Pacific Rim International Symposium on Dependable Computing.
[15] Srinivasan Seshan,et al. Subtleties in tolerating correlated failures , 2006.
[16] Richard P. Martin,et al. Improving cluster availability using workstation validation , 2002, SIGMETRICS '02.
[17] Ravishankar K. Iyer,et al. Failure analysis and modeling of a VAXcluster system , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.
[18] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.
[19] D. Jewett,et al. Integrity S2: A Fault-Tolerant Unix Platform , 1991, Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, 'Highlights from Twenty-Five Years'.
[20] Shubhendu S. Mukherjee,et al. Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[21] D. Jewett. Integrity S2: a fault-tolerant Unix platform , 1991, [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.
[22] Rajeev Thakur,et al. A Fault Diagnosis and Prognosis Service for TeraGrid Clusters , 2007.
[23] Nithin Nakka,et al. Failure Data-Driven Selective Node-Level Duplication to Improve MTTF in High Performance Computing Systems , 2009, HPCS.
[24] James S. Plank,et al. Experimental assessment of workstation failures and their impact on checkpointing systems , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).
[25] Todd M. Austin,et al. A fault tolerant approach to microprocessor design , 2001, 2001 International Conference on Dependable Systems and Networks.
[26] Eric Rotenberg,et al. AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).
[27] Algirdas Avizienis,et al. Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital System Design , 1971, IEEE Transactions on Computers.
[28] Arthur E. Cooper,et al. Development of On-Board Space Computer Systems , 1976, IBM J. Res. Dev..
[29] Zhiling Lan,et al. Using Adaptive Fault Tolerance to Improve Application Robustness on the TeraGrid , 2007.
[30] Anand Sivasubramaniam,et al. Fault-aware job scheduling for BlueGene/L systems , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..
[31] Edward J. McCluskey,et al. ED4I: Error Detection by Diverse Data and Duplicated Instructions , 2002, IEEE Trans. Computers.