From experiment to design - fault characterization and detection in parallel computer systems using computational accelerators

[1]  Ravishankar K. Iyer,et al.  Hierarchical Simulation Approach to Accurate Fault Modeling for System Dependability Evaluation , 1999, IEEE Trans. Software Eng..

[2]  Ravishankar K. Iyer,et al.  DEPEND: A Simulation-Based Environment for System Level Dependability Analysis , 1997, IEEE Trans. Computers.

[3]  James L. Walsh,et al.  IBM experiments in soft fails in computer electronics (1978-1994) , 1996, IBM J. Res. Dev..

[4]  Stephanie Forrest,et al.  A sense of self for Unix processes , 1996, Proceedings 1996 IEEE Symposium on Security and Privacy.

[5]  Jun Yang,et al.  Frequent value compression in data caches , 2000, MICRO 33.

[6]  G. Amdahl,et al.  Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).

[7]  Pedro J. Gil,et al.  A prototype of a VHDL-based fault injection tool: description and application , 2002, J. Syst. Archit..

[8]  Jason Cong,et al.  Application-specific instruction generation for configurable processor architectures , 2004, FPGA '04.

[9]  Cristian Constantinescu,et al.  Trends and Challenges in VLSI Circuit Reliability , 2003, IEEE Micro.

[10]  Sanjay J. Patel,et al.  Characterizing the effects of transient faults on a high-performance processor pipeline , 2004, International Conference on Dependable Systems and Networks, 2004.

[11]  James L. Walsh,et al.  Field testing for cosmic ray soft errors in semiconductor memories , 1996, IBM J. Res. Dev..

[12]  John A. Gunnels,et al.  Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[13]  Dan Tsafrir,et al.  System noise, OS clock ticks, and fine-grained parallel applications , 2005, ICS '05.

[14]  Barry W. Johnson,et al.  System-level modeling in the ADEPT environment of a distributed computer system for real-time applications , 1995, Proceedings of 1995 IEEE International Computer Performance and Dependability Symposium.

[15]  Huntington W. Curtis,et al.  Accelerated testing for cosmic soft-error rate , 1996, IBM J. Res. Dev..

[16]  Tipp Moseley,et al.  Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[17]  Mahmut T. Kandemir,et al.  Analyzing heap error behavior in embedded JVM environments , 2004, International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.

[18]  Miguel Castro,et al.  Fast byte-granularity software fault isolation , 2009, SOSP '09.

[19]  Edward J. McCluskey,et al.  Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..

[20]  Kevin Skadron,et al.  A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors , 2007, GH '07.

[21]  Eduardo Pinheiro,et al.  DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.

[22]  Diana Marculescu,et al.  Multiple Transient Faults in Combinational and Sequential Circuits: A Systematic Approach , 2010, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[23]  Matthias Hauswirth,et al.  Automating performance testing of interactive Java applications , 2010, AST '10.

[24]  L. Borucki,et al.  Comparison of accelerated DRAM soft error rates measured at component and system level , 2008, 2008 IEEE International Reliability Physics Symposium.

[25]  David Lie,et al.  Using VMM-based sensors to monitor honeypots , 2006, VEE '06.

[26]  Fred L. Yang,et al.  Simulation of faults causing analog behavior in digital circuits , 1992 .

[27]  Bianca Schroeder,et al.  Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? , 2007, FAST.

[28]  Ryuji Kan,et al.  Validation of hardware error recovery mechanisms for the SPARC64 V microprocessor , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[29]  Rakesh Kumar,et al.  Algorithmic approaches to low overhead fault detection for sparse linear algebra , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[30]  Sayed Mohammad Kia,et al.  Micro embedded monitoring for security in application specific instruction-set processors , 2005, CASES '05.

[31]  Inderpal S. Bhandari,et al.  Orthogonal Defect Classification - A Concept for In-Process Measurements , 1992, IEEE Trans. Software Eng..

[32]  Jan Vitek,et al.  Efficient intrusion detection using automaton inlining , 2005, 2005 IEEE Symposium on Security and Privacy (S&P'05).

[33]  Ravishankar K. Iyer,et al.  Error sensitivity of the Linux kernel executing on PowerPC G4 and Pentium 4 processors , 2004, International Conference on Dependable Systems and Networks, 2004.

[34]  Ravishankar K. Iyer,et al.  FAMAS: FAult Modeling via Adaptive Simulation , 1997, Proceedings Tenth International Conference on VLSI Design.

[35]  James F. Ziegler,et al.  Terrestrial cosmic rays , 1996, IBM J. Res. Dev..

[36]  Ravishankar K. Iyer,et al.  NFTAPE: a framework for assessing dependability in distributed systems with lightweight fault injectors , 2000, Proceedings IEEE International Computer Performance and Dependability Symposium. IPDS 2000.

[37]  G. C. Messenger,et al.  Collection of Charge on Junction Nodes from Ion Tracks , 1982, IEEE Transactions on Nuclear Science.

[38]  Charng-Da Lu,et al.  Assessing Fault Sensitivity in MPI Applications , 2004, Proceedings of the ACM/IEEE SC2004 Conference.

[39]  Ravishankar K. Iyer,et al.  Measurement-based analysis of fault and error sensitivities of dynamic memory , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

[40]  Joel Emer,et al.  A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36.

[41]  David A. Wagner,et al.  Intrusion detection via static analysis , 2001, Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.

[42]  Satoshi Matsuoka,et al.  A high-performance fault-tolerant software framework for memory on commodity GPUs , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[43]  Ram Chillarege,et al.  Understanding large system failures-a fault injection experiment , 1989, [1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[44]  Henrique Madeira,et al.  RIFLE: A General Purpose Pin-level Fault Injector , 1994, EDCC.

[45]  Kishor S. Trivedi.  Probability and Statistics with Reliability, Queuing, and Computer Science Applications , 1984 .

[46]  Edward J. McCluskey,et al.  Concurrent Error Detection Using Watchdog Processors - A Survey , 1988, IEEE Trans. Computers.

[47]  Ravishankar K. Iyer,et al.  Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[48]  Jean Arlat,et al.  Dependability of COTS Microkernel-Based Systems , 2002, IEEE Trans. Computers.

[49]  Volodymyr Kindratenko,et al.  On testing GPU memory for hard and soft errors , 2011 .

[50]  Ravishankar K. Iyer,et al.  Automated Derivation of Application-aware Error Detectors using Static Analysis , 2007, 13th IEEE International On-Line Testing Symposium (IOLTS 2007).

[51]  David Kaeli,et al.  Virtual machine monitor-based lightweight intrusion detection , 2011, OPSR.

[52]  Edward J. McCluskey,et al.  Word-voter: a new voter design for triple modular redundant systems , 2000, Proceedings 18th IEEE VLSI Test Symposium.

[53]  D.A. Rennels,et al.  Fault Injection Campaign for a Fault Tolerant Duplex Framework , 2007, 2007 IEEE Aerospace Conference.

[54]  Carl E. Landwehr,et al.  Basic concepts and taxonomy of dependable and secure computing , 2004, IEEE Transactions on Dependable and Secure Computing.

[55]  William G. Griswold,et al.  Dynamically discovering likely program invariants to support program evolution , 1999, Proceedings of the 1999 International Conference on Software Engineering (IEEE Cat. No.99CB37002).

[56]  Algirdas Avizienis,et al.  The N-Version Approach to Fault-Tolerant Software , 1985, IEEE Transactions on Software Engineering.

[57]  Rajeev Rastogi,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD 2000.

[58]  R. Butler.  Outlier Discordancy Tests in the Normal Linear Model , 1983 .

[59]  Jack J. Dongarra,et al.  The LINPACK Benchmark: past, present and future , 2003, Concurr. Comput. Pract. Exp..

[60]  Ravishankar K. Iyer,et al.  Measurement-Based Analysis of Error Latency , 1987, IEEE Transactions on Computers.

[61]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[62]  Philip K. Chan,et al.  Learning Patterns from Unix Process Execution Traces for Intrusion Detection , 1997 .

[63]  Amin Ansari,et al.  Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS 2010.

[64]  Huiyang Zhou,et al.  Understanding software approaches for GPGPU reliability , 2009, GPGPU-2.

[65]  Xin Li,et al.  A Memory Soft Error Measurement on Production Systems , 2007, USENIX Annual Technical Conference.

[66]  Daniel P. Siewiorek,et al.  Fault Injection Experiments Using FIAT , 1990, IEEE Trans. Computers.

[67]  George M. Castillo,et al.  Single event upset testing of commercial off-the-shelf electronics for launch vehicle applications , 2011, 2011 Aerospace Conference.

[68]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[69]  P. Rousseeuw,et al.  Computing depth contours of bivariate point clouds , 1996 .

[70]  Hans P. Muhlfeld,et al.  Cosmic ray soft error rates of 16-Mb DRAM memory chips , 1998, IEEE J. Solid State Circuits.

[71]  Bernd Becker,et al.  A study of cognitive resilience in a JPEG compressor , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[72]  Grigore Rosu,et al.  Mop: an efficient and generic runtime verification framework , 2007, OOPSLA.

[73]  Timothy J. Slegel,et al.  IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.

[74]  N. Hengartner,et al.  Predicting the number of fatal soft errors in Los Alamos national laboratory's ASC Q supercomputer , 2005, IEEE Transactions on Device and Materials Reliability.

[75]  Eun Ha Kim,et al.  Implementing an Effective Test Automation Framework , 2009, 2009 33rd Annual IEEE International Computer Software and Applications Conference.

[76]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[77]  David I. August,et al.  Automatic Instruction-Level Software-Only Recovery , 2006, IEEE Micro.

[78]  Jacob A. Abraham,et al.  Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.

[79]  Ravishankar K. Iyer,et al.  An architectural framework for providing reliability and security support , 2004, International Conference on Dependable Systems and Networks, 2004.

[80]  Janak H. Patel,et al.  Reliability of scrubbing recovery-techniques for memory systems , 1990 .

[81]  Sarita V. Adve,et al.  Using likely program invariants to detect hardware errors , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[82]  Kang G. Shin,et al.  Measurement and Application of Fault Latency , 1986, IEEE Transactions on Computers.

[83]  Sanjay J. Patel,et al.  ReStore: symptom based soft error detection in microprocessors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[84]  Neha Narula,et al.  Native Client: A Sandbox for Portable, Untrusted x86 Native Code , 2009, IEEE Symposium on Security and Privacy.

[85]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[86]  Ravishankar K. Iyer,et al.  FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior under Faults , 1993, IEEE Trans. Software Eng..

[87]  Hovav Shacham,et al.  On the effectiveness of address-space randomization , 2004, CCS '04.

[88]  Ravishankar K. Iyer,et al.  Quantitative Analysis of Long-Latency Failures in System Software , 2009, 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing.

[89]  Jean-Claude Laprie,et al.  Dependable computing: concepts, limits, challenges , 1995 .

[90]  Vijay S. Pande,et al.  Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU , 2009, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.

[91]  Henrique Madeira,et al.  Emulation of Software Faults: A Field Data Study and a Practical Approach , 2006, IEEE Transactions on Software Engineering.

[92]  Karthikeyan Sankaralingam,et al.  Relax: an architectural framework for software recovery of hardware faults , 2010, ISCA.

[93]  H.H.K. Tang,et al.  Measurement of the flux and energy spectrum of cosmic-ray induced neutrons on the ground , 2004, IEEE Transactions on Nuclear Science.

[94]  Neeraj Suri,et al.  On the placement of software mechanisms for detection of data errors , 2002, Proceedings International Conference on Dependable Systems and Networks.

[95]  Jacob A. Abraham,et al.  FERRARI: A Flexible Software-Based Fault and Error Injection System , 1995, IEEE Trans. Computers.

[96]  Kevin Skadron,et al.  The visual vulnerability spectrum: characterizing architectural vulnerability for graphics hardware , 2006, GH '06.

[97]  Ravishankar K. Iyer,et al.  FOCUS: An Experimental Environment for Fault Sensitivity Analysis , 1992, IEEE Trans. Computers.

[98]  Ravishankar K. Iyer,et al.  Microprocessor sensitivity to failures: control vs. execution and combinational vs. sequential logic , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[99]  Brad Calder,et al.  Phase tracking and prediction , 2003, ISCA '03.

[100]  Salvatore J. Stolfo,et al.  A data mining framework for building intrusion detection models , 1999, Proceedings of the 1999 IEEE Symposium on Security and Privacy (Cat. No.99CB36344).

[101]  Daniel Pierre Bovet,et al.  Understanding the Linux Kernel , 2000 .

[102]  Milos Krstic,et al.  FPGA implementation of hardware voter , 2001, 5th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Service. TELSIKS 2001. Proceedings of Papers (Cat. No.01EX517).