JVM Susceptibility to Memory Errors

Modern computer systems are becoming more powerful and are using larger memories. However, except for very high-end systems, little attention is being paid to high availability. This is particularly true for transient memory errors, which typically cause the entire system to fail. We believe that this situation can be improved by addressing memory errors at all levels of the system, bringing commodity systems closer to mainframe-class availability. In this paper, we use fault injection experiments to investigate memory error susceptibility at the highest level, using a JVM and four Java benchmark applications. We then consider checksums over JVM data structures as a way to increase detection of silent data corruption affecting the JVM and applications. Our results indicate that the JVM's heap area has a higher memory error susceptibility than its static data area, and that we can detect up to 39% of all memory errors in the JVM and application. We believe that such techniques will allow commodity systems to be made much more robust and less susceptible to transient errors.
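The checksum idea can be illustrated with a minimal sketch. This is not the instrumentation used in the paper: the abstract does not say which checksum algorithm or which JVM data structures were protected, so the use of CRC32, the byte-array stand-in for a data structure, and the names `ChecksumSketch`, `checksum`, `flipBit`, and `isIntact` are all assumptions made for illustration. The sketch stores a checksum for a block of data, emulates a transient single-bit memory error by flipping one bit, and shows that re-verification detects the corruption.

```java
import java.util.zip.CRC32;

public class ChecksumSketch {
    // Compute a CRC32 over the bytes of a (hypothetical) protected structure.
    public static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Flip one bit in place, emulating an injected transient memory error.
    public static void flipBit(byte[] data, int bitIndex) {
        data[bitIndex / 8] ^= (byte) (1 << (bitIndex % 8));
    }

    // Recompute the checksum and compare it with the stored value.
    public static boolean isIntact(byte[] data, long expected) {
        return checksum(data) == expected;
    }

    public static void main(String[] args) {
        // Stand-in for a JVM data structure; the contents are arbitrary.
        byte[] structure = new byte[64];
        for (int i = 0; i < structure.length; i++) {
            structure[i] = (byte) i;
        }

        long stored = checksum(structure);
        System.out.println("before injection: intact = " + isIntact(structure, stored));

        flipBit(structure, 100); // inject a single-bit error
        System.out.println("after injection:  intact = " + isIntact(structure, stored));
    }
}
```

A check of this kind only detects an error when the structure is verified before the corrupted data is used, and the stored checksum must be refreshed on every legitimate update. That limits which structures can be protected cheaply, which is consistent with a detection rate well below 100%.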
