Filtering failure logs for a BlueGene/L prototype

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. In this paper, we present our experiences in collecting and filtering error event logs from a 8192 processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list. We analyze the logs collected from this machine over a period of 84 days starting from August 26, 2004. We perform a three-step filtering algorithm on these logs: extracting and categorizing failure events; temporal filtering to remove duplicate reports from the same location; and finally coalescing failure reports of the same error across different locations. Using this approach, we can substantially compress these logs, removing over 99.96% of the 828,387 original entries, and more accurately portray the failure occurrences on this system.

[1]  Luiz C. Alves,et al.  Reliability, availability, and serviceability (RAS) of the IBM eServer z990 , 2004, IBM J. Res. Dev..

[2]  Mark S. Squillante,et al.  Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.

[3]  Daniel P. Siewiorek,et al.  A comparative analysis of event tupling schemes , 1996, Proceedings of Annual Symposium on Fault Tolerant Computing.

[4]  José E. Moreira,et al.  Job Scheduling for the BlueGene/L System , 2002, JSSPP.

[5]  Erik Riedel,et al.  More Than an Interface - SCSI vs. ATA , 2003, FAST.

[6]  James F. Ziegler,et al.  Terrestrial cosmic rays , 1996, IBM J. Res. Dev..

[7]  Ronald Minnich,et al.  A Network-Failure-Tolerant Message-Passing System for Terascale Clusters , 2002, ICS '02.

[8]  Margaret Martonosi,et al.  Dynamic thermal management for high-performance microprocessors , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[9]  Daniel P. Siewiorek,et al.  VAX/VMS event monitoring and analysis , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[10]  Anand Sivasubramaniam,et al.  A complexity-effective approach to ALU bandwidth enhancement for instruction-level temporal redundancy , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[11]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[12]  Sarita V. Adve,et al.  The impact of technology scaling on lifetime reliability , 2004, International Conference on Dependable Systems and Networks, 2004.

[13]  Ravishankar K. Iyer,et al.  Analysis and Modeling of Correlated Failures in Multicomputer Systems , 1992, IEEE Trans. Computers.

[14]  Todd M. Austin,et al.  A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor , 2003, MICRO.

[15]  Kevin Skadron,et al.  Temperature-aware microarchitecture , 2003, ISCA '03.

[16]  Ravishankar K. Iyer,et al.  Impact of Correlated Failures on Dependability in a VAXcluster System , 1992 .

[17]  Daniel P. Siewiorek,et al.  Error log analysis: statistical modeling and heuristic trend analysis , 1990 .

[18]  Ravishankar K. Iyer,et al.  Networked Windows NT system field failure data analysis , 1999, Proceedings 1999 Pacific Rim International Symposium on Dependable Computing.

[19]  David F. Heidel,et al.  An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[20]  Ravishankar K. Iyer,et al.  Failure analysis and modeling of a VAXcluster system , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.

[21]  Kishor S. Trivedi,et al.  Analysis and implementation of software rejuvenation in cluster systems , 2001, SIGMETRICS '01.