论文信息 - Filtering failure logs for a BlueGene/L prototype

Filtering failure logs for a BlueGene/L prototype

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. In this paper, we present our experiences in collecting and filtering error event logs from a 8192 processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list. We analyze the logs collected from this machine over a period of 84 days starting from August 26, 2004. We perform a three-step filtering algorithm on these logs: extracting and categorizing failure events; temporal filtering to remove duplicate reports from the same location; and finally coalescing failure reports of the same error across different locations. Using this approach, we can substantially compress these logs, removing over 99.96% of the 828,387 original entries, and more accurately portray the failure occurrences on this system.

[1] Luiz C. Alves,et al. Reliability, availability, and serviceability (RAS) of the IBM eServer z990 , 2004, IBM J. Res. Dev..

[2] Mark S. Squillante,et al. Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.

[3] Daniel P. Siewiorek,et al. A comparative analysis of event tupling schemes , 1996, Proceedings of Annual Symposium on Fault Tolerant Computing.

[4] José E. Moreira,et al. Job Scheduling for the BlueGene/L System , 2002, JSSPP.

[5] Erik Riedel,et al. More Than an Interface - SCSI vs. ATA , 2003, FAST.

[6] James F. Ziegler,et al. Terrestrial cosmic rays , 1996, IBM J. Res. Dev..

[7] Ronald Minnich,et al. A Network-Failure-Tolerant Message-Passing System for Terascale Clusters , 2002, ICS '02.

[8] Margaret Martonosi,et al. Dynamic thermal management for high-performance microprocessors , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[9] Daniel P. Siewiorek,et al. VAX/VMS event monitoring and analysis , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[10] Anand Sivasubramaniam,et al. A complexity-effective approach to ALU bandwidth enhancement for instruction-level temporal redundancy , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[11] Lorenzo Alvisi,et al. Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[12] Sarita V. Adve,et al. The impact of technology scaling on lifetime reliability , 2004, International Conference on Dependable Systems and Networks, 2004.

[13] Ravishankar K. Iyer,et al. Analysis and Modeling of Correlated Failures in Multicomputer Systems , 1992, IEEE Trans. Computers.

[14] Todd M. Austin,et al. A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor , 2003, MICRO.

[15] Kevin Skadron,et al. Temperature-aware microarchitecture , 2003, ISCA '03.

[16] Ravishankar K. Iyer,et al. Impact of Correlated Failures on Dependability in a VAXcluster System , 1992 .

[17] Daniel P. Siewiorek,et al. Error log analysis: statistical modeling and heuristic trend analysis , 1990 .

[18] Ravishankar K. Iyer,et al. Networked Windows NT system field failure data analysis , 1999, Proceedings 1999 Pacific Rim International Symposium on Dependable Computing.

[19] David F. Heidel,et al. An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[20] Ravishankar K. Iyer,et al. Failure analysis and modeling of a VAXcluster system , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.

[21] Kishor S. Trivedi,et al. Analysis and implementation of software rejuvenation in cluster systems , 2001, SIGMETRICS '01.