Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI
Dhabaleswar K. Panda | Raghunath Rajachandrasekar | Xavier Besseron
[1] Ronald Minnich. Supermon: High-Performance Monitoring for Linux Clusters, 2001, Annual Linux Showcase & Conference.
[2] Sean Dague. OpenHPI: An Open Source Reference Implementation of the SA Forum Hardware Platform Interface, 2004, ISAS.
[3] Zhiling Lan, et al. System log pre-processing to improve failure prediction, 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.
[4] Glenn A. Fink, et al. Predicting Computer System Failures Using Support Vector Machines, 2008, WASL.
[5] S. Scott, et al. A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster, 2004.
[6] Laxmikant V. Kalé, et al. Scalable molecular dynamics with NAMD, 2005, J. Comput. Chem.
[7] Matthew Wilcox, et al. Effective HPC hardware management and Failure prediction strategy using IPMI, 2003.
[8] A. Arredondo, et al. Implementing PWM fan speed control within a computer chassis power supply, 2005, Twentieth Annual IEEE Applied Power Electronics Conference and Exposition, 2005. APEC 2005.
[9] Dhabaleswar K. Panda, et al. Nomad: migrating OS-bypass networks in virtual machines, 2007, VEE '07.
[10] Dhabaleswar K. Panda, et al. CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems, 2009, 2009 International Conference on Parallel Processing.
[11] David E. Culler, et al. The ganglia distributed monitoring system: design, implementation, and experience, 2004, Parallel Comput.
[12] Bert J. Debusschere, et al. Ovis-2: A robust distributed architecture for scalable RAS, 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[13] Kincho H. Law, et al. ParCYCLIC: finite element modelling of earthquake liquefaction response on parallel computers, 2004.
[14] Matei Ripeanu, et al. Failure Avoidance through Fault Prediction Based on Synthetic Transactions, 2011, 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.
[15] Christian Engelmann, et al. Proactive Fault Tolerance Using Preemptive Migration, 2009, 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing.
[16] D. K. Panda. InfiniBand Architecture, 2001.
[17] Terry Jones, et al. Accurate fault prediction of BlueGene/P RAS logs via geometric reduction, 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).
[18] John E. Stone, et al. Rendering of numerical flow simulations using MPI, 1996, Proceedings. Second MPI Developer's Conference.
[19] J. Steele. ACPI thermal sensing and control in the PC, 1998, Wescon/98. Conference Proceedings (Cat. No.98CH36265).