Failure prediction for HPC systems and applications
Franck Cappello | Marc Snir | Ana Gainaru | William Kramer
[1] Franck Cappello,et al. Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[2] Collin McCurdy,et al. Early evaluation of IBM BlueGene/P , 2008, HiPC 2008.
[3] Franck Cappello,et al. Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[4] Zhiling Lan,et al. A practical failure prediction with location and lead time for Blue Gene/P , 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).
[5] Franck Cappello,et al. FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[6] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..
[7] Franck Cappello,et al. Event Log Mining Tool for Large Scale HPC Systems , 2011, Euro-Par.
[8] William Farr,et al. Software reliability modeling survey , 1996 .
[9] Major Scott G. Frickenstein. Reliability Theory with Applications to Preventive Maintenance, Ilya Gertsbakh, Springer-Verlag, 219 pp., ISBN 3-540-67275-3 , 2002 .
[10] Bianca Schroeder,et al. Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design , 2012, ASPLOS XVII.
[11] Hai Qiu,et al. Physics-based Remaining Useful Life Prediction for Aircraft Engine Bearing Prognosis , 2009 .
[12] Ling Huang,et al. Online System Problem Detection by Mining Patterns of Console Logs , 2009, 2009 Ninth IEEE International Conference on Data Mining.
[13] Zhiling Lan,et al. Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.
[14] Qiang Fu,et al. Mining dependency in distributed systems through unstructured logs analysis , 2010, OPSR.
[15] Michael Tortorella,et al. Reliability Theory: With Applications to Preventive Maintenance , 2001, Technometrics.
[16] Uday Kumar,et al. FAILURE PREDICTION OF RAIL CONSIDERING ROLLING CONTACT FATIGUE , 2010 .
[17] Franck Cappello,et al. Modeling and tolerating heterogeneous failures in large parallel systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[18] Pedro Trancoso,et al. Trends in High-Performance Computing , 2011, Computing in Science & Engineering.
[19] Zhiling Lan,et al. Anomaly localization in large-scale clusters , 2007, 2007 IEEE International Conference on Cluster Computing.
[20] Dorothy M. Andrews,et al. A Methodology for Analysis of Failure Prediction Data , 1985, RTSS.
[21] Dhabaleswar K. Panda,et al. Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[22] Joseph L. Hellerstein,et al. Predictive algorithms in the management of computer systems , 2002, IBM Syst. J..
[23] Miroslaw Malek,et al. A survey of online failure prediction methods , 2010, CSUR.
[24] Nithin Nakka,et al. Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[25] William Gropp,et al. Exascale Research: Preparing for the Post-Moore Era , 2011 .
[26] F. Al-Shamali,et al. Author Biographies. , 2015, Journal of social work in disability & rehabilitation.
[27] Narayan Desai,et al. Co-analysis of RAS Log and Job Log on Blue Gene/P , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[28] Vanish Talwar,et al. Online detection of utility cloud anomalies using metric distributions , 2010, 2010 IEEE Network Operations and Management Symposium - NOMS 2010.
[29] Zhiling Lan,et al. Practical online failure prediction for Blue Gene/P: Period-based vs event-driven , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W).
[30] Franck Cappello,et al. Fault prediction under the microscope: A closer look into HPC systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[31] David A. Patterson,et al. Path-Based Failure and Evolution Management , 2004, NSDI.
[32] Laxmikant V. Kalé,et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[33] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[34] Jon Stearley,et al. A State-Machine Approach to Disambiguating Supercomputer Event Logs , 2012, MAD.
[35] Miroslaw Malek,et al. Using Hidden Semi-Markov Models for Effective Online Failure Prediction , 2007, 2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007).
[36] Rajeev Thakur,et al. A study of dynamic meta-learning for failure prediction in large-scale systems , 2010, J. Parallel Distributed Comput..
[37] Bianca Schroeder,et al. Cosmic rays don't strike twice , 2012 .
[38] Yves Robert,et al. Impact of fault prediction on checkpointing strategies , 2012, ArXiv.
[39] Domenico Cotroneo,et al. Assessing time coalescence techniques for the analysis of supercomputer logs , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).