Fault prediction under the microscope: A closer look into HPC systems
[1] Josep Torrellas,et al. Rebound: Scalable checkpointing for coherent shared memory , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[2] Franck Cappello,et al. Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[3] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[4] Zhiling Lan,et al. A practical failure prediction with location and lead time for Blue Gene/P , 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).
[5] Zhiling Lan,et al. System log pre-processing to improve failure prediction , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.
[6] Jianfeng Zhan,et al. LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems , 2010, 2012 IEEE 31st Symposium on Reliable Distributed Systems.
[7] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[8] Ling Huang,et al. Online System Problem Detection by Mining Patterns of Console Logs , 2009, 2009 Ninth IEEE International Conference on Data Mining.
[9] Franck Cappello,et al. Modeling and tolerating heterogeneous failures in large parallel systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[10] Maguelonne Teisseire,et al. Mining Frequent Gradual Itemsets from Large Databases , 2009, IDA.
[11] Domenico Cotroneo,et al. Improving Log-based Field Failure Data Analysis of multi-node computing systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).
[12] Roy C. Milton,et al. An Extended Table of Critical Values for the Mann-Whitney (Wilcoxon) Two-Sample Statistic , 1964.
[13] Laxmikant V. Kalé,et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[14] Alexandru Iosup,et al. Analysis and modeling of time-correlated failures in large-scale distributed systems , 2010, 2010 11th IEEE/ACM International Conference on Grid Computing.
[15] Franck Cappello,et al. Event Log Mining Tool for Large Scale HPC Systems , 2011, Euro-Par.
[16] Franck Cappello,et al. Adaptive event prediction strategy with dynamic time window for large-scale HPC systems , 2011, SLAML '11.
[17] Anand Sivasubramaniam,et al. Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.
[18] Franck Cappello,et al. Checkpointing vs. Migration for Post-Petascale Supercomputers , 2010, 2010 39th International Conference on Parallel Processing.
[19] Franck Cappello,et al. FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[20] Christian Engelmann,et al. Blue Gene/L Log Analysis and Time to Interrupt Estimation , 2009, 2009 International Conference on Availability, Reliability and Security.
[21] Christian Engelmann,et al. Proactive process-level live migration in HPC environments , 2008, HiPC 2008.
[22] Malgorzata Steinder,et al. A survey of fault localization techniques in computer networks , 2004, Sci. Comput. Program.
[23] Qiang Fu,et al. Mining dependency in distributed systems through unstructured logs analysis , 2010, OPSR.
[24] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[25] Zhiling Lan,et al. Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.
[26] Nithin Nakka,et al. Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[27] Alexandre Termier,et al. PGP-mc: Towards a Multicore Parallel Approach for Mining Gradual Patterns , 2010, DASFAA.
[28] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.
[29] Ravishankar K. Iyer,et al. Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data , 1990, IEEE Trans. Computers.