Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems
[1] Alexandru Iosup,et al. Analysis and modeling of time-correlated failures in large-scale distributed systems , 2010, 2010 11th IEEE/ACM International Conference on Grid Computing.
[2] Franck Cappello,et al. Adaptive event prediction strategy with dynamic time window for large-scale HPC systems , 2011, SLAML '11.
[3] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[4] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.
[5] Zhiling Lan,et al. Anomaly localization in large-scale clusters , 2007, 2007 IEEE International Conference on Cluster Computing.
[6] Harry O. Posten,et al. The robustness of the two-sample t-test over the Pearson system , 1978 .
[7] Zhiling Lan,et al. Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.
[8] Nithin Nakka,et al. Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[9] Zhiling Lan,et al. System log pre-processing to improve failure prediction , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.
[10] Sreeram Chandrasekar,et al. An efficient methodology for noise characterization , 2005, 18th International Conference on VLSI Design held jointly with 4th International Conference on Embedded Systems Design.
[11] Anand Sivasubramaniam,et al. Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.
[12] Ravishankar K. Iyer,et al. Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data , 1990, IEEE Trans. Computers.
[13] Franck Cappello,et al. Modeling and tolerating heterogeneous failures in large parallel systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[14] Franck Cappello,et al. Event Log Mining Tool for Large Scale HPC Systems , 2011, Euro-Par.
[15] Aleksandra Pizurica,et al. Removal of Correlated Noise by Modeling the Signal of Interest in the Wavelet Domain , 2009, IEEE Transactions on Image Processing.
[16] W.M. Waters,et al. Bandpass Signal Sampling and Coherent Detection , 1982, IEEE Transactions on Aerospace and Electronic Systems.
[17] Franck Cappello,et al. FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[18] Qiang Fu,et al. Mining dependency in distributed systems through unstructured logs analysis , 2010, OPSR.
[19] Lakshminarayanan Subramanian,et al. Root Cause Localization in Large Scale Systems , 2005 .
[20] Domenico Cotroneo,et al. Improving Log-based Field Failure Data Analysis of multi-node computing systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).
[21] Min-Jea Tahk,et al. IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS , 2022, IEEE Aerospace and Electronic Systems Magazine.
[22] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[23] Ling Huang,et al. Online System Problem Detection by Mining Patterns of Console Logs , 2009, 2009 Ninth IEEE International Conference on Data Mining.
[24] Zhiling Lan,et al. A practical failure prediction with location and lead time for Blue Gene/P , 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).
[25] Christian Engelmann,et al. Blue Gene/L Log Analysis and Time to Interrupt Estimation , 2009, 2009 International Conference on Availability, Reliability and Security.
[26] Jianfeng Zhan,et al. LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems , 2010, 2012 IEEE 31st Symposium on Reliable Distributed Systems.