Failure prediction: what to do with unpredicted failures?
[1] Franck Cappello, et al. Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[2] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2006, IEEE Transactions on Dependable and Secure Computing.
[3] John T. Daly, et al. Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters, 2010, HPDC '10.
[4] James D. Knoke. Testing for randomness against autocorrelation: Alternative tests, 1977.
[5] J. Wolfowitz, et al. On a Test Whether Two Samples are from the Same Population, 1940.
[6] Jean-Marc Vincent, et al. Discovering Statistical Models of Availability in Large Distributed Systems: An Empirical Study of SETI@home, 2011, IEEE Transactions on Parallel and Distributed Systems.
[7] Zhiling Lan, et al. Adaptive Fault Management of Parallel Applications for High-Performance Computing, 2008, IEEE Transactions on Computers.
[8] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.
[9] Christopher D. Carothers, et al. An analysis of clustered failures on large supercomputing systems, 2009, J. Parallel Distributed Comput.
[10] Mark S. Squillante, et al. Failure data analysis of a large-scale heterogeneous server environment, 2004, International Conference on Dependable Systems and Networks, 2004.
[11] J. V. Bradley. Distribution-Free Statistical Tests, 1968.
[12] E. L. Lehmann, et al. Theory of point estimation, 1950.
[13] Christian Engelmann, et al. Blue Gene/L Log Analysis and Time to Interrupt Estimation, 2009, 2009 International Conference on Availability, Reliability and Security.
[14] Franck Cappello, et al. Event Log Mining Tool for Large Scale HPC Systems, 2011, Euro-Par.
[15] Dhabaleswar K. Panda, et al. Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[16] Nithin Nakka, et al. Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs, 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[17] Yuanyuan Zhou, et al. Understanding Customer Problem Troubleshooting from Storage System Logs, 2009, FAST.
[18] Emmanuel Jeannot, et al. Optimizing performance and reliability on heterogeneous parallel systems: Approximation algorithms and heuristics, 2012, J. Parallel Distributed Comput.
[19] F. Massey. The Kolmogorov-Smirnov Test for Goodness of Fit, 1951.
[20] Miroslaw Malek, et al. A survey of online failure prediction methods, 2010, CSUR.
[21] D. B. Owen, et al. Confidence intervals for the coefficient of variation for the normal and log normal distributions, 1964.
[22] Denis Trystram, et al. On the Scheduling of Checkpoints in Desktop Grids, 2011, 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.
[23] Franck Cappello, et al. Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[24] Dror G. Feitelson, et al. Workload Modeling for Performance Evaluation, 2002, Performance.
[25] Jianfeng Zhan, et al. LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems, 2010, 2012 IEEE 31st Symposium on Reliable Distributed Systems.
[26] Robert L. Wolpert, et al. Statistical Inference, 2019, Encyclopedia of Social Network Analysis and Mining.
[27] Narayan Desai, et al. Co-analysis of RAS Log and Job Log on Blue Gene/P, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[28] Zhiling Lan, et al. Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study, 2008, 2008 37th International Conference on Parallel Processing.
[29] Franck Cappello, et al. Modeling and tolerating heterogeneous failures in large parallel systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[30] Franck Cappello, et al. Fault prediction under the microscope: A closer look into HPC systems, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[31] John W. Young, et al. A first order approximation to the optimum checkpoint interval, 1974, CACM.
[32] Jon Stearley, et al. A State-Machine Approach to Disambiguating Supercomputer Event Logs, 2012, MAD.