Online failure prediction for HPC resources using decentralized clustering
Edward Chuah | James C. Browne | Manish Parashar | Andres Quiroz | Alejandro Pelaez
[1] Miroslaw Malek, et al. Predicting failures of computer systems: a case study for a telecommunication system, 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[2] Naveen Sharma, et al. Design and evaluation of decentralized online clustering, 2012, TAAS.
[3] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2010, IEEE Trans. Dependable Secur. Comput.
[4] Janos Gertler, et al. Fault Detection and Diagnosis, 2008, Encyclopedia of Systems and Control.
[5] Gautam Biswas, et al. Bayesian Fault Detection and Diagnosis in Dynamic Systems, 2000, AAAI/IAAI.
[6] Rajeev Thakur, et al. A study of dynamic meta-learning for failure prediction in large-scale systems, 2010, J. Parallel Distributed Comput.
[7] Jianfeng Zhan, et al. LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems, 2010, 2012 IEEE 31st Symposium on Reliable Distributed Systems.
[8] Naveen Sharma, et al. Clustering Analysis for the Management of Self-Monitoring Device Networks, 2008, 2008 International Conference on Autonomic Computing.
[9] Tommy Minyard, et al. End-to-end framework for fault management for open source clusters: Ranger, 2010, TG.
[10] Manish Parashar, et al. Autonomic management of distributed systems using online clustering, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).
[11] Hui Xiong, et al. Failure Prediction in IBM BlueGene/L Event Logs, 2007, ICDM.
[12] Hans-Peter Kriegel, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996, KDD.
[13] Chao Liu, et al. SOBER: statistical model-based bug localization, 2005, ESEC/FSE-13.
[14] Christian Callegari, et al. A Novel PCA-Based Network Anomaly Detection, 2011, 2011 IEEE International Conference on Communications (ICC).
[15] Anand Sivasubramaniam, et al. Critical event prediction for proactive management in large-scale computer clusters, 2003, KDD '03.
[16] Charng-Da Lu, et al. Reliability challenges in large systems, 2006, Future Gener. Comput. Syst.
[17] Janos Gertler, et al. Fault detection and diagnosis in engineering systems, 1998.
[18] Arshad Jhumka, et al. Linking Resource Usage Anomalies with System Failures from Cluster Log Data, 2013, 2013 IEEE 32nd International Symposium on Reliable Distributed Systems.