Using Message Logs and Resource Use Data for Cluster Failure Diagnosis
Arshad Jhumka | Edward Chuah | Sai Narasimhamurthy | Bill Barth | Nentawe Gurumdimma | James C. Browne
[1] John P. Rouillard. Real-time Log File Analysis Using the Simple Event Correlator (SEC), 2004, LISA.
[2] Arshad Jhumka, et al. Towards Detecting Patterns in Failure Logs of Large-Scale Distributed Systems, 2015, 2015 IEEE International Parallel and Distributed Processing Symposium Workshop.
[3] Arshad Jhumka, et al. Towards Increasing the Error Handling Time Window in Large-Scale Distributed Systems Using Console and Resource Usage Logs, 2015, 2015 IEEE Trustcom/BigDataSE/ISPA.
[4] Rajeev Gandhi, et al. Draco: Statistical diagnosis of chronic problems in large distributed systems, 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[5] Alexander Aiken, et al. Using correlated surprise to infer shared influence, 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).
[6] Arshad Jhumka, et al. Linking Resource Usage Anomalies with System Failures from Cluster Log Data, 2013, 2013 IEEE 32nd International Symposium on Reliable Distributed Systems.
[7] Zhi-Li Zhang, et al. Extracting the textual and temporal structure of supercomputing logs, 2009, 2009 International Conference on High Performance Computing (HiPC).
[8] Thomas Reidemeister, et al. Diagnosis of recurrent faults using log files, 2009, CASCON.
[9] Alexander Aiken, et al. Alert Detection in System Logs, 2008, 2008 Eighth IEEE International Conference on Data Mining.
[10] Edward Chuah, et al. Online failure prediction for HPC resources using decentralized clustering, 2014, 2014 21st International Conference on High Performance Computing (HiPC).
[11] Arshad Jhumka, et al. CRUDE: Combining Resource Usage Data and Error Logs for Accurate Error Detection in Large-Scale Distributed Systems, 2016, 2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS).
[12] Song Fu, et al. Anomaly detection in large-scale coalition clusters for dependability assurance, 2010, 2010 International Conference on High Performance Computing.
[13] Evangelos E. Milios, et al. Clustering event logs using iterative partitioning, 2009, KDD.
[14] Saharon Rosset, et al. Analyzing system logs: a new view of what's important, 2007.
[15] Zhiling Lan, et al. 3-Dimensional root cause diagnosis via co-analysis, 2012, ICAC '12.
[16] Jianfeng Zhan, et al. LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems, 2010, 2012 IEEE 31st Symposium on Reliable Distributed Systems.
[17] Tommy Minyard, et al. End-to-end framework for fault management for open source clusters: Ranger, 2010, TG.
[18] Felix Salfner, et al. Cross-core event monitoring for processor failure prediction, 2009, 2009 International Conference on High Performance Computing & Simulation.
[19] Risto Vaarandi, et al. Mining event logs with SLCT and LogHound, 2008, NOMS 2008 - 2008 IEEE Network Operations and Management Symposium.
[20] Bronis R. de Supinski, et al. Automatic fault characterization via abnormality-enhanced classification, 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[21] Daniel T. Larose, et al. Discovering Knowledge in Data: An Introduction to Data Mining, 2005.
[22] Raymond H. Myers, et al. Probability and Statistics for Engineers and Scientists, 1973.
[23] Felix Salfner, et al. Error Log Processing for Accurate Failure Prediction, 2008, WASL.
[24] Franck Cappello, et al. Fault prediction under the microscope: A closer look into HPC systems, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[25] Ling Huang, et al. Mining Console Logs for Large-Scale System Problem Detection, 2008, SysML.