Failure Diagnosis for Cluster Systems using Partial Correlations
[1] Frank Mueller, et al. Systemic Assessment of Node Failures in HPC Production Platforms, 2021, 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[2] Shilin He, et al. Experience Report: System Log Analysis for Anomaly Detection, 2016, 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE).
[3] Erkki Oja, et al. Independent component analysis: algorithms and applications, 2000, Neural Networks.
[4] Arshad Jhumka, et al. Using Message Logs and Resource Use Data for Cluster Failure Diagnosis, 2016, 2016 IEEE 23rd International Conference on High Performance Computing (HiPC).
[5] James C. Browne, et al. Enabling comprehensive data-driven system management for large computational facilities, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[6] Bianca Schroeder, et al. The Computer Failure Data Repository (CFDR): collecting, sharing and analyzing failure data, 2006, SC.
[7] Rodrigo Fonseca, et al. Pivot tracing, 2018, USENIX ATC.
[8] Franck Cappello, et al. LOGAIDER: A Tool for Mining Potential Correlations of HPC Log Events, 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).
[9] Alan Agresti, et al. Statistics: The Art and Science of Learning from Data, 2005.
[10] Jon Stearley, et al. What Supercomputers Say: A Study of Five System Logs, 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[11] Arshad Jhumka, et al. Towards comprehensive dependability-driven resource use and message log-analysis for HPC systems diagnosis, 2019, J. Parallel Distributed Comput.
[12] Fengxi Song, et al. Feature Selection Using Principal Component Analysis, 2010, 2010 International Conference on System Science, Engineering Design and Manufacturing Informatization.
[13] R. Shibata, et al. Partial Correlation and Conditional Correlation as Measures of Conditional Independence, 2004.
[14] Joachim Selbig, et al. Non-linear PCA: a missing data approach, 2005, Bioinform.
[15] Bernd Hamann, et al. Scrubjay: Deriving Knowledge from the Disarray of HPC Performance Data, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[16] Merve Astekin, et al. DILAF: A framework for distributed analysis of large‐scale system logs for anomaly detection, 2018, Softw. Pract. Exp.
[17] Carl E. Landwehr, et al. Basic concepts and taxonomy of dependable and secure computing, 2004, IEEE Transactions on Dependable and Secure Computing.
[18] Saurabh Gupta, et al. Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems, 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[19] Zhiling Lan, et al. 3-Dimensional root cause diagnosis via co-analysis, 2012, ICAC '12.
[20] Neeraj Suri, et al. On-Line Diagnosis and Recovery: On the Choice and Impact of Tuning Parameters, 2007, IEEE Transactions on Dependable and Secure Computing.
[21] Song Fu, et al. Adaptive Anomaly Identification by Exploring Metric Subspace in Cloud Computing Infrastructures, 2013, 2013 IEEE 32nd International Symposium on Reliable Distributed Systems.
[22] Jelle J. Goeman, et al. Multiple hypothesis testing in genomics, 2014, Statistics in medicine.
[23] Arshad Jhumka, et al. Using Resource Use Data and System Logs for HPC System Error Propagation and Recovery Diagnosis, 2019, 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom).
[24] Wei Xu, et al. What Can We Learn from Four Years of Data Center Hardware Failures?, 2017, 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[25] Matthias Scholz, et al. Validation of Nonlinear PCA, 2012, Neural Processing Letters.
[26] Chokchai Leangsuksun, et al. Baler: deterministic, lossless log message clustering tool, 2011, Computer Science - Research and Development.
[27] Zhiling Lan, et al. Toward Automated Anomaly Identification in Large-Scale Systems, 2010, IEEE Transactions on Parallel and Distributed Systems.
[28] Arshad Jhumka, et al. Linking Resource Usage Anomalies with System Failures from Cluster Log Data, 2013, 2013 IEEE 32nd International Symposium on Reliable Distributed Systems.
[29] Tommy Minyard, et al. End-to-end framework for fault management for open source clusters: Ranger, 2010, TG.
[30] James C. Browne, et al. Open XDMoD: A Tool for the Comprehensive Management of High-Performance Computing Resources, 2015, Computing in Science & Engineering.
[31] Guy Feldman, et al. A Principled Approach to HPC Event Monitoring, 2015, FTXS@HPDC.
[32] Saurabh Jha, et al. The Mystery of the Failing Jobs: Insights from Operational Data from Two University-Wide Computing Systems, 2020, 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[33] Weisong Shi, et al. Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System, 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[34] Heiko Hoffmann, et al. Kernel PCA for novelty detection, 2007, Pattern Recognit.
[35] James C. Browne, et al. Understanding Application and System Performance Through System-Wide Monitoring, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[36] Franck Cappello, et al. Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[37] Christian Engelmann, et al. A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log, 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).
[38] Edward Chuah, et al. Online failure prediction for HPC resources using decentralized clustering, 2014, 2014 21st International Conference on High Performance Computing (HiPC).
[39] Franck Cappello, et al. Characterizing and Understanding HPC Job Failures Over The 2K-Day Life of IBM BlueGene/Q System, 2019, 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[40] Alexander Aiken, et al. Using correlated surprise to infer shared influence, 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).
[41] Raymond H. Myers, et al. Probability and Statistics for Engineers and Scientists, 1973.
[42] Christian Engelmann, et al. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[43] Song Fu, et al. Anomaly detection in large-scale coalition clusters for dependability assurance, 2010, 2010 International Conference on High Performance Computing.