Spatiotemporal Real-Time Anomaly Detection for Supercomputing Systems
Wei-keng Liao | Kesheng Wu | Rajkumar Kettimuthu | Alok Choudhary | Peter H. Beckman | Alex Sim | Ankit Agrawal | Zhengchun Liu | Qiao Kang
[1] Terry Jones, et al. Accurate fault prediction of BlueGene/P RAS logs via geometric reduction, 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).
[2] Franck Cappello, et al. Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System, 2019, IEEE Transactions on Parallel and Distributed Systems.
[3] Michael Gschwind, et al. The IBM Blue Gene/Q Compute Chip, 2012, IEEE Micro.
[4] Franck Cappello, et al. LOGAIDER: A Tool for Mining Potential Correlations of HPC Log Events, 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).
[5] Rajeev Thakur, et al. A Meta-Learning Failure Predictor for Blue Gene/L Systems, 2007, 2007 International Conference on Parallel Processing (ICPP 2007).
[6] R. Capuzzo-Dolcetta. PRACE: Partnership for Advanced Computing in Europe, 2010.
[7] Franck Cappello, et al. Characterizing and Understanding HPC Job Failures Over The 2K-Day Life of IBM BlueGene/Q System, 2019, 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[8] Miroslaw Malek, et al. Using Hidden Semi-Markov Models for Effective Online Failure Prediction, 2007, 2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007).
[9] Philip Heidelberger, et al. The IBM Blue Gene/Q interconnection network and message unit, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[10] Franck Cappello, et al. Reducing Waste in Extreme Scale Systems through Introspective Analysis, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[11] Anand Sivasubramaniam, et al. Failure Prediction in IBM BlueGene/L Event Logs, 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).
[12] Anand Sivasubramaniam, et al. BlueGene/L Failure Analysis and Prediction Models, 2006, International Conference on Dependable Systems and Networks (DSN'06).
[13] Christopher D. Carothers, et al. An analysis of clustered failures on large supercomputing systems, 2009, J. Parallel Distributed Comput.
[14] Zhiling Lan, et al. A practical failure prediction with location and lead time for Blue Gene/P, 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).
[15] Zhiling Lan, et al. Practical online failure prediction for Blue Gene/P: Period-based vs event-driven, 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W).
[16] Zhiling Lan, et al. System log pre-processing to improve failure prediction, 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.
[17] Kesheng Wu, et al. Towards Autonomic Science Infrastructure: Architecture, Limitations, and Open Issues, 2018, AI-Science@HPDC.
[18] Ping Huang, et al. Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy, 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[19] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2006, IEEE Transactions on Dependable and Secure Computing.
[20] Anand Sivasubramaniam, et al. Filtering failure logs for a BlueGene/L prototype, 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[21] Christian Engelmann, et al. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.