Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale
暂无分享,去创建一个
Christian Engelmann | Saurabh Hukerikar | Byung H. Park | Ryan Adamson | C. Engelmann | Byung H. Park | Ryan Adamson | Saurabh Hukerikar
[1] Avrim Blum,et al. Correlation Clustering , 2004, Machine Learning.
[2] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..
[3] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[4] Domenico Cotroneo,et al. Improving Log-based Field Failure Data Analysis of multi-node computing systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).
[5] Eduardo Pinheiro,et al. DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.
[6] Gregory Piatetsky-Shapiro,et al. Discovery, Analysis, and Presentation of Strong Rules , 1991, Knowledge Discovery in Databases.
[7] J. Ross Quinlan,et al. Induction of Decision Trees , 1986, Machine Learning.
[8] Sudhanva Gurumurthi,et al. Feng Shui of supercomputer memory positional effects in DRAM and SRAM faults , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[9] Mark S. Squillante,et al. Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.
[10] Saurabh Gupta,et al. Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[11] Anand Sivasubramaniam,et al. Filtering failure logs for a BlueGene/L prototype , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[12] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[13] Bianca Schroeder,et al. Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design , 2012, ASPLOS XVII.
[14] Wolfgang Barth,et al. Nagios: System and Network Monitoring , 2006 .
[15] Domenico Cotroneo,et al. Assessing time coalescence techniques for the analysis of supercomputer logs , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[16] Ann C. Gentile,et al. OVIS: a tool for intelligent, real-time monitoring of computational clusters , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[17] Franck Cappello,et al. Fault prediction under the microscope: A closer look into HPC systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[18] David Heckerman,et al. Bayesian Networks for Data Mining , 2004, Data Mining and Knowledge Discovery.
[19] Ravishankar K. Iyer,et al. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[20] David E. Culler,et al. The ganglia distributed monitoring system: design, implementation, and experience , 2004, Parallel Comput..