Failure prediction for autonomic management of networked computer systems with availability assurance
[1] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[2] Charng-Da Lu,et al. Big Systems and Big Reliability Challenges , 2003, PARCO.
[3] Swapna S. Gokhale,et al. Analytical Models for Architecture-Based Software Reliability Prediction: A Unification Framework , 2006, IEEE Transactions on Reliability.
[4] Song Fu,et al. Failure-aware resource management for high-availability computing clusters with distributed virtual machines , 2010, J. Parallel Distributed Comput.
[5] Song Fu. Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing , 2009, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid.
[6] Laxmikant V. Kalé,et al. Proactive Fault Tolerance in MPI Applications Via Task Migration , 2006, HiPC.
[7] Christian Engelmann,et al. Proactive fault tolerance for HPC with Xen virtualization , 2007, ICS '07.
[8] Wu-chun Feng,et al. A Power-Aware Run-Time System for High-Performance Computing , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[9] J. Berger,et al. Objective Bayesian Analysis of Spatially Correlated Data , 2001.
[10] Mourad Hakem,et al. Reliability and Scheduling on Systems Subject to Failures , 2007, 2007 International Conference on Parallel Processing (ICPP 2007).
[11] Richard P. Martin,et al. Improving cluster availability using workstation validation , 2002, SIGMETRICS '02.
[12] Zhiling Lan,et al. Exploit failure prediction for adaptive fault-tolerance in cluster computing , 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).
[13] Xiao Qin,et al. A dynamic and reliability-driven scheduling algorithm for parallel real-time jobs executing on heterogeneous clusters , 2005, J. Parallel Distributed Comput.
[14] Willy Zwaenepoel,et al. Dynamic content web applications: Crash, failover, and recovery analysis , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.
[15] Atakan Dogan,et al. Matching and Scheduling Algorithms for Minimizing Execution Time and Failure Probability of Applications in Heterogeneous Computing , 2002, IEEE Trans. Parallel Distributed Syst.
[16] L. Alvisi,et al. A Survey of Rollback-Recovery Protocols , 2002.
[17] David F. Heidel,et al. An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[18] Larry Rudolph,et al. Probabilistic QoS guarantees for supercomputing systems , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[19] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[20] Jason Nieh,et al. Transparent Checkpoint-Restart of Multiple Processes on Commodity Operating Systems , 2007, USENIX Annual Technical Conference.
[21] Gao Wen,et al. A proactive fault-detection mechanism in large-scale cluster systems , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[22] F. Mueller,et al. Proactive process-level live migration in HPC environments , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[23] Ming Wu,et al. Performance under failures of high-end computing , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[24] Cheng-Zhong Xu,et al. Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management , 2007, 2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007).
[25] Cheng-Zhong Xu,et al. Proactive Resource Management for Failure Resilient High Performance Computing Clusters , 2009, 2009 International Conference on Availability, Reliability and Security.
[26] Anand Sivasubramaniam,et al. Filtering failure logs for a BlueGene/L prototype , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[27] Cheng-Zhong Xu,et al. Exploring event correlation for failure prediction in coalitions of clusters , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[28] Miroslaw Malek,et al. Predicting failures of computer systems: a case study for a telecommunication system , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[29] Suman Nath,et al. Beyond Availability: Towards a Deeper Understanding of Machine Failure Characteristics in Large Distributed Systems , 2004, WORLDS.
[30] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput.
[31] David García,et al. NonStop® advanced architecture , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[32] Robert G. Gallager,et al. Discrete Stochastic Processes , 1995.
[33] Brian D. Noble,et al. Exploiting Availability Prediction in Distributed Systems , 2006, NSDI.
[34] Mark S. Squillante,et al. Performance Implications of Failures in Large-Scale Cluster Scheduling , 2004, JSSPP.
[35] Zhiling Lan,et al. A fast restart mechanism for checkpoint/recovery protocols in networked environments , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).
[36] Roy Friedman,et al. Model-based performance evaluation of distributed checkpointing protocols , 2008, Perform. Evaluation.
[37] Mark S. Squillante,et al. Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.
[38] Bianca Schroeder,et al. Understanding failures in petascale computers , 2007.
[39] Anand Sivasubramaniam,et al. Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.
[40] Samiha Mourad,et al. On the Reliability of the IBM MVS/XA Operating System , 1987, IEEE Transactions on Software Engineering.
[41] Daniel Marques,et al. Compiler-enhanced incremental checkpointing for OpenMP applications , 2009, IPDPS.
[42] Zhiling Lan,et al. Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.
[43] Luiz C. Alves,et al. Reliability, availability, and serviceability (RAS) of the IBM eServer z990 , 2004, IBM J. Res. Dev.