Ensemble of Bayesian Predictors and Decision Trees for Proactive Failure Management in Cloud Computing Systems
Qiang Guan | Ziming Zhang | Song Fu
[1] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[2] Anand Sivasubramaniam,et al. Filtering failure logs for a BlueGene/L prototype , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[3] Cheng-Zhong Xu,et al. Exploring event correlation for failure prediction in coalitions of clusters , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[4] Miroslaw Malek,et al. A survey of online failure prediction methods , 2010, CSUR.
[5] Song Fu,et al. auto-AID: A data mining framework for autonomic anomaly identification in networked computer systems , 2010, International Performance Computing and Communications Conference.
[6] Cheng-Zhong Xu,et al. Proactive Resource Management for Failure Resilient High Performance Computing Clusters , 2009, 2009 International Conference on Availability, Reliability and Security.
[7] Richard P. Martin,et al. Improving cluster availability using workstation validation , 2002, SIGMETRICS '02.
[8] Anand Sivasubramaniam,et al. Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.
[9] Miroslaw Malek,et al. Using Hidden Semi-Markov Models for Effective Online Failure Prediction , 2007, 2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007).
[10] Jeffrey S. Chase,et al. Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control , 2004, OSDI.
[11] Zhiling Lan,et al. Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.
[12] Felix Salfner,et al. Cross-core event monitoring for processor failure prediction , 2009, 2009 International Conference on High Performance Computing & Simulation.
[13] Mark S. Squillante,et al. Performance Implications of Failures in Large-Scale Cluster Scheduling , 2004, JSSPP.
[14] Wei Peng,et al. Mining Logs Files for Computing System Management , 2005, Second International Conference on Autonomic Computing (ICAC'05).
[15] Christian Engelmann,et al. Proactive fault tolerance for HPC with Xen virtualization , 2007, ICS '07.
[16] Swapna S. Gokhale,et al. Analytical Models for Architecture-Based Software Reliability Prediction: A Unification Framework , 2006, IEEE Transactions on Reliability.
[17] Cheng-Zhong Xu,et al. Quantifying event correlations for proactive failure management in networked computing systems , 2010, J. Parallel Distributed Comput.
[18] Suman Nath,et al. Beyond Availability: Towards a Deeper Understanding of Machine Failure Characteristics in Large Distributed Systems , 2004, WORLDS.
[19] Jason Nieh,et al. Transparent Checkpoint-Restart of Multiple Processes on Commodity Operating Systems , 2007, USENIX Annual Technical Conference.
[20] Leo Breiman,et al. Classification and Regression Trees , 1984.
[21] Petra Perner,et al. Data Mining - Concepts and Techniques , 2002, Künstliche Intell.
[22] Xiaobo Zhou,et al. Regression based multi-tier resource provisioning for session slowdown guarantees , 2010, International Performance Computing and Communications Conference.
[23] Mark S. Squillante,et al. Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.
[24] Anand Sivasubramaniam,et al. Fault-aware job scheduling for BlueGene/L systems , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[25] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.
[26] Wei-Yin Loh,et al. Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov.
[27] David G. Stork,et al. Pattern Classification , 1973.
[28] Daniel Marques,et al. Compiler-enhanced incremental checkpointing for OpenMP applications , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[29] Thomas M. Cover,et al. Elements of Information Theory , 2005.
[30] Chao Wang,et al. Proactive process-level live migration in HPC environments , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[31] Raja Nassar,et al. Availability modeling and analysis on high performance cluster computing systems , 2006, First International Conference on Availability, Reliability and Security (ARES'06).
[32] Ziming Zhang,et al. Ensemble of Bayesian Predictors for Autonomic Failure Management in Cloud Computing , 2011, 2011 Proceedings of 20th International Conference on Computer Communications and Networks (ICCCN).
[33] Xiaobo Zhou,et al. Coordinated session-based admission control with statistical learning for multi-tier internet applications , 2011, J. Netw. Comput. Appl.
[34] Brian D. Noble,et al. Exploiting Availability Prediction in Distributed Systems , 2006, NSDI.
[35] Ziming Zhang,et al. Proactive Failure Management by Integrated Unsupervised and Semi-Supervised Learning for Dependable Cloud Systems , 2011, 2011 Sixth International Conference on Availability, Reliability and Security.
[36] Roy Friedman,et al. Model-based performance evaluation of distributed checkpointing protocols , 2008, Perform. Evaluation.
[37] L. Alvisi,et al. A Survey of Rollback-Recovery Protocols , 2002.
[38] Michael I. Jordan,et al. Failure diagnosis using decision trees , 2004.
[39] Song Fu,et al. Anomaly detection in large-scale coalition clusters for dependability assurance , 2010, 2010 International Conference on High Performance Computing.
[40] Miroslaw Malek,et al. Proactive fault handling for system availability enhancement , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[41] Felix Salfner,et al. Timely Virtual Machine Migration for Pro-active Fault Tolerance , 2011, 2011 14th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing Workshops.
[42] Song Fu,et al. Failure-aware resource management for high-availability computing clusters with distributed virtual machines , 2010, J. Parallel Distributed Comput.
[43] Ziming Zhang,et al. Failure prediction for autonomic management of networked computer systems with availability assurance , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).
[44] Laxmikant V. Kalé,et al. Proactive Fault Tolerance in MPI Applications Via Task Migration , 2006, HiPC.
[45] Song Fu. Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing , 2009, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid.
[46] Cheng-Zhong Xu,et al. Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management , 2007, 2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007).
[47] Zhiling Lan,et al. A fast restart mechanism for checkpoint/recovery protocols in networked environments , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).