Failure Prediction Mechanisms in Cluster Systems
[1] Kishor S. Trivedi,et al. A proactive approach towards always-on availability in broadband cable networks , 2005, Comput. Commun..
[2] Liudong Xing. Reliability analysis of fault-tolerant systems with common-cause failures , 2003, 2003 International Conference on Dependable Systems and Networks, 2003. Proceedings..
[3] Kishor S. Trivedi,et al. Proactive management of software aging , 2001, IBM J. Res. Dev..
[4] David Lorge Parnas,et al. Software aging , 1994, Proceedings of 16th International Conference on Software Engineering.
[5] William H. Sanders,et al. A performability-oriented software rejuvenation framework for distributed applications , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[6] Marina Papatriantafilou,et al. Dynamic and fault-tolerant cluster management , 2005, Fifth IEEE International Conference on Peer-to-Peer Computing (P2P'05).
[7] Yuanyuan Zhou,et al. Fast cluster failover using virtual memory-mapped communication , 1999, ICS '99.
[8] Jason Duell,et al. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .
[9] Mark S. Squillante,et al. Performance Implications of Failures in Large-Scale Cluster Scheduling , 2004, JSSPP.
[10] Xiaola Lin,et al. A Variational Calculus Approach to Optimal Checkpoint Placement , 2001, IEEE Trans. Computers.
[11] Christian Engelmann,et al. Job-Site Level Fault Tolerance for Cluster and Grid environments , 2005, 2005 IEEE International Conference on Cluster Computing.
[12] George Candea,et al. Reducing recovery time in a small recursively restartable system , 2002, Proceedings International Conference on Dependable Systems and Networks.
[13] Wei Xie,et al. Performability analysis of clustered systems with rejuvenation under varying workload , 2007, Perform. Evaluation.
[14] Zhiling Lan,et al. Exploit failure prediction for adaptive fault-tolerance in cluster computing , 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).
[15] Laxmikant V. Kale,et al. Proactive Fault Tolerance in Large Systems , 2004 .
[16] Füsun Özgüner,et al. Enhanced Cluster k-Ary n-Cube, A Fault-Tolerant Multiprocessor , 2003, IEEE Trans. Computers.
[17] Kishor S. Trivedi,et al. Proactive management of software systems: analysis and implementation , 2002 .
[18] Kishor S. Trivedi,et al. Performance Assurance via Software Rejuvenation: Monitoring, Statistics and Algorithms , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[19] C. Siva Ram Murthy,et al. Task Allocation Algorithms for Maximizing Reliability of Distributed Computing Systems , 1997, IEEE Trans. Computers.
[20] Kishor S. Trivedi,et al. Analysis and implementation of software rejuvenation in cluster systems , 2001, SIGMETRICS '01.
[21] Song Jiang,et al. Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[22] Elaine J. Weyuker,et al. Ensuring system performance for cluster and single server systems , 2007, J. Syst. Softw..
[23] Kishor S. Trivedi,et al. A measurement-based model for estimation of resource exhaustion in operational software systems , 1999, Proceedings 10th International Symposium on Software Reliability Engineering (Cat. No.PR00443).
[24] Kishor S. Trivedi,et al. Stochastic Reward Nets for Reliability Prediction , 1996 .
[25] Kishor S. Trivedi,et al. A comprehensive model for software rejuvenation , 2005, IEEE Transactions on Dependable and Secure Computing.
[26] Jason Duell,et al. Requirements for Linux Checkpoint/Restart , 2002 .
[27] Lars Lundberg,et al. Optimal Recovery Schemes for High-Availability Cluster and Distributed Computing , 2001, J. Parallel Distributed Comput..
[28] J. Duell. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .