Fault-Driven Re-Scheduling For Improving System-level Fault Resilience
Zhiling Lan | Yawei Li | Xian-He Sun | Prashasta Gujrati
[1] Ricardo Vilalta, et al. Predicting rare events in temporal domains, 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings.
[2] Kang G. Shin, et al. A Fault-Tolerant Scheduling Algorithm for Real-Time Periodic Tasks with Possible Software Faults, 2003, IEEE Trans. Computers.
[3] Anand Sivasubramaniam, et al. Fault-aware job scheduling for BlueGene/L systems, 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[4] S. Scott, et al. A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster, 2004.
[5] Mark S. Squillante, et al. Performance Implications of Failures in Large-Scale Cluster Scheduling, 2004, JSSPP.
[6] Miroslaw Malek, et al. Advanced Failure Prediction in Complex Software Systems, 2004.
[7] Atakan Dogan, et al. Reliable matching and scheduling of precedence-constrained tasks in heterogeneous distributed computing, 2000, Proceedings 2000 International Conference on Parallel Processing.
[8] J. P. Herzog, et al. Application of a model-based fault detection system to nuclear plant signals, 1997.
[9] Jonathan M. Smith, et al. A survey of process migration mechanisms, 1988, OPSR.
[10] Charng-da Lu, et al. Scalable Diskless Checkpointing for Large Parallel Systems, 2005.
[11] Bruce Allen, et al. Monitoring hard disks with smart, 2004.
[12] Dror G. Feitelson, et al. Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling, 2001, IEEE Trans. Parallel Distributed Syst.
[13] J.-P. Wang, et al. Task Allocation for Maximizing Reliability of Distributed Computer Systems, 1992, IEEE Trans. Computers.
[14] Francine Berman, et al. New Grid Scheduling and Rescheduling Methods in the GrADS Project, 2004, IPDPS Next Generation Software Program - NSFNGS - PI Workshop.
[15] Anand Sivasubramaniam, et al. Critical event prediction for proactive management in large-scale computer clusters, 2003, KDD '03.
[16] Cong Du, et al. MPI-Mitten: Enabling Migration Technology in MPI, 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).
[17] Hong Jiang, et al. A Dynamic and Reliability-Driven Scheduling Algorithm for Parallel Real-time Jobs on Heterogeneous Clusters, 2005.
[18] Zhiling Lan, et al. Exploit failure prediction for adaptive fault-tolerance in cluster computing, 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).
[19] E. N. Elnozahy, et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, 2004, IEEE Transactions on Dependable and Secure Computing.
[20] Fabrizio Petrini, et al. System-level fault-tolerance in large-scale parallel machines with buffered coscheduling, 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[21] Richard Wolski, et al. Automatic methods for predicting machine availability in desktop Grid and peer-to-peer systems, 2004, IEEE International Symposium on Cluster Computing and the Grid, 2004. CCGrid 2004.
[22] Niraj K. Jha, et al. Safety and Reliability Driven Task Allocation in Distributed Systems, 1999, IEEE Trans. Parallel Distributed Syst.
[23] Xiao Qin, et al. A dynamic and reliability-driven scheduling algorithm for parallel real-time jobs executing on heterogeneous clusters, 2005, J. Parallel Distributed Comput.
[24] 2007 International Conference on Parallel Processing, 2007.
[25] Miron Livny, et al. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System, 1997.