A hierarchical watchdog mechanism for systemic fault awareness on distributed systems
Pier Stanislao Paolucci | Davide Rossetti | Roberto Ammendola | Andrea Biagioni | Ottorino Frezza | Francesca Lo Cicero | Alessandro Lonardo | Francesco Simula | Laura Tosoratto | Piero Vicini
[1] Rajendra Patrikar, et al. Implementation of Watch Dog Timer for Fault Tolerant Computing on Cluster Server, 2008.
[2] Dhabaleswar K. Panda, et al. Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[3] Davide Rossetti, et al. APEnet+ 34 Gbps data transmission system and custom transmission logic, 2013.
[4] A H Bhagyashree, et al. A hierarchical fault detection and recovery in a computational grid using watchdog timers, 2010, 2010 International Conference on Communication and Computational Intelligence (INCOCCI).
[5] Franck Cappello, et al. Fault prediction under the microscope: A closer look into HPC systems, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[6] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.
[7] Johan Vounckx, et al. Fault-Tolerance in Massively Parallel Systems, 1994.
[8] Davide Rossetti, et al. APEnet+: a 3D Torus network optimized for GPU-based HPC Systems, 2012.
[9] Davide Rossetti, et al. QUonG: A GPU-based HPC System Dedicated to LQCD Computing, 2011, 2011 Symposium on Application Accelerators in High-Performance Computing.
[11] Bran Selic, et al. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, 2013, The Journal of Supercomputing.
[12] James H. Laros, et al. rMPI: increasing fault resiliency in a message-passing environment, 2011.
[13] John Daly. A Model for Predicting the Optimum Checkpoint Interval for Restart Dumps, 2003, International Conference on Computational Science.
[14] Pier Stanislao Paolucci, et al. The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture, 2012, ArXiv.
[15] Davide Rossetti, et al. APEnet+ project status, 2012.
[16] Pier Stanislao Paolucci, et al. LO-FA-MO: Fault Detection and Systemic Awareness for the QUonG Computing System, 2014, 2014 IEEE 33rd International Symposium on Reliable Distributed Systems.
[17] David E. Culler, et al. The ganglia distributed monitoring system: design, implementation, and experience, 2004, Parallel Comput.
[18] Rainer Leupers, et al. EURETILE 2010-2012 summary: first three years of activity of the European Reference Tiled Experiment, 2013, ArXiv.
[19] Israel Koren, et al. Software-Based Failure Detection and Recovery in Programmable Network Interfaces, 2007, IEEE Transactions on Parallel and Distributed Systems.