LO-FA-MO: Fault Detection and Systemic Awareness for the QUonG Computing System
Pier Stanislao Paolucci | Davide Rossetti | Roberto Ammendola | Andrea Biagioni | Ottorino Frezza | Francesca Lo Cicero | Alessandro Lonardo | Francesco Simula | Laura Tosoratto | Piero Vicini
[1] Pier Stanislao Paolucci, et al. Design and implementation of a modular, low latency, fault-aware, FPGA-based network interface, 2013, 2013 International Conference on Reconfigurable Computing and FPGAs (ReConFig).
[2] Davide Rossetti, et al. APEnet+ project status, 2012.
[3] James H. Laros, et al. rMPI: increasing fault resiliency in a message-passing environment, 2011.
[4] Davide Rossetti, et al. QUonG: A GPU-based HPC System Dedicated to LQCD Computing, 2011, 2011 Symposium on Application Accelerators in High-Performance Computing.
[5] A H Bhagyashree, et al. A hierarchical fault detection and recovery in a computational grid using watchdog timers, 2010, 2010 International Conference on Communication and Computational Intelligence (INCOCCI).
[6] Massimo Bernaschi, et al. GPU Peer-to-Peer Techniques Applied to a Cluster Interconnect, 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.
[7] Rajendra Patrikar, et al. Implementation of Watch Dog Timer for Fault Tolerant Computing on Cluster Server, 2008.
[8] David Josephsen, et al. Building a Monitoring Infrastructure with Nagios, 2007.
[9] Davide Rossetti, et al. APEnet+: a 3D Torus network optimized for GPU-based HPC Systems, 2012.
[10] Israel Koren, et al. Software-Based Failure Detection and Recovery in Programmable Network Interfaces, 2007, IEEE Transactions on Parallel and Distributed Systems.
[11] Franck Cappello, et al. Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities, 2009, Int. J. High Perform. Comput. Appl.
[12] Bran Selic, et al. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, 2013, The Journal of Supercomputing.
[13] Johan Vounckx, et al. Fault-Tolerance in Massively Parallel Systems, 1994.
[14] David E. Culler, et al. The ganglia distributed monitoring system: design, implementation, and experience, 2004, Parallel Comput.
[15] Dhabaleswar K. Panda, et al. Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[16] Davide Rossetti, et al. APEnet+ 34 Gbps data transmission system and custom transmission logic, 2013.