Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience
Andrew A. Chien | Brian van Straalen | Pavan Balaji | Ignacio Laguna | Andrew R. Siegel | Ziming Zheng | Anshu Dubey | Nan Dun | Kamil Iskra | James Dinan | Peter H. Beckman | Aiman Fang | Hajime Fujita | Jeff R. Hammond | Michael A. Heroux | Keita Teranishi | Mark Hoemmen | D. Richards | Rob Schreiber | Zachary A. Rubenstein
[1] M. Berger, et al. Adaptive mesh refinement for hyperbolic partial differential equations, 1982.
[2] Jarek Nieplocha, et al. Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit, 2006, Int. J. High Perform. Comput. Appl.
[3] John A. Gunnels, et al. 100+ TFlop Solidification Simulations on BlueGene/L, 2005.
[4] Bradford L. Chamberlain, et al. Parallel Programmability and the Chapel Language, 2007, Int. J. High Perform. Comput. Appl.
[5] Benoit Forget, et al. The OpenMC Monte Carlo particle transport code, 2012.
[6] Katherine Yelick, et al. Introduction to UPC and Language Specification, 2000.
[7] Andrew A. Chien, et al. Log-Structured Global Array for Efficient Multi-Version Snapshots, 2015, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.
[8] P. Colella. Multidimensional upwind methods for hyperbolic conservation laws, 1990.
[9] Vivek Sarkar, et al. X10: an object-oriented approach to non-uniform cluster computing, 2005, OOPSLA '05.
[10] Robert E. Lyons, et al. The Use of Triple-Modular Redundancy to Improve Computer Reliability, 1962, IBM J. Res. Dev.
[11] Andrew A. Chien, et al. Error Checking and Snapshot-Based Recovery in a Preconditioned Conjugate Gradient Solver, 2013.
[12] Martin C. Rinard, et al. Verifying quantitative reliability for programs that execute on unreliable hardware, 2013, OOPSLA.
[13] William R. Martin, et al. The Monte Carlo Performance Benchmark Test - Aims, Specifications and First Results, 2011.
[14] Bronis R. de Supinski, et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[15] Heather Quinn, et al. Final report for CCS cross-layer reliability visioning study, 2010.
[16] Michael A. Heroux. Toward resilient algorithms and applications, 2013, FTXS '13.
[17] Sarita V. Adve, et al. Low-cost program-level detectors for reducing silent data corruptions, 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[18] Ron Brightwell, et al. Cooperative Application/OS DRAM Fault Recovery, 2011, Euro-Par Workshops.
[19] Tamara G. Kolda, et al. An overview of the Trilinos project, 2005, TOMS.
[20] Josef Bacik, et al. BTRFS: The Linux B-Tree Filesystem, 2013, TOS.
[21] Gene H. Golub, et al. Matrix computations, 1983.
[22] Jacob A. Abraham, et al. Algorithm-Based Fault Tolerance for Matrix Operations, 1984, IEEE Transactions on Computers.
[23] Sandia Report, et al. HPCG Technical Specification, 2013.
[24] Ravishankar K. Iyer, et al. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters, 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[25] Koji Sato, et al. The Linux implementation of a log-structured file system, 2006, OPSR.
[26] John A. Gunnels, et al. Simulating solidification in metals at high pressure: The drive to petascale computing, 2006.
[27] Thomas Hérault, et al. An evaluation of User-Level Failure Mitigation support in MPI, 2012, Computing.
[28] Sally A. McKee, et al. ROSE::FTTransform - A source-to-source translation framework for exascale fault-tolerance research, 2012, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).
[29] Kunle Olukotun, et al. The Future of Microprocessors, 2005, ACM Queue.
[30] John A. Gunnels, et al. Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability, 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[31] Dhabaleswar K. Panda, et al. CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems, 2009, 2009 International Conference on Parallel Processing.
[32] Zizhong Chen, et al. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods, 2013, PPoPP '13.
[33] Robert W. Numrich, et al. Co-array Fortran for parallel programming, 1998, FORF.
[34] Narayan Desai, et al. Co-analysis of RAS Log and Job Log on Blue Gene/P, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[35] Gene H. Golub, et al. Matrix computations (3rd ed.), 1996.
[36] Andrew A. Chien, et al. When is multi-version checkpointing needed?, 2013, FTXS '13.
[37] Franck Cappello, et al. FTI: High performance Fault Tolerance Interface for hybrid systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[38] Bronis R. de Supinski, et al. Soft error vulnerability of iterative linear algebra methods, 2007, ICS '08.
[39] Wilson C. Hsieh, et al. Bigtable: A Distributed Storage System for Structured Data, 2006, TOCS.
[40] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2006, IEEE Transactions on Dependable and Secure Computing.
[41] Mark F. Adams, et al. Chombo Software Package for AMR Applications Design Document, 2014.
[42] P. Colella, et al. Local adaptive mesh refinement for shock hydrodynamics, 1989.
[43] Andrew A. Chien, et al. Data decomposition in Monte Carlo neutron transport simulations using global view arrays, 2015, Int. J. High Perform. Comput. Appl.
[44] Dan Grossman, et al. EnerJ: approximate data types for safe and general low-power computation, 2011, PLDI '11.
[45] James H. Laros, et al. Evaluating the viability of process replication reliability for exascale systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[46] Martin C. Rinard, et al. Chisel: reliability- and accuracy-aware optimization of approximate computational kernels, 2014, OOPSLA.