CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows
Manish Parashar | Shaohua Duan | Keita Teranishi | Hemanth Kolla | Marc Gamell | Philip E. Davis | Pradeep Subedi
[1] Kerstin Kleese van Dam, et al. Management, analysis, and visualization of experimental and observational data – The convergence of data and computing, 2016, 2016 IEEE 12th International Conference on e-Science (e-Science).
[2] Ray W. Grout, et al. Ultrascale Visualization In Situ Visualization for Large-Scale Combustion Simulations, 2010.
[3] Franck Cappello, et al. Reducing Waste in Extreme Scale Systems through Introspective Analysis, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[4] Bin Nie, et al. A large-scale study of soft-errors on GPUs in the field, 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[5] E Wes Bethel, et al. Report of the DOE Workshop on Management, Analysis, and Visualization of Experimental and Observational data – The Convergence of Data and Computing, 2016.
[6] Franck Cappello, et al. Distributed Diskless Checkpoint for Large Scale Systems, 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.
[7] Patrick P. C. Lee, et al. Erasure coding for small objects in in-memory KV storage, 2017, SYSTOR.
[8] Christian Engelmann, et al. Combining Partial Redundancy and Checkpointing for HPC, 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.
[9] Fan Zhang, et al. ActiveSpaces: Exploring dynamic code deployment for extreme scale data processing, 2015, Concurr. Comput. Pract. Exp.
[10] Paul Messina, et al. The Exascale Computing Project, 2017, Comput. Sci. Eng.
[11] Scott Klasky, et al. Terascale direct numerical simulations of turbulent combustion using S3D, 2008.
[12] Scott Klasky, et al. Stacker: An Autonomic Data Movement Engine for Extreme-Scale Data Staging-Based In-Situ Workflows, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[13] Franck Cappello, et al. Toward Exascale Resilience: 2014 update, 2014, Supercomput. Front. Innov.
[14] Xubin He, et al. A Comprehensive Analysis of XOR-Based Erasure Codes Tolerating 3 or More Concurrent Failures, 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.
[15] Manish Parashar, et al. Scalable Data Resilience for In-memory Data Staging, 2018, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[16] Carlos Maltzahn, et al. Ceph: a scalable, high-performance distributed file system, 2006, OSDI '06.
[17] Jianliang Xu, et al. Real-Time In-Memory Checkpointing for Future Hybrid Memory Systems, 2015, ICS.
[18] Lavanya Ramakrishnan, et al. AnalyzeThis: an analysis workflow-aware storage system, 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[19] Ada Gavrilovska, et al. SmartBlock: An Approach to Standardizing In Situ Workflow Components, 2017, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[20] Xian-He Sun, et al. Harmonia: An Interference-Aware Dynamic I/O Scheduler for Shared Non-volatile Burst Buffers, 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).
[21] Robert Latham, et al. Leveraging burst buffer coordination to prevent I/O interference, 2016, 2016 IEEE 12th International Conference on e-Science (e-Science).
[22] Thomas Hérault, et al. Post-failure recovery of MPI communication capability, 2013, Int. J. High Perform. Comput. Appl.
[23] F. Moore, et al. Polynomial Codes Over Certain Finite Fields, 2017.
[24] Scott Klasky, et al. DataSpaces: an interaction and coordination framework for coupled simulation workflows, 2012, HPDC '10.
[25] Franck Cappello, et al. FTI: High performance Fault Tolerance Interface for hybrid systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[26] Saurabh Gupta, et al. Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems, 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[27] Bran Selic, et al. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, 2013, The Journal of Supercomputing.
[28] Peter Desnoyers, et al. Active flash: towards energy-efficient, in-situ data analytics on extreme-scale machines, 2013, FAST.
[29] Catherine D. Schuman, et al. A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries for Storage, 2009, FAST.
[30] Herbert Bos, et al. Techniques for efficient in-memory checkpointing, 2013, HotDep.
[31] Karsten Schwan, et al. PreDatA – preparatory data analytics on peta-scale machines, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[32] Christian Engelmann, et al. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[33] Tao Lu, et al. Toward Managing HPC Burst Buffers Effectively: Draining Strategy to Regulate Bursty I/O Behavior, 2017, 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).
[34] Carlos Maltzahn, et al. Efficient, Failure Resilient Transactions for Parallel and Distributed Computing, 2014, 2014 International Workshop on Data Intensive Scalable Computing Systems.
[35] Ray W. Grout, et al. Feature-Based Statistical Analysis of Combustion Simulation Data, 2011, IEEE Transactions on Visualization and Computer Graphics.
[36] Manish Parashar, et al. Local recovery and failure masking for stencil-based applications at extreme scales, 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[37] Guillaume Aupy, et al. Sizing and Partitioning Strategies for Burst-Buffers to Reduce IO Contention, 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[38] Thomas Hérault, et al. An evaluation of User-Level Failure Mitigation support in MPI, 2012, Computing.
[39] Jiaqi Liu, et al. Supporting Fault-Tolerance in Presence of In-Situ Analytics, 2017, CCGrid.
[40] Heng Zhang, et al. Efficient and Available In-Memory KV-Store with Hybrid Erasure Coding and Replication, 2016, FAST.
[41] Bin Nie, et al. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System, 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[42] Fan Zhang, et al. Combining in-situ and in-transit processing to enable extreme-scale scientific analysis, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.