Paper information - CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows

© 2020 Association for Computing Machinery. ACM Transactions on Parallel Computing, Vol. 7, No. 2, Article 12. Publication date: May 2020. https://doi.org/10.1145/3391448

[1] Kerstin Kleese van Dam, et al. Management, analysis, and visualization of experimental and observational data – The convergence of data and computing, 2016, 2016 IEEE 12th International Conference on e-Science (e-Science).

[2] Ray W. Grout, et al. Ultrascale Visualization In Situ Visualization for Large-Scale Combustion Simulations, 2010.

[3] Franck Cappello, et al. Reducing Waste in Extreme Scale Systems through Introspective Analysis, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[4] Bin Nie, et al. A large-scale study of soft-errors on GPUs in the field, 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[5] E Wes Bethel, et al. Report of the DOE Workshop on Management, Analysis, and Visualization of Experimental and Observational data – The Convergence of Data and Computing, 2016.

[6] Franck Cappello, et al. Distributed Diskless Checkpoint for Large Scale Systems, 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.

[7] Patrick P. C. Lee, et al. Erasure coding for small objects in in-memory KV storage, 2017, SYSTOR.

[8] Christian Engelmann, et al. Combining Partial Redundancy and Checkpointing for HPC, 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.

[9] Fan Zhang, et al. ActiveSpaces: Exploring dynamic code deployment for extreme scale data processing, 2015, Concurr. Comput. Pract. Exp.

[10] Paul Messina, et al. The Exascale Computing Project, 2017, Comput. Sci. Eng.

[11] Scott Klasky, et al. Terascale direct numerical simulations of turbulent combustion using S3D, 2008.

[12] Scott Klasky, et al. Stacker: An Autonomic Data Movement Engine for Extreme-Scale Data Staging-Based In-Situ Workflows, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.

[13] Franck Cappello, et al. Toward Exascale Resilience: 2014 update, 2014, Supercomput. Front. Innov.

[14] Xubin He, et al. A Comprehensive Analysis of XOR-Based Erasure Codes Tolerating 3 or More Concurrent Failures, 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.

[15] Manish Parashar, et al. Scalable Data Resilience for In-memory Data Staging, 2018, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[16] Carlos Maltzahn, et al. Ceph: a scalable, high-performance distributed file system, 2006, OSDI '06.

[17] Jianliang Xu, et al. Real-Time In-Memory Checkpointing for Future Hybrid Memory Systems, 2015, ICS.

[18] Lavanya Ramakrishnan, et al. AnalyzeThis: an analysis workflow-aware storage system, 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[19] Ada Gavrilovska, et al. SmartBlock: An Approach to Standardizing In Situ Workflow Components, 2017, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[20] Xian-He Sun, et al. Harmonia: An Interference-Aware Dynamic I/O Scheduler for Shared Non-volatile Burst Buffers, 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).

[21] Robert Latham, et al. Leveraging burst buffer coordination to prevent I/O interference, 2016, 2016 IEEE 12th International Conference on e-Science (e-Science).

[22] Thomas Hérault, et al. Post-failure recovery of MPI communication capability, 2013, Int. J. High Perform. Comput. Appl.

[23] F. Moore, et al. Polynomial Codes Over Certain Finite Fields, 2017.

[24] Scott Klasky, et al. DataSpaces: an interaction and coordination framework for coupled simulation workflows, 2012, HPDC '10.

[25] Franck Cappello, et al. FTI: High performance Fault Tolerance Interface for hybrid systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[26] Saurabh Gupta, et al. Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems, 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[27] Bran Selic, et al. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, 2013, The Journal of Supercomputing.

[28] Peter Desnoyers, et al. Active flash: towards energy-efficient, in-situ data analytics on extreme-scale machines, 2013, FAST.

[29] Catherine D. Schuman, et al. A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries for Storage, 2009, FAST.

[30] Herbert Bos, et al. Techniques for efficient in-memory checkpointing, 2013, HotDep.

[31] Karsten Schwan, et al. PreDatA – preparatory data analytics on peta-scale machines, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[32] Christian Engelmann, et al. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[33] Tao Lu, et al. Toward Managing HPC Burst Buffers Effectively: Draining Strategy to Regulate Bursty I/O Behavior, 2017, 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).

[34] Carlos Maltzahn, et al. Efficient, Failure Resilient Transactions for Parallel and Distributed Computing, 2014, 2014 International Workshop on Data Intensive Scalable Computing Systems.

[35] Ray W. Grout, et al. Feature-Based Statistical Analysis of Combustion Simulation Data, 2011, IEEE Transactions on Visualization and Computer Graphics.

[36] Manish Parashar, et al. Local recovery and failure masking for stencil-based applications at extreme scales, 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[37] Guillaume Aupy, et al. Sizing and Partitioning Strategies for Burst-Buffers to Reduce IO Contention, 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[38] Thomas Hérault, et al. An evaluation of User-Level Failure Mitigation support in MPI, 2012, Computing.

[39] Jiaqi Liu, et al. Supporting Fault-Tolerance in Presence of In-Situ Analytics, 2017, CCGrid.

[40] Heng Zhang, et al. Efficient and Available In-Memory KV-Store with Hybrid Erasure Coding and Replication, 2016, FAST.

[41] Bin Nie, et al. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System, 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[42] Fan Zhang, et al. Combining in-situ and in-transit processing to enable extreme-scale scientific analysis, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.