CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows

© 2020 Association for Computing Machinery. 2329-4949/2020/05-ART12 $15.00. https://doi.org/10.1145/3391448. ACM Transactions on Parallel Computing, Vol. 7, No. 2, Article 12. Publication date: May 2020.
