Online data analysis and reduction: An important co-design motif for extreme-scale computers

A growing disparity between supercomputer computation speeds and I/O rates means that it is rapidly becoming infeasible to analyze supercomputer application output only after that output has been written to a file system. Instead, data-generating applications must run concurrently with data reduction and/or analysis operations, with which they exchange information via high-speed methods such as interprocess communication. The resulting parallel computing motif, online data analysis and reduction (ODAR), has important implications for both application and HPC systems design. Here we introduce the ODAR motif and its co-design concerns, describe a co-design process for identifying and addressing those concerns, present tools that assist in the co-design process, and present case studies that illustrate the use of the process and tools in practical settings.
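As a rough illustration of the motif described above, the following minimal sketch runs a data-generating task concurrently with a reduction task and passes data between them through an in-memory channel rather than a file system. It is not taken from the paper's tools: the names `simulate`, `reduce_online`, and `run` are hypothetical, the "simulation" emits synthetic values, and Python threads with a bounded queue stand in for separate processes and a high-speed interprocess transport.

```python
import queue
import threading


def simulate(steps, out):
    """Hypothetical data-generating application: emits one field per time step."""
    for t in range(steps):
        field = [float(t * 10 + i) for i in range(10)]  # synthetic output
        out.put((t, field))
    out.put(None)  # sentinel: no more data


def reduce_online(inp, summary):
    """Hypothetical reduction task: keeps only per-step (min, max, mean)."""
    while True:
        item = inp.get()
        if item is None:
            break
        t, field = item
        summary[t] = (min(field), max(field), sum(field) / len(field))


def run(steps=5):
    # Bounded channel (assumption): the producer blocks if the reducer lags,
    # so at most two unreduced fields are buffered at any time.
    channel = queue.Queue(maxsize=2)
    summary = {}
    producer = threading.Thread(target=simulate, args=(steps, channel))
    consumer = threading.Thread(target=reduce_online, args=(channel, summary))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return summary


if __name__ == "__main__":
    for step, stats in sorted(run(3).items()):
        print(step, stats)
```

Only the three summary values per step are retained; the full fields are discarded once reduced. The bounded queue is one simple way to express the coupling between the two tasks: its size trades memory use against how often the producer stalls waiting for the reducer.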
