A View from ORNL: Scientific Data Research Opportunities in the Big Data Age

One of the core issues across computer and computational science today is adapting to, managing, and learning from the influx of "Big Data". In the commercial space, this problem has driven huge investment in new technologies and capabilities that are well adapted to the human-generated logs, videos, texts, and other large-data artifacts being processed, and it has produced an explosion of useful platforms and languages (Hadoop, Spark, Pandas, etc.). However, translating this work from the enterprise space to the computational science and HPC community has proven somewhat difficult, in part because of fundamental differences in the type and scale of the data and in the timescales surrounding its generation and use. We describe a forward-looking research and development plan that centers on making Input/Output (I/O) intelligent for users in the scientific community, whether they are accessing scalable storage or performing in situ workflow tasks. Much of this work is grounded in our experience with the Adaptable I/O System (ADIOS 1.X) and our next-generation version of the software, ADIOS 2.X [1].
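To make the abstraction concrete, the following is a minimal sketch of the kind of declarative, self-describing I/O the ADIOS 2.X library exposes, where the same Put/Get calls can be redirected from file storage to an in situ consumer simply by changing the engine. This is an illustrative example only, assuming a serial (non-MPI) ADIOS 2 build; the names "demo", "temperature", and "demo.bp" are arbitrary placeholders, not anything prescribed by the paper.

    #include <adios2.h>
    #include <vector>

    int main()
    {
        // Illustrative data: a small 1-D array owned by a single process.
        const std::size_t N = 100;
        std::vector<double> temperature(N, 273.15);

        adios2::ADIOS adios;                      // ADIOS driver (serial build assumed)
        adios2::IO io = adios.DeclareIO("demo");  // "demo" is an arbitrary IO group name

        // Engine selection is the knob that separates "what" from "how":
        // "BPFile" writes self-describing BP files; a staging engine such as
        // "SST" would instead stream the same variables to an in situ reader.
        io.SetEngine("BPFile");

        // Declare a global array: shape, start offset, and local count.
        adios2::Variable<double> var =
            io.DefineVariable<double>("temperature", {N}, {0}, {N});

        adios2::Engine writer = io.Open("demo.bp", adios2::Mode::Write);
        writer.BeginStep();                       // one output step
        writer.Put(var, temperature.data());      // publish the data for this step
        writer.EndStep();
        writer.Close();
        return 0;
    }

Because the application only declares variables and steps, decisions about placement, buffering, reduction, or routing to analysis services can be made by the I/O layer itself, which is the sense in which the plan aims to make I/O "intelligent" on behalf of the user.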

[1] Justin J. Miller, et al. Graph Database Applications and Concepts with Neo4j, 2013.

[2] C. L. Philip Chen, et al. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data, 2014, Inf. Sci.

[3] Robert Latham, et al. I/O performance challenges at leadership scale, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[4] Robert Latham, et al. 24/7 Characterization of petascale I/O workloads, 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[5] F. Jenko, et al. Electron temperature gradient turbulence, 2000, Physical Review Letters.

[6] Scott Klasky, et al. Exacution: Enhancing Scientific Data Management for Exascale, 2017, 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS).

[7] José E. Moreira, et al. Topology Mapping for Blue Gene/L Supercomputer, 2006, ACM/IEEE SC 2006 Conference (SC'06).

[8] Scott Klasky, et al. Extending Skel to Support the Development and Optimization of Next Generation I/O Systems, 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).

[9] Karsten Schwan, et al. GoldRush: Resource efficient in situ scientific data analytics using fine-grained interference aware execution, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[10] Frank Jenko, et al. The global version of the gyrokinetic turbulence code GENE, 2011, J. Comput. Phys.

[11] Lipeng Wan, et al. SSD-optimized workload placement with adaptive learning and classification in HPC environments, 2014, 2014 30th Symposium on Mass Storage Systems and Technologies (MSST).

[12] Manish Parashar, et al. Meteor: a middleware infrastructure for content-based decoupled interactions in pervasive grid environments, 2008, Concurr. Comput. Pract. Exp.

[13] Marianne Winslett, et al. A Multiplatform Study of I/O Behavior on Petascale Supercomputers, 2015, HPDC.

[14] Lipeng Wan, et al. Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems, 2017, J. Parallel Distributed Comput.

[15] Scott Klasky, et al. Analysis and Modeling of the End-to-End I/O Performance on OLCF's Titan Supercomputer, 2017, 2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS).

[16] Scott Klasky, et al. TGE: Machine Learning Based Task Graph Embedding for Large-Scale Topology Mapping, 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).

[17] Scott Klasky, et al. Multilevel Techniques for Compression and Reduction of Scientific Data - The Multivariate Case, 2019, SIAM J. Sci. Comput.

[18] Karsten Schwan, et al. Event-based systems: opportunities and challenges at exascale, 2009, DEBS '09.

[19] Guan Le, et al. Survey on NoSQL database, 2011, 2011 6th International Conference on Pervasive Computing and Applications.

[20] Karsten Schwan, et al. Landrush: Rethinking In-Situ Analysis for GPGPU Workflows, 2016, 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid).

[21] Laurent Villard, et al. Global and local gyrokinetic simulations of high-performance discharges in view of ITER, 2013.

[22] Laxmikant V. Kalé, et al. Topology-aware task mapping for reducing communication contention on large parallel machines, 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[23] Torsten Hoefler, et al. Generic topology mapping strategies for large-scale parallel architectures, 2011, ICS '11.

[24] Ian T. Foster. Computing Just What You Need: Online Data Analysis and Reduction at Extreme Scales, 2017, HiPC.

[25] Scott Klasky, et al. Moving the Code to the Data - Dynamic Code Deployment Using ActiveSpaces, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[26] Scott Klasky, et al. DataSpaces: an interaction and coordination framework for coupled simulation workflows, 2012, HPDC '10.

[27] Todd Gamblin, et al. Machine Learning Predictions of Runtime and IO Traffic on High-End Clusters, 2016, 2016 IEEE International Conference on Cluster Computing (CLUSTER).

[28] Scott Klasky, et al. Comprehensive Measurement and Analysis of the User-Perceived I/O Performance in a Production Leadership-Class Storage System, 2017, 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS).

[29] Daniel S. Katz, et al. Pegasus: A framework for mapping complex scientific workflows onto distributed systems, 2005, Sci. Program.

[30] Jianwu Wang, et al. Big data provenance: Challenges, state of the art and opportunities, 2015, 2015 IEEE International Conference on Big Data (Big Data).

[31] Bertram Ludäscher, et al. Scientific workflow management and the Kepler system, 2006.

[32] Scott Klasky, et al. Preparing for In Situ Processing on Upcoming Leading-edge Supercomputers, 2016, Supercomput. Front. Innov.

[33] Arie Shoshani, et al. Hello ADIOS: the challenges and lessons of developing leadership class I/O frameworks, 2014, Concurr. Comput. Pract. Exp.

[34] Zhengji Zhao, et al. I/O Performance on Cray XC30, 2014.

[35] Ada Gavrilovska, et al. GPUShare: Fair-Sharing Middleware for GPU Clouds, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[36] David Pugmire, et al. Performance Modeling of In Situ Rendering, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[37] Tao Lu, et al. Canopus: Enabling Extreme-Scale Data Analytics on Big HPC Storage via Progressive Refactoring, 2017, HotStorage.

[38] Scott Klasky, et al. Exascale Storage Systems the SIRIUS Way, 2016.

[39] Scott Klasky, et al. Visualization and Analysis Requirements for In Situ Processing for a Large-Scale Fusion Simulation Code, 2016, 2016 Second Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization (ISAV).

[40] Karsten Schwan, et al. SODA: Science-Driven Orchestration of Data Analytics, 2015, 2015 IEEE 11th International Conference on e-Science.

[41] Karsten Schwan, et al. Service Augmentation for High End Interactive Data Services, 2005, 2005 IEEE International Conference on Cluster Computing.

[42] Nagiza F. Samatova, et al. Compressed ion temperature gradient turbulence in diverted tokamak edge, 2009.

[43] Robert Hager, et al. Gyrokinetic neoclassical study of the bootstrap current in the tokamak edge pedestal with fully non-linear Coulomb collisions, 2016.

[44] Kwan-Liu Ma, et al. VTK-m: Accelerating the Visualization Toolkit for Massively Threaded Architectures, 2016, IEEE Computer Graphics and Applications.

[45] Marta Mattoso, et al. Handling Failures in Parallel Scientific Workflows Using Clouds, 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.

[46] David Pugmire, et al. Global adjoint tomography: first-generation model, 2016.