Analysis in the Data Path of an Object-Centric Data Management System

Emerging high-performance computing (HPC) systems are expected to be deployed with an unprecedented level of complexity due to deep system memory and storage hierarchies. Efficient and scalable methods for managing and moving data through the multi-level storage hierarchy of upcoming HPC systems will be critical for scientific applications at exascale. In this paper, we propose in locus analysis, which allows users to register user-defined functions (UDFs) and runs those functions automatically while data moves between levels of a storage hierarchy. We implement this analysis-in-the-data-path approach in our object-centric data management system, Proactive Data Containers (PDC). The transparent invocation of analysis functions as part of PDC object mapping minimizes the latency of accessing data as it moves within the storage hierarchy. Because user-defined analysis or transform functions are invoked automatically by the PDC runtime, the user simply registers each function so that PDC can identify its name and the required list of actual parameters. To demonstrate the validity and flexibility of this approach, we have implemented several scientific analysis kernels and compared them against other HPC analysis-oriented approaches.
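To make the register-then-invoke-automatically idea concrete, the following C sketch shows one possible shape of such a workflow. It is a minimal illustration under assumptions: the names udf_register, simulate_data_movement, and analysis_udf_t are invented for this example and are not the actual PDC API; a small in-process registry stands in for the PDC runtime, and a single function call stands in for data movement through the storage hierarchy.

/* Illustrative sketch of in-locus UDF registration, in the spirit of PDC's
 * analysis-in-the-data-path approach. All names here (udf_register,
 * simulate_data_movement, analysis_udf_t) are assumptions for this example,
 * not the real PDC interface. */
#include <stdio.h>
#include <stddef.h>

/* A UDF receives the region being moved and writes a result into an
 * output buffer supplied by the runtime. */
typedef int (*analysis_udf_t)(const double *in, size_t n, double *out);

/* Example analysis kernel: compute the mean of the incoming region. */
static int region_mean(const double *in, size_t n, double *out)
{
    if (n == 0) return -1;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += in[i];
    *out = sum / (double)n;
    return 0;
}

/* Stand-in for the runtime's UDF registry: the user registers a function
 * name and a function pointer; the runtime looks it up later. */
static struct { const char *name; analysis_udf_t fn; } registry[8];
static int registry_count = 0;

static int udf_register(const char *name, analysis_udf_t fn)
{
    if (registry_count >= 8) return -1;
    registry[registry_count].name = name;
    registry[registry_count].fn = fn;
    registry_count++;
    return 0;
}

/* Stand-in for the data path: as a mapped object region "moves" between
 * storage levels, every registered UDF is invoked transparently on it. */
static void simulate_data_movement(const double *region, size_t n)
{
    for (int i = 0; i < registry_count; i++) {
        double result = 0.0;
        if (registry[i].fn(region, n, &result) == 0)
            printf("UDF %s -> %f\n", registry[i].name, result);
    }
}

int main(void)
{
    double region[] = {1.0, 2.0, 3.0, 4.0};

    /* The user only registers the function; invocation happens
     * automatically when the region moves through the hierarchy. */
    udf_register("region_mean", region_mean);
    simulate_data_movement(region, sizeof(region) / sizeof(region[0]));
    return 0;
}

The key design point the sketch tries to convey is that the analysis function is decoupled from the application's I/O calls: once registered, it runs wherever the runtime decides to move the mapped object data, without further user intervention.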
