Pilot-Streaming: A Stream Processing Framework for High-Performance Computing

An increasing number of scientific applications utilize stream processing to analyze data feeds of scientific instruments, sensors, and simulations. In this paper, we study the streaming and data processing requirements of light source experiments, which are projected to generate data at 20 GB/sec in the near future. As beamtimes available to users are typically short, it is essential that processing and analysis can be conducted in a streaming mode. The development and deployment of streaming applications is a complex task and requires the integration of heterogeneous, distributed infrastructure, frameworks, middleware, and application components written in different languages and abstractions. Streaming applications may be extremely dynamic due to factors such as variable data rates, network congestion, and application-specific characteristics, such as adaptive sampling techniques and differing processing techniques. Consequently, streaming systems are often subject to back-pressure and instabilities, requiring additional infrastructure to mitigate these issues. We propose Pilot-Streaming, a framework for supporting streaming applications and their resource management needs on HPC infrastructure. Underlying Pilot-Streaming is a unifying architecture that decouples important concerns and functions, such as message brokering, transport and communication, and processing. Pilot-Streaming simplifies the deployment of stream processing frameworks, such as Kafka and Spark Streaming, while providing a high-level abstraction for managing streaming infrastructure, e.g., adding/removing resources as required by the application at runtime. This capability is critical for balancing complex streaming pipelines. To address the complexity in the development of streaming applications, we present the Streaming Mini-Apps, which support different pluggable algorithms for data generation and processing, e.g., for reconstructing light source images using different techniques. We use the Streaming Mini-Apps to evaluate the Pilot-Streaming framework, demonstrating its suitability for different use cases and workloads.
