PiCo: High-performance data analytics pipelines in modern C++

Abstract In this paper, we present PiCo (Pipeline Composition), a new C++ API with a fluent interface. PiCo's programming model aims to make data analytics applications easier to program while preserving or enhancing their performance. This is attained through three key design choices: (1) unifying the batch and stream data access models, (2) decoupling processing from data layout, and (3) exploiting a stream-oriented, scalable, efficient C++11 runtime system. PiCo proposes a programming model based on pipelines and operators that are polymorphic with respect to data types, in the sense that the same algorithms and pipelines can be reused on different data models (e.g., streams, lists, sets). Preliminary results show that PiCo, when compared to Spark and Flink, can attain better performance in terms of execution time and can greatly improve memory utilization, for both batch and stream processing.
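To make the idea of a fluent pipeline that is polymorphic over its data source more concrete, the following is a minimal sketch in C++11. It is not the PiCo API: the names `Pipe`, `source`, `map`, `filter`, `run`, and `even_squares` are hypothetical and chosen only for illustration, and the sketch runs sequentially rather than on a parallel runtime. It only shows how one pipeline definition might be applied unchanged to different containers.

```cpp
#include <functional>
#include <list>
#include <utility>
#include <vector>

// Illustrative sketch only: every name here is hypothetical, not PiCo's API.
// A Pipe<In, Out> turns each input item into zero or more output items,
// which it hands to an "emit" callback one at a time (stream style).
template <typename In, typename Out>
class Pipe {
public:
    using Emit = std::function<void(const Out&)>;
    using Step = std::function<void(const In&, const Emit&)>;

    explicit Pipe(Step step) : step_(std::move(step)) {}

    // Append a stage that transforms every item with f.
    template <typename F>
    auto map(F f) const -> Pipe<In, decltype(f(std::declval<Out>()))> {
        using Next = decltype(f(std::declval<Out>()));
        Step prev = step_;
        return Pipe<In, Next>(
            [prev, f](const In& x, const std::function<void(const Next&)>& emit) {
                prev(x, [&](const Out& y) { emit(f(y)); });
            });
    }

    // Append a stage that keeps only the items satisfying pred.
    Pipe<In, Out> filter(std::function<bool(const Out&)> pred) const {
        Step prev = step_;
        return Pipe<In, Out>([prev, pred](const In& x, const Emit& emit) {
            prev(x, [&](const Out& y) { if (pred(y)) emit(y); });
        });
    }

    // Run the pipeline over any iterable source (vector, list, set, ...).
    // The pipeline does not depend on how the source stores its items.
    template <typename Source>
    std::vector<Out> run(const Source& src) const {
        std::vector<Out> out;
        for (const auto& x : src) {
            step_(x, [&](const Out& y) { out.push_back(y); });
        }
        return out;
    }

private:
    Step step_;
};

// Start of a pipeline: passes every item through unchanged.
template <typename T>
Pipe<T, T> source() {
    return Pipe<T, T>(
        [](const T& x, const std::function<void(const T&)>& emit) { emit(x); });
}

// Example pipeline, composed fluently: keep even numbers, then square them.
inline Pipe<int, int> even_squares() {
    return source<int>()
        .filter([](const int& x) { return x % 2 == 0; })
        .map([](int x) { return x * x; });
}
```

Under these assumptions, calling `even_squares().run(c)` on a `std::vector<int>` or a `std::list<int>` holding 1, 2, 3, 4 yields the same result, because the pipeline is defined independently of the container it reads from.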
