Extension of a Task-Based Model to Functional Programming

Recently, efforts have been made to bring together the areas of high-performance computing (HPC) and massive data processing (Big Data). Traditional HPC frameworks, like COMPSs, are mostly task-based, while popular Big Data environments, like Spark, are based on functional programming principles. The former are known for good performance on regular, matrix-based computations, whereas the latter have often been considered more successful for fine-grained, data-parallel workloads. In this paper we present our experience with integrating some dataflow techniques into COMPSs, a task-based framework, in an effort to bring together the best aspects of both worlds. We present our API, called DDF, which provides a new data abstraction that addresses the challenges of integrating Big Data application scenarios into COMPSs. DDF has a functional interface, similar to many Data Science tools, that allows us to use dynamic evaluation to adapt task execution at runtime. Besides the performance optimization it provides, the API makes it easier for application-domain experts to develop applications. We evaluate DDF's effectiveness by comparing the resulting programs to their original versions in COMPSs and Spark. The results show that DDF can improve COMPSs execution time and even outperform Spark in many use cases.
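
As a rough illustration of how a functional interface can defer evaluation and adapt task execution at runtime, the following Python sketch records chained operations and fuses them into a single task per partition only when a result is requested. The class `LazyFrame` and its methods are hypothetical and are not the DDF or COMPSs API; the sequential loop in `collect` merely stands in for submitting tasks to a runtime.

```python
from typing import Any, Callable, List, Tuple

# Each recorded operation is a (kind, function) pair.
Op = Tuple[str, Callable[[Any], Any]]


class LazyFrame:
    """Hypothetical lazy data abstraction (an assumption, not the DDF API)."""

    def __init__(self, partitions: List[List[Any]], ops: Tuple[Op, ...] = ()):
        self.partitions = partitions
        self.ops = ops  # pending operations; nothing has been executed yet

    def map(self, fn: Callable[[Any], Any]) -> "LazyFrame":
        # Only record the operation and return a new frame.
        return LazyFrame(self.partitions, self.ops + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyFrame":
        return LazyFrame(self.partitions, self.ops + (("filter", pred),))

    def _fused_task(self, part: List[Any]) -> List[Any]:
        # One pass over the partition applies every pending operation,
        # so a chain of N operations costs one task instead of N.
        out = []
        for item in part:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

    def collect(self) -> List[Any]:
        # Evaluation is triggered here. A task-based runtime would submit
        # one fused task per partition; this sketch runs them sequentially.
        results: List[Any] = []
        for part in self.partitions:
            results.extend(self._fused_task(part))
        return results


frame = LazyFrame([[1, 2, 3], [4, 5, 6]])
pending = frame.map(lambda x: x * 10).filter(lambda x: x > 20)
result = pending.collect()
print(result)  # [30, 40, 50, 60]
```

Because the operations are only recorded until `collect` is called, the evaluator sees the whole chain before anything runs, which is what makes runtime decisions such as fusing narrow operations into one task possible.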