Upgrading a high performance computing environment for massive data processing

High-performance computing (HPC) and massive data processing (Big Data) are two trends that are beginning to converge. In that process, aspects of hardware architectures, systems support and programming paradigms are being revisited from both perspectives. This paper presents our experience along this path of convergence and proposes a framework that addresses some of the programming issues arising from such integration. Our contribution is an environment that integrates (i) COMPSs, a programming framework for the development and execution of parallel applications on distributed infrastructures; (ii) Lemonade, a data mining and analysis tool; and (iii) HDFS, the most widely used distributed file system for Big Data systems. To validate our framework, we used Lemonade to create COMPSs applications that access data through HDFS, and compared them with equivalent applications built with Spark, a popular Big Data framework. The results show that the HDFS integration benefits COMPSs by simplifying data access and by rearranging data transfer, which reduces execution time. The integration with Lemonade makes COMPSs easier to use and may help popularize it in the Data Science community by providing efficient algorithm implementations for data-domain experts who want to develop applications at a higher level of abstraction.
