Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud

In recent years ad hoc parallel data processing has emerged to be one of the killer applications for Infrastructure-as-a-Service (IaaS) clouds. Major Cloud computing companies have started to integrate frameworks for parallel data processing in their product portfolio, making it easy for customers to access these services and to deploy their programs. However, the processing frameworks which are currently used have been designed for static, homogeneous cluster setups and disregard the particular nature of a cloud. Consequently, the allocated compute resources may be inadequate for big parts of the submitted job and unnecessarily increase processing time and cost. In this paper, we discuss the opportunities and challenges for efficient parallel data processing in clouds and present our research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's IaaS clouds for both, task scheduling and execution. Particular tasks of a processing job can be assigned to different types of virtual machines which are automatically instantiated and terminated during the job execution. Based on this new framework, we perform extended evaluations of MapReduce-inspired processing jobs on an IaaS cloud system and compare the results to the popular data processing framework Hadoop.

[1]  Dennis Gannon,et al.  Workflows for e-Science, Scientific Workflows for Grids , 2014 .

[2]  Dominic Battré,et al.  Nephele/PACTs: a programming model and execution framework for web-scale analytical processing , 2010, SoCC '10.

[3]  Anant Agarwal,et al.  An operating system for multicore and clouds: mechanisms and implementation , 2010, SoCC '10.

[4]  Odej Kao,et al.  Nephele: efficient parallel data processing in the cloud , 2009, MTAGS '09.

[5]  Lavanya Ramakrishnan,et al.  VGrADS: enabling e-Science workflows on grids and clouds with fault tolerance , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[6]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[7]  Bernd Freisleben,et al.  On-Demand Resource Provisioning for BPEL Workflows Using Amazon's Elastic Compute Cloud , 2009, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid.

[8]  Yong Zhao,et al.  Many-task computing for grids and supercomputers , 2008, 2008 Workshop on Many-Task Computing on Grids and Supercomputers.

[9]  Jingren Zhou,et al.  SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[10]  Rusty Russell,et al.  virtio: towards a de-facto standard for virtual I/O devices , 2008, OPSR.

[11]  B. Reed,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[12]  Yong Zhao,et al.  Falkon: a Fast and Light-weight tasK executiON framework , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[13]  Gregor von Laszewski,et al.  Swift: Fast, Reliable, Loosely Coupled Parallel Computation , 2007, 2007 IEEE Congress on Services (Services 2007).

[14]  Douglas Stott Parker,et al.  Map-reduce-merge: simplified relational data processing on large clusters , 2007, SIGMOD '07.

[15]  Rob Pike,et al.  Interpreting the data: Parallel analysis with Sawzall , 2005, Sci. Program..

[16]  Daniel S. Katz,et al.  Pegasus: A framework for mapping complex scientific workflows onto distributed systems , 2005, Sci. Program..

[17]  R. Davoli VDE: virtual distributed Ethernet , 2005, First International Conference on Testbeds and Research Infrastructures for the DEvelopment of NeTworks and COMmunities.

[18]  Robert D. Nowak,et al.  Maximum likelihood network topology identification from edge-based unicast measurements , 2002, SIGMETRICS '02.

[19]  Volker Markl,et al.  LEO - DB2's LEarning Optimizer , 2001, VLDB.

[20]  S. Tuecke,et al.  Condor-G: A Computation Management Agent for Multi-Institutional Grids , 2001, Proceedings 10th IEEE International Symposium on High Performance Distributed Computing.

[21]  Ian T. Foster,et al.  Globus: a Metacomputing Infrastructure Toolkit , 1997, Int. J. High Perform. Comput. Appl..

[22]  Sanjay Ghemawat,et al.  MapReduce: simplified data processing on large clusters , 2008, CACM.

[23]  Dmitrii Zagorodnov,et al.  Eucalyptus : A Technical Report on an Elastic Utility Computing Archietcture Linking Your Programs to Useful Systems , 2008 .

[24]  A. Kivity,et al.  kvm : the Linux Virtual Machine Monitor , 2007 .