Parallel Data Processing in the Cloud using Nephele

In recent years, Infrastructure-as-a-Service (IaaS) clouds have become increasingly popular as a flexible and inexpensive platform for ad-hoc parallel data processing. Major cloud providers have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and deploy their programs. However, the processing frameworks currently in use were designed for static, homogeneous cluster systems and do not support the new features that distinguish the cloud platform. This paper discusses the research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's IaaS clouds for both task scheduling and execution. First performance results of Nephele are presented, and its efficiency is compared with that of the well-known MapReduce framework. MapReduce is chosen for comparison because it is open source software and currently enjoys high popularity in the data processing community.
