ABSTRACT: In recent years, ad hoc parallel data processing has become one of the emerging applications for Infrastructure-as-a-Service (IaaS) cloud environments. Current processing frameworks were designed for homogeneous cluster setups, which consequently leads to increased processing time and cost in the cloud. In this paper we present Nephele, the first data processing framework to exploit dynamic resource allocation in a cloud environment. Particular tasks of a processing job can be assigned to different types of virtual machines, which are automatically instantiated and terminated during job execution. Based on this framework, we perform MapReduce-inspired processing jobs on an IaaS cloud system and compare the results to the popular data processing framework Hadoop.

I. INTRODUCTION

In recent times, a growing number of companies have to process large amounts of data in a cost-efficient manner. Classic examples of such companies are Google, Yahoo!, and Microsoft. The huge amount of data they work with has made traditional database solutions prohibitively expensive. Instead, these companies have turned to large numbers of commodity servers. To simplify the development of distributed applications, many of them have also built custom data processing frameworks; Google's MapReduce and Yahoo!'s Map-Reduce-Merge are examples. Such systems are classified under terms like high-throughput computing (HTC) or many-task computing (MTC), depending on the amount of data and the number of tasks involved in the computation. Although these systems differ in design, their programming models share similar objectives: the framework takes care of distributing the program among the available nodes and executes each instance of the program on the appropriate fragment of data.

Meanwhile, cloud computing has emerged as an attractive way to rent a large IT infrastructure on demand. Operators such as Amazon EC2 allow their clients to access, allocate, and control a set of virtual machines (VMs) that run inside the operator's data centers, charging them for the period of time the machines are used. The VMs are typically offered in different types, each with distinct hardware characteristics and cost. Projects like Hadoop, an existing open source implementation of Google's MapReduce framework, have already been promoted for use in the cloud. However, instead of embracing the cloud's dynamic resource allocation, current data processing frameworks expect the cloud to imitate the static nature of the cluster environments they were originally designed for. As a result, the rented resources may be poorly suited to the job being processed, which may lower the overall performance and increase the cost.

In this paper, we present Nephele, a new processing framework designed for cloud environments. Nephele is the first data processing framework to perform dynamic allocation and deallocation of resources from a cloud during job execution. This paper includes details on its scheduling strategies and first results. The paper is structured as follows: Section II discusses the challenges of efficient parallel data processing in the cloud. Section III presents the basic Nephele architecture and describes how jobs are executed in the cloud. Section IV provides details on Nephele's performance and optimizations, and the paper concludes with a discussion of related work.
II. CHALLENGES

In this section, we briefly discuss the challenges of efficient parallel data processing in the cloud. The major challenge is the cloud's opaqueness. Current data processing frameworks attempt to schedule tasks onto compute nodes using knowledge of the network topology in order to avoid bottlenecks; in a cloud, this topology information is completely hidden from the user, so scheduling may create congestion in the network. The system must therefore become aware of the cloud environment and of the jobs it executes. In addition, the programming paradigm used should be powerful enough to depict the dependencies between tasks, and the system should know when to allocate and deallocate VMs. Finally, the scheduler of such a processing framework must be able to determine which task of a job should be executed on which type of VM. One way to ensure locality between the tasks of a job is to execute those tasks on the same VM. This favors allocating fewer, but more powerful, VMs with multiple CPU cores: scheduling tasks on machines with multiple cores rather than on several single-core machines preserves data locality.

III. DESIGN

Based on these challenges, we designed Nephele, the first data processing framework for the cloud.

A. ARCHITECTURE

Nephele's architecture follows a classic master-worker pattern, as illustrated in Fig. 1.

Fig. 1. Nephele architecture.

Before submitting a Nephele job, the user must start a VM in the cloud which runs the so-called Job Manager (JM). The Job Manager receives jobs, schedules them, and coordinates their execution. It communicates with the interface the cloud operator provides to control the instantiation of VMs; we refer to this interface as the Cloud Controller. Through the Cloud Controller, the JM can allocate or deallocate VMs according to the current job execution. The term instance type is used to differentiate between VMs with different hardware characteristics. The actual execution of a Nephele job's tasks is carried out by a set of instances, each of which runs a so-called Task Manager (TM). A Task Manager receives tasks from the Job Manager, executes them, and informs the Job Manager about their completion or possible errors. Until the Job Manager receives a job, we expect the set of instances to be empty. Upon receiving a job, the Job Manager decides how many and what type of instances the job should be executed on, and when the corresponding instances must be allocated or deallocated to ensure cost-efficient processing.

B. SCHEDULING STRATEGIES

The basic idea for refining the scheduling strategy for recurring jobs is to use feedback data. We developed a profiling subsystem for Nephele, based on the Java Management Extensions (JMX), which continuously monitors the running tasks and their instances.
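As an illustration of the kind of measurement this involves, the sketch below shows how the split between time spent in user code and time spent waiting can be obtained for a task via JMX's ThreadMXBean. This is a minimal sketch of the general technique, not Nephele's actual profiling code; the TaskProfiler class and its method names are our own, and it assumes thread CPU time measurement is supported and enabled on the JVM.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class TaskProfiler {

    private final ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();

    // Runs a task in the calling thread and reports how its wall-clock time
    // splits into CPU time (user code) and the remainder, which is dominated
    // by waiting for data.
    public void profile(Runnable task) {
        long wallStart = System.nanoTime();
        long cpuStart = threadBean.getCurrentThreadCpuTime();

        task.run();

        long wallTime = System.nanoTime() - wallStart;
        long cpuTime = threadBean.getCurrentThreadCpuTime() - cpuStart;
        long waitTime = wallTime - cpuTime;

        // A high CPU share hints at a computational bottleneck (raise the
        // task's degree of parallelization); a high wait share hints at an
        // I/O bottleneck (consider faster channel types or other instances).
        System.out.printf("cpu=%d ms, wait=%d ms (%.0f%% busy)%n",
                cpuTime / 1_000_000, waitTime / 1_000_000,
                100.0 * cpuTime / Math.max(wallTime, 1));
    }
}

In the running system such measurements are of course taken continuously for the Task Manager's worker threads rather than around a single run() call.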
Using this profiling data, the system is capable of breaking down the time a task spends processing user code and the time it waits for data. With the collected data, Nephele is able to detect both computational and I/O bottlenecks. A computational bottleneck suggests a higher degree of parallelization for the affected tasks, while an I/O bottleneck provides hints to switch to faster channel types and to reconsider the chosen instance types. Nephele also generates a cryptographic signature for every task, so that recurring tasks can be identified and the already recorded profiling data can be exploited. At present, we use the profiling data to detect bottlenecks and to help the user choose suitable annotations for the job; the user can apply this feedback to improve the job's annotations. In more advanced versions of Nephele, the system could adjust to detected bottlenecks automatically, either between consecutive executions of the same job or during the job's execution at runtime.

While the allocation time of cloud instances is determined by the start times of their subtasks, there are different strategies for deallocation, and Nephele tracks the instances' allocation times to support them. An instance of a particular type that finishes its work in the current Execution Stage is not immediately deallocated if an instance of the same type is required in an upcoming Execution Stage. Instead, Nephele keeps the instance allocated until the end of its current lease period. If the succeeding Execution Stage begins before the end of that period, the instance is reassigned to a task of that stage; otherwise, it is deallocated early enough not to cause any additional cost.

IV. EVALUATION

In this section, we present first performance results of Nephele and compare them to the Hadoop data processing framework. We have chosen Hadoop because it is open source software and enjoys high popularity for data-intensive processing. Hadoop has been designed to run on a very large number of nodes (i.e., thousands of nodes), which matches the scale of current IaaS clouds. A typical MapReduce job was chosen to run on both frameworks: Section IV.A describes the MapReduce job on Hadoop, and Section IV.B describes the same job on the Nephele framework.

A. MAPREDUCE AND HADOOP

In order to execute the task with Hadoop, we created MapReduce programs which were executed consecutively. The MapReduce job reads the input data set, counts the number of occurrences of each word, and writes the counts back to Hadoop's HDFS file system. Since the MapReduce engine internally groups the incoming data by key between the map and the reduce phase, we could simply use the standard word count code, which is well suited for this kind of task; a sketch is given at the end of this section. The result of this MapReduce job was a file containing the count of each word in the input file.

B. MAPREDUCE AND NEPHELE

For Nephele, we used the same MapReduce program we wrote for the previously described Hadoop experiment and executed it on top of Nephele. In order to do so, we developed a set of wrapper classes providing interface compatibility with Hadoop together with the required functionality. These classes allowed us to run the MapReduce code on top of Nephele without modification; a sketch of the wrapper idea follows the word count listing below.
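For reference, the word count job used in both experiments has the standard shape. The following is a minimal sketch against Hadoop's org.apache.hadoop.mapreduce API; it follows the textbook word count example rather than our exact experimental code, and the driver class that configures and submits the job is omitted.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The mapper tokenizes each input line and emits a (word, 1) pair
    // for every token it finds.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The reducer receives all counts for one word, grouped by the engine
    // between the map and the reduce phase, and sums them up.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}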
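To give an idea of what such wrapper classes involve, the sketch below reduces the approach to its essence: an adapter that pulls records from the executing framework's input, hands each record to map-style user code, and forwards the emitted records. Every interface in this sketch is an invented stand-in for the corresponding Nephele or Hadoop abstraction, not a real API.

import java.util.Iterator;

// Every interface below is an invented stand-in, NOT the real Nephele or
// Hadoop API; the sketch only demonstrates the adapter pattern behind the
// wrapper classes.
public class MapperAdapterSketch {

    // Stand-in for a Nephele-style task that a Task Manager invokes.
    interface Task {
        void invoke() throws Exception;
    }

    // Stand-in for an input gate delivering records to the task.
    interface InputGate<T> extends Iterator<T> {
    }

    // Stand-in for an output gate accepting the task's emitted records.
    interface OutputGate<T> {
        void emit(T record) throws Exception;
    }

    // Hadoop-flavored map signature: one input record in, any number of
    // output records out.
    interface MapFunction<IN, OUT> {
        void map(IN record, OutputGate<OUT> collector) throws Exception;
    }

    // The wrapper itself: it drives arbitrary map-style user code as a Task,
    // pulling records from the input gate and forwarding emitted records.
    static class MapperWrapper<IN, OUT> implements Task {

        private final MapFunction<IN, OUT> userCode;
        private final InputGate<IN> input;
        private final OutputGate<OUT> output;

        MapperWrapper(MapFunction<IN, OUT> userCode,
                InputGate<IN> input, OutputGate<OUT> output) {
            this.userCode = userCode;
            this.input = input;
            this.output = output;
        }

        @Override
        public void invoke() throws Exception {
            while (input.hasNext()) {
                userCode.map(input.next(), output);
            }
        }
    }
}

The actual wrapper classes must additionally reproduce Hadoop's Mapper and Reducer interfaces so that the existing word count code runs unmodified, but the control flow is the same as above.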
REFERENCES

[1] O. Kao et al., "Nephele: Efficient parallel data processing in the cloud," in Proc. MTAGS '09, 2009.
[2] Y. Zhao et al., "Falkon: A fast and light-weight task execution framework," in Proc. ACM/IEEE Conference on Supercomputing (SC '07), 2007.
[3] Y. Zhao et al., "Many-task computing for grids and supercomputers," in Proc. 2008 Workshop on Many-Task Computing on Grids and Supercomputers, 2008.
[4] S. Ghemawat et al., "MapReduce: Simplified data processing on large clusters," in Proc. OSDI, 2004.
[5] R. Pike et al., "Interpreting the data: Parallel analysis with Sawzall," Sci. Program., 2005.