Paper Information - Performance Optimization for Short MapReduce Job Execution in Hadoop

Hadoop MapReduce is a widely used parallel computing framework for solving data-intensive problems. To process large-scale datasets, the fundamental design of standard Hadoop places more emphasis on high data throughput than on job execution performance. This limits performance when Hadoop MapReduce is used to execute short jobs that require quick responses. To speed up the execution of short jobs, this paper proposes optimization methods that improve the execution performance of MapReduce jobs. We made three major optimizations: first, we reduce the time cost of the initialization and termination stages of a job by optimizing its setup and cleanup tasks; second, we replace the pull-model task assignment mechanism with a push model; third, we replace the heartbeat-based communication mechanism with an instant-message communication mechanism for event notifications between the JobTracker and TaskTrackers. Experimental results show that job execution in our improved version of Hadoop is about 23% faster on average than in standard Hadoop for our test application.
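To illustrate why the second and third optimizations matter for short jobs, the following minimal sketch compares how long a ready task waits before being handed to a worker under a pull model (the worker asks for work only on its periodic heartbeat) and under a push model (the master sends the task as soon as it is ready). This is not the paper's implementation: the function names, the 3000 ms heartbeat interval, and the arrival times are all assumptions chosen for illustration.

```python
def pull_assignment_time(arrival_ms, heartbeat_ms):
    """Assignment time when the worker polls on a fixed heartbeat (pull model).

    The worker requests work only at multiples of heartbeat_ms, so a task
    waits until the first heartbeat at or after its arrival.
    """
    # Integer ceiling division avoids floating-point rounding.
    return -(-arrival_ms // heartbeat_ms) * heartbeat_ms


def push_assignment_time(arrival_ms):
    """Assignment time when the master pushes the task immediately (push model)."""
    return arrival_ms


def total_wait(arrivals_ms, assign):
    """Sum of (assignment time - arrival time) over all tasks."""
    return sum(assign(t) - t for t in arrivals_ms)


# Hypothetical task-ready times in milliseconds and an assumed heartbeat interval.
arrivals = [100, 3200, 4500]
heartbeat = 3000

pull_wait = total_wait(arrivals, lambda t: pull_assignment_time(t, heartbeat))
push_wait = total_wait(arrivals, push_assignment_time)
print(pull_wait, push_wait)
```

In this idealized model the pull-side waiting time is bounded by one heartbeat interval per task, which is negligible for long jobs but can be a noticeable share of the runtime of a job that lasts only a few seconds. The sketch ignores network latency and scheduling overhead, which a real push mechanism would still incur.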

Rong Gu | Xiaoliang Yang | Chunfeng Yuan | Yihua Huang | Jinshuang Yan
