MRapid: An Efficient Short Job Optimizer on Hadoop

Data are being generated and collected at an accelerating pace. Hadoop has made analyzing large-scale data on commodity hardware much simpler for developers and analysts. Interestingly, it has been shown that most Hadoop jobs have small inputs and do not run for a long time. For example, higher-level query languages, such as Hive and Pig, handle a complex query by breaking it into smaller ad hoc ones. Although Hadoop is designed for handling complex queries over large data sets, we found that it is highly inefficient on small-scale data, even though a new Uber mode was introduced specifically to handle jobs with small input sizes. In this paper, we propose an optimized Hadoop extension called MRapid, which significantly speeds up the execution of short jobs. It is completely backward compatible with Hadoop and imposes negligible overhead. Our experiments on the Microsoft Azure public cloud show that MRapid can improve performance by up to 88% compared to the original Hadoop.
