Extending MapReduce across Clouds with BStream

Today, batch processing frameworks like Hadoop MapReduce are difficult to scale across multiple clouds because of the latency of inter-cloud data transfer and the synchronization overhead of the shuffle phase. As a result, the MapReduce framework cannot guarantee performance under variable load surges without over-provisioning in the internal cloud (IC). We propose BStream, a cloud bursting framework for MapReduce that couples stream processing in the external cloud (EC) with Hadoop in the IC. Stream processing in the EC enables pipelined uploading, processing, and downloading of data, which minimizes network latencies. We use this framework to meet job deadlines. BStream uses an analytical model to minimize the usage of the EC. We propose different checkpointing strategies that overlap output transfer with input transfer and processing, and that simultaneously reduce the computation involved in merging the results from the EC and the IC. Checkpointing further reduces job completion time. We experimentally compare BStream with related work and illustrate the performance benefits of stream processing and checkpointing strategies in the EC. Lastly, we characterize the operational regime of BStream.
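To make the pipelining and deadline ideas concrete, the following is a minimal illustrative sketch, not BStream's actual analytical model (which is not given here). It assumes the input is split into equal chunks with fixed per-chunk upload, processing, and download times, and that the IC processes chunks at a constant rate. All function and parameter names are hypothetical.

```python
import math


def sequential_time(n_chunks, t_up, t_proc, t_down):
    """Time when each chunk is uploaded, processed, and downloaded
    strictly one after another (no overlap between stages)."""
    return n_chunks * (t_up + t_proc + t_down)


def pipelined_time(n_chunks, t_up, t_proc, t_down):
    """Time when the three stages overlap across chunks.

    For identical chunks, the first chunk pays the full stage sum and
    every later chunk adds only the slowest (bottleneck) stage time.
    """
    if n_chunks == 0:
        return 0
    bottleneck = max(t_up, t_proc, t_down)
    return (t_up + t_proc + t_down) + (n_chunks - 1) * bottleneck


def min_chunks_to_burst(total_chunks, ic_rate, deadline):
    """Smallest number of chunks to send to the EC so that the IC,
    processing ic_rate chunks per time unit (an assumed constant),
    can finish the remaining chunks by the deadline."""
    ic_capacity = math.floor(ic_rate * deadline)
    return max(0, total_chunks - ic_capacity)


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    burst = min_chunks_to_burst(total_chunks=100, ic_rate=0.5, deadline=120)
    print("chunks sent to EC:", burst)
    print("EC time, sequential:", sequential_time(burst, 2, 3, 1))
    print("EC time, pipelined:", pipelined_time(burst, 2, 3, 1))
```

Under these assumptions, bursting only the chunks the IC cannot finish in time keeps EC usage minimal, and overlapping the stages shortens the EC's share of the job. A real model would also need to account for shuffle and merge costs, which this sketch ignores.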
