ComMapReduce: An improvement of MapReduce with lightweight communication mechanisms

As a parallel programming framework, MapReduce supports scalable, parallel processing of applications over large-scale datasets. Mappers and Reducers execute independently of each other: there is no communication among Mappers, nor among Reducers. When the final result is much smaller than the original data, processing unpromising intermediate data wastes time. We observe that simple communication mechanisms can significantly reduce this waste and thereby improve the performance of MapReduce. In this paper, we propose ComMapReduce, an efficient framework that extends and improves MapReduce for big data applications in the cloud. ComMapReduce obtains certain shared information through efficient lightweight communication mechanisms. We propose three basic communication strategies (Lazy, Eager and Hybrid) and two optimization communication strategies (Prepositive and Postpositive) to obtain the shared information and process big data applications effectively. We also illustrate the implementation of three typical applications with large-scale datasets on ComMapReduce. Our extensive experiments demonstrate that ComMapReduce outperforms MapReduce on all metrics without affecting the existing characteristics of MapReduce.
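The idea of pruning unpromising intermediate data with shared information can be illustrated with a toy top-k query. The following is a minimal single-process sketch, not the ComMapReduce implementation: the function names (`plain_map`, `filtered_map`, `run_top_k`), the use of a single shared threshold, and the sequential simulation of Mappers are all assumptions made for illustration, and the sketch does not model the Lazy, Eager, Hybrid, Prepositive or Postpositive strategies.

```python
import heapq

def plain_map(split, k):
    # Without communication, a Mapper can only emit its local top-k.
    return heapq.nlargest(k, split)

def filtered_map(split, k, shared_threshold):
    # With a shared threshold (hypothetical shared information), records
    # below it cannot belong to the global top-k and are not emitted.
    local = heapq.nlargest(k, split)
    return [x for x in local if x >= shared_threshold]

def run_top_k(splits, k, communicate):
    # Mappers are simulated one after another for simplicity.
    threshold = float("-inf")
    emitted = []
    for split in splits:
        if communicate:
            emitted.extend(filtered_map(split, k, threshold))
            local = heapq.nlargest(k, split)
            if len(local) == k:
                # The k-th largest value of any split is a safe lower
                # bound on the k-th largest value of the whole dataset.
                threshold = max(threshold, local[-1])
        else:
            emitted.extend(plain_map(split, k))
    # The "Reducer": merge whatever the Mappers emitted.
    return heapq.nlargest(k, emitted), len(emitted)

splits = [[5, 1, 9, 3], [2, 8, 7, 4], [6, 0, 10, 1]]
result_plain, n_plain = run_top_k(splits, 2, communicate=False)
result_shared, n_shared = run_top_k(splits, 2, communicate=True)
```

In this toy run both variants return the same top-2 result, while the variant with the shared threshold emits fewer intermediate records. The saving depends on the data and on how quickly a tight threshold becomes available.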
