ConMR: Concurrent MapReduce Programming Model for Large Scale Shared-Data Applications

The rapid growth of large-scale data processing has made the MapReduce programming model a widely accepted solution. However, MapReduce restricts itself to a one-map-to-one-reduce framework and lacks built-in support and optimization when input datasets are shared among concurrent applications and/or jobs. Performance can be improved when shared, frequently accessed data is read from the local file system instead of the distributed file system. To enhance the performance of big data applications, this paper presents Concurrent MapReduce (ConMR), a new programming model built on top of MapReduce that handles large amounts of shared data. Concurrent MapReduce supports processing heterogeneous sources of input datasets and offers optimizations when those datasets are partially or fully shared. Experimental evaluation shows an execution runtime speedup of 4X over a traditional non-concurrent MapReduce implementation, with a manageable time overhead.
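
To make the shared-data idea concrete, the minimal sketch below illustrates caching a shared distributed-file-system input on local disk so that concurrent jobs avoid repeated remote reads. This is an illustrative assumption, not the paper's ConMR API: the class name SharedInputCache and the local cache directory are hypothetical, and only standard Hadoop FileSystem calls (FileSystem.get, copyToLocalFile) are used.

    // Sketch only: one node-local copy of a shared dataset is materialized on
    // first access; later concurrent jobs read the local copy instead of HDFS.
    import java.io.File;
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SharedInputCache {                 // hypothetical helper name
        private final FileSystem dfs;               // distributed file system (e.g. HDFS)
        private final File localCacheDir;           // node-local directory for shared inputs

        public SharedInputCache(Configuration conf, File localCacheDir) throws IOException {
            this.dfs = FileSystem.get(conf);
            this.localCacheDir = localCacheDir;
        }

        // Returns a node-local copy of a shared dataset, fetching it from the
        // distributed file system only the first time it is requested.
        public File localize(Path sharedInput) throws IOException {
            File local = new File(localCacheDir, sharedInput.getName());
            if (!local.exists()) {
                // The first job touching this dataset pays the transfer cost once;
                // subsequent concurrent jobs read the cached local copy directly.
                dfs.copyToLocalFile(sharedInput, new Path(local.getAbsolutePath()));
            }
            return local;
        }
    }

In this reading, the speedup reported in the abstract would come from amortizing the distributed read across all concurrent jobs that share the dataset, while the "manageable time overhead" corresponds to the one-time localization step.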
