Assignment Problems of Different-Sized Inputs in MapReduce

A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers such that, for each required output, some reducer receives all the inputs that participate in computing that output. Each reducer has a capacity that limits the total size of the inputs assigned to it, yet individual inputs may differ in size. We consider, for the first time, mapping schemas in which input sizes are part of the constraints. A significant parameter to optimize in any MapReduce job is the communication cost between the map and reduce phases, which can be reduced by minimizing the number of copies of inputs sent to the reducers. The communication cost is closely related to the number of capacity-constrained reducers needed to accommodate the inputs so that every required meeting of inputs at a reducer takes place. In this work, we consider a family of problems in which each input must meet every other input in at least one reducer. We also consider a slightly different family in which each input of a list X must meet each input of another list Y in at least one reducer. We prove that finding an optimal mapping schema for these families of problems is NP-hard, and we present a bin-packing-based approximation algorithm for finding a near-optimal mapping schema.
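To make the bin-packing idea concrete, here is a minimal sketch for the all-pairs family. It is an illustration under stated assumptions, not necessarily the exact algorithm of the paper. The assumptions are: every input has size at most q/2, where q is the reducer capacity; inputs are packed into bins of capacity q/2 with First-Fit Decreasing (one common bin-packing heuristic, chosen here for simplicity); and one reducer is created for each pair of bins. The function names `first_fit_decreasing` and `all_pairs_schema` are hypothetical.

```python
from itertools import combinations


def first_fit_decreasing(sizes, bin_capacity):
    """Pack input indices into bins using the First-Fit Decreasing heuristic."""
    bins = []   # each bin is a list of input indices
    loads = []  # total size currently stored in each bin
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    for i in order:
        if sizes[i] > bin_capacity:
            # Assumption of this sketch: no input exceeds half a reducer.
            raise ValueError(f"input {i} does not fit in a bin")
        for b in range(len(bins)):
            if loads[b] + sizes[i] <= bin_capacity:
                bins[b].append(i)
                loads[b] += sizes[i]
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            loads.append(sizes[i])
    return bins


def all_pairs_schema(sizes, q):
    """Return a list of reducers, each a list of input indices.

    Any two bins of capacity q/2 fit together in one reducer of capacity q,
    so assigning every pair of bins to a reducer makes every pair of inputs
    meet at least once.
    """
    bins = first_fit_decreasing(sizes, q / 2)
    if len(bins) == 1:
        return [bins[0]]  # everything fits in a single reducer
    return [a + b for a, b in combinations(bins, 2)]


sizes = [3, 2, 2, 1, 4, 1]  # hypothetical input sizes
q = 10                      # hypothetical reducer capacity
reducers = all_pairs_schema(sizes, q)
```

In this sketch, fewer bins mean fewer reducers and fewer copies of each input, which is why the quality of the bin packing drives the communication cost. The X-to-Y family could be handled analogously by packing X and Y separately and pairing each X bin with each Y bin.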
