Optimizing Data Partitioning for Data-Parallel Computing

Performance of data-parallel computing (e.g., MapReduce, DryadLINQ) heavily depends on its data partitions. Solutions implemented by the current state of the art systems are far from optimal. Techniques proposed by the database community to find optimal data partitions are not directly applicable when complex user-defined functions and data models are involved. We outline our solution, which draws expertise from various fields such as programming languages and optimization, and present our preliminary results.

[1]  Sumit Gulwani,et al.  SPEED: precise and efficient static estimation of program computational complexity , 2009, POPL '09.

[2]  Peter J. Haas,et al.  On synopses for distinct-value estimation under multiset operations , 2007, SIGMOD '07.

[3]  Magdalena Balazinska,et al.  Skew-resistant parallel processing of feature-extracting scientific user-defined functions , 2010, SoCC '10.

[4]  Michael Isard,et al.  Distributed aggregation for data-parallel computing: interfaces and implementations , 2009, SOSP '09.

[5]  Larry Wasserman,et al.  All of Statistics , 2004 .

[6]  Moses Charikar,et al.  Finding frequent items in data streams , 2002, Theor. Comput. Sci..

[7]  David J. DeWitt,et al.  Practical Skew Handling in Parallel Joins , 1992, VLDB.

[8]  Michael Isard,et al.  Distributed data-parallel computing using a high-level programming language , 2009, SIGMOD Conference.

[9]  Abdelkader Hameurlain,et al.  A Cost Evaluator for Parallel Database Systems , 1995, DEXA.

[10]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[11]  Linda M. Wills,et al.  Extracting an explicitly data-parallel representation of image-processing programs , 2003, 10th Working Conference on Reverse Engineering, 2003. WCRE 2003. Proceedings..

[12]  Yao Zhao,et al.  BotGraph: Large Scale Spamming Botnet Detection , 2009, NSDI.

[13]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[14]  Michael Stonebraker,et al.  A comparison of approaches to large-scale data analysis , 2009, SIGMOD Conference.

[15]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[16]  Simon Goldsmith,et al.  Measuring empirical computational complexity , 2007, ESEC-FSE '07.

[17]  Reidar Conradi,et al.  A data parallel programming model based on distributed objects , 2002, Proceedings. IEEE International Conference on Cluster Computing.

[18]  Mahadev Satyanarayanan,et al.  Predictive Resource Management for Wearable Computing , 2003, MobiSys '03.

[19]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[20]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[21]  Rajeev Motwani,et al.  Random sampling for histogram construction: how much is enough? , 1998, SIGMOD '98.

[22]  Cameron David Rose,et al.  The Hive Project , 2011 .