Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE

To minimize the amount of data-shuffling I/O that occurs between the pipeline stages of a distributed data-parallel program, its procedural code must be optimized with full awareness of the pipeline in which it executes. Unfortunately, neither pipeline optimizers nor traditional compilers examine both the pipeline and the procedural code of a data-parallel program, so programmers must either hand-optimize their programs across pipeline stages or live with poor performance. To resolve this tension between performance and programmability, this paper describes PeriSCOPE, which automatically optimizes a data-parallel program's procedural code in the context of data flow reconstructed from the program's pipeline topology. These optimizations eliminate unnecessary code and data, perform early data filtering, and compute small derived values (e.g., predicates) earlier in the pipeline, so that less data, sometimes far less, is transferred between pipeline stages. PeriSCOPE further leverages symbolic execution to eliminate dead code and thereby enlarge the scope of these optimizations. We describe how PeriSCOPE is implemented and evaluate its effectiveness on real production jobs.
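
To make the flavor of these rewrites concrete, the sketch below illustrates early filtering and column reduction on a generic map/shuffle/reduce pipeline. It is written in plain Python rather than SCOPE, the row schema, threshold, and mapper/reducer names are hypothetical, and it does not reproduce PeriSCOPE's actual transformations; it only shows the kind of cross-stage rewrite the abstract describes.

```python
# A minimal sketch (assumed names and schema) of the kind of optimization
# PeriSCOPE targets: moving a filter and a small derived value upstream of
# the shuffle so less data crosses the stage boundary.

# Before: the reducer discards rows whose score is below a threshold, so
# filtered-out rows (and all their columns) are shuffled only to be dropped.
def mapper_before(row):
    # emit (key, full row) -- everything crosses the shuffle boundary
    yield row["user"], row

def reducer_before(key, rows):
    total = sum(r["score"] for r in rows if r["score"] > 10)
    yield key, total

# After: the predicate and the small derived value are computed in the
# mapper, so only qualifying rows -- and only the one column the reducer
# needs -- are shuffled between stages.
def mapper_after(row):
    if row["score"] > 10:                  # early filtering
        yield row["user"], row["score"]    # column reduction: ship a scalar, not the row

def reducer_after(key, scores):
    yield key, sum(scores)
```

The point of the rewrite is that only data the downstream stage actually consumes crosses the shuffle boundary; PeriSCOPE's contribution is deriving such rewrites automatically from the pipeline topology and the procedural code, rather than requiring them by hand.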
