Reuse-based Optimization for Pig Latin

Pig Latin is a popular language that is widely used for parallel processing of massive data sets. Currently, a subexpression that occurs repeatedly in Pig Latin scripts is executed as many times as it appears, and the current Pig Latin optimizer does not identify reuse opportunities. We present a novel optimization approach that identifies and reuses repeated subexpressions in Pig Latin scripts. Our optimization algorithm, named PigReuse, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and reuses their results as needed in order to compute exactly the same output as the original scripts. Our experiments demonstrate the effectiveness of our approach.
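
To make the idea of identifying repeated subexpressions concrete, the following is a minimal Python sketch and not the PigReuse algorithm itself. It assumes, purely for illustration, that each script is an operator tree encoded as a nested tuple, that two subexpressions are equal only when they are syntactically identical, and that cost is simply the number of operator nodes. The helper names (`subexpressions`, `size`, `reuse_candidates`) are hypothetical.

```python
from collections import Counter


def subexpressions(expr):
    """Yield expr and every nested subexpression of an operator tree."""
    yield expr
    for child in expr[1:]:
        if isinstance(child, tuple):  # plain strings are operator arguments
            yield from subexpressions(child)


def size(expr):
    """Toy cost (an assumption): number of operator nodes in the tree."""
    return sum(1 for _ in subexpressions(expr))


def reuse_candidates(scripts):
    """Map each subexpression occurring more than once to an estimated saving.

    The saving is (occurrences - 1) * cost: the work avoided if the
    subexpression is computed once and its result is reused.
    """
    counts = Counter(sub for script in scripts for sub in subexpressions(script))
    return {sub: (n - 1) * size(sub) for sub, n in counts.items() if n > 1}


# Two hypothetical scripts sharing the same FILTER over the same LOAD.
load_a = ("LOAD", "a")
filt = ("FILTER", "x > 1", load_a)
scripts = [
    ("GROUP", "k", filt),
    ("JOIN", "k", filt, ("LOAD", "b")),
]

candidates = reuse_candidates(scripts)
best = max(candidates, key=candidates.get)  # the shared FILTER subtree
```

Here the shared `FILTER` subtree is the candidate with the largest estimated saving. A realistic optimizer would also need to recognize equivalent but differently written subexpressions and to use a cost model that reflects actual execution, neither of which this sketch attempts.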
