A study of partitioning and parallel UDF execution with the SAP HANA database

Large-scale data analysis relies on custom code both for preparing the data and for the core analysis algorithms. The map-reduce framework offers a simple model for parallelizing custom code, but it does not integrate well with relational databases. Likewise, the literature on query optimization in relational databases has largely ignored user-defined functions (UDFs). In this paper, we discuss annotations for UDFs that facilitate optimizations that consider both relational operators and UDFs. We focus on optimizations that enable the parallel execution of relational operators and UDFs for a number of typical patterns. A study on real-world data investigates the opportunities for parallelizing complex data flows that contain both relational operators and UDFs.
