SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions

A user-defined function (UDF) is a powerful database feature that allows users to customize database functionality. Though useful, current UDFs have numerous limitations, including install-time specification of input and output schemas and a poor ability to parallelize execution. We present a new approach to implementing a UDF, which we call SQL/MapReduce (SQL/MR), that overcomes many of these limitations. We leverage ideas from the MapReduce programming paradigm to provide users with a straightforward API through which they can implement a UDF in the language of their choice. Moreover, our approach allows maximum flexibility because the output schema of the UDF is specified by the function itself at query plan-time. This means that a SQL/MR function is polymorphic: it can process arbitrary input because its behavior and its output schema are determined dynamically by information available at query plan-time, such as the function's input schema and arbitrary user-provided parameters. This also increases reusability, as the same SQL/MR function can be used on inputs with many different schemas or with different user-specified parameters. In this paper we describe the motivation for this new approach to UDFs as well as its implementation within Aster Data Systems' nCluster database. We demonstrate that, in the context of massively parallel, shared-nothing database systems, this model of computation facilitates highly scalable computation within the database. We also include examples of new applications that take advantage of this novel UDF framework.
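The idea of a function that fixes its own output schema at plan time can be illustrated with a minimal Python sketch. This is not the SQL/MR API of nCluster: the class `Tokenize`, the method `operate_on_partition`, the helper `run`, and the schema representation are all hypothetical names chosen for illustration. The sketch assumes only what the abstract states, namely that the function sees its input schema and user parameters before execution and that partitions can be processed independently.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Schema = List[Tuple[str, str]]  # (column name, type name); an assumed representation
Row = Dict[str, object]


class Tokenize:
    """Hypothetical polymorphic function: splits one text column into tokens."""

    def __init__(self, input_schema: Schema, params: Dict[str, str]):
        # "Plan time": inspect the input schema and the user parameters,
        # then decide the output schema before any row is read.
        self.column = params["column"]
        if self.column not in [name for name, _ in input_schema]:
            raise ValueError(f"unknown column: {self.column}")
        kept = [(n, t) for n, t in input_schema if n != self.column]
        self.output_schema: Schema = kept + [("token", "text")]

    def operate_on_partition(self, rows: Iterable[Row]) -> Iterator[Row]:
        # "Run time": a partition is processed without reference to any
        # other partition, so partitions could run on different workers.
        for row in rows:
            rest = {k: v for k, v in row.items() if k != self.column}
            for token in str(row[self.column]).split():
                yield {**rest, "token": token}


def run(fn: Tokenize, partitions: List[List[Row]]) -> List[Row]:
    # Sequential stand-in for a parallel executor (an assumption of this sketch).
    return [out for part in partitions for out in fn.operate_on_partition(part)]


# The same function class applied to two inputs with different schemas.
fn = Tokenize([("doc_id", "int"), ("body", "text")], {"column": "body"})
partitions = [[{"doc_id": 1, "body": "a b"}], [{"doc_id": 2, "body": "c"}]]
result = run(fn, partitions)

other = Tokenize([("title", "text"), ("lang", "text")], {"column": "title"})
```

Here `fn.output_schema` and `other.output_schema` differ even though the function code is identical, which is the sense in which such a function is polymorphic and reusable across inputs.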
