Running analytics computation inside database engines through User Defined Functions (UDFs) has been extensively investigated but has not yet become a scalable approach, owing to two major limitations. First, existing UDFs are not relation-in, relation-out and schema-aware; as a result, they cannot model complex applications and cannot be composed with relational operators in a SQL query. Second, programming UDFs to interact efficiently with query processing is difficult, because it requires hard-to-acquire system knowledge beyond analytics expertise. These limitations keep most users from adopting UDFs for their analytics applications.
To solve these problems, we extend UDF technology in both the semantic and the system dimensions. We first expand our investigation of Relation Valued Functions (RVFs), with the goal of integrating RVF execution tightly with query processing while freeing RVF developers from DBMS internal details. We separate an RVF into two parts: an RVF-shell that contains the system utilities, and a user-function that contains application logic only. We provide focused system support based on the notion of invocation pattern, and we develop a mechanism for generating an RVF-shell automatically from the schemas of its argument and return relations, the well-understood invocation pattern, and the common data conversion protocol. A complete RVF is made by plugging the user-function into the RVF-shell.
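The shell/user-function split can be illustrated with a minimal Python sketch. It is not the prototype's code: the helper `make_rvf_shell`, the column-name-only schemas, and the single "whole relation in, whole relation out" call are assumptions made for illustration, and the real invocation patterns and conversion protocol live inside the DBMS.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple
Schema = Sequence[str]  # column names only; column types are omitted for brevity


def make_rvf_shell(arg_schemas: List[Schema], return_schema: Schema,
                   user_function: Callable[..., List[Dict]]) -> Callable[..., List[Row]]:
    """Hypothetical shell generator: wraps a user-function with schema-driven
    conversion so the user-function never touches the engine's row format."""

    def to_records(rows, schema):
        # Convert engine-style tuples into schema-named records.
        records = []
        for row in rows:
            if len(row) != len(schema):
                raise ValueError(f"row {row!r} does not match schema {list(schema)}")
            records.append(dict(zip(schema, row)))
        return records

    def shell(*relations):
        if len(relations) != len(arg_schemas):
            raise TypeError("wrong number of argument relations")
        args = [to_records(rel, sch) for rel, sch in zip(relations, arg_schemas)]
        result = user_function(*args)  # application logic only
        out = []
        for rec in result:
            if set(rec) != set(return_schema):
                raise ValueError(f"record {rec!r} does not match return schema")
            # Convert back to tuples in the declared column order.
            out.append(tuple(rec[col] for col in return_schema))
        return out

    return shell


def total_per_customer(orders):
    """Example user-function: pure application logic over named records."""
    totals = {}
    for order in orders:
        totals[order["customer"]] = totals.get(order["customer"], 0) + order["amount"]
    return [{"customer": c, "total": t} for c, t in totals.items()]


rvf = make_rvf_shell([("customer", "amount")], ("customer", "total"), total_per_customer)
result = rvf([("ann", 10), ("bob", 5), ("ann", 7)])
print(result)  # [('ann', 17), ('bob', 5)]
```

In this sketch the user-function sees only named records, while everything derived from the schemas (arity checks, conversion, column ordering) sits in the generated wrapper, which is the division of labor the RVF-shell is meant to provide.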
We have prototyped the proposed approach on the open-source database engine Postgres. Our experience reveals its advantages in integrating UDFs tightly with the query executor while relieving analytics users from dealing with system details, a fundamental data engineering requirement for making UDF technology practically usable in converging data-intensive analytics and data management.