Building statistical models and scoring with UDFs

Multidimensional statistical models are generally computed outside a relational DBMS, exporting data sets. This article explains how fundamental multidimensional statistical models are computed inside the DBMS in a single table scan exploiting SQL and User-Defined Functions (UDFs). The techniques described herein are used in a commercial data mining tool, called Teradata Warehouse Miner. Specifically, we explain how correlation, linear regression, PCA and clustering, are integrated into the Teradata DBMS. Two major database processing tasks are discussed: building a model and scoring a data set based on a model. To build a model two summary matrices are shown to be common and essential for all linear models: the linear sum of points and the quadratic sum of cross-products of points. Since such matrices are generally significantly smaller than the data set, we explain how the remaining matrix operations to build the model can be quickly performed outside the DBMS. We first explain how to efficiently compute summary matrices with plain SQL queries. Then we present two sets of UDFs that work in a single table scan: an aggregate UDF to compute summary matrices and a set of scalar UDFs to score data sets. Experiments compare UDFs and SQL queries (running inside the DBMS) with C++ (running outside on exported files). In general, UDFs are faster than SQL queries and UDFs are more efficient than C++, due to long export times. Statistical models based on the summary matrices can be built outside the DBMS in just a few seconds. Aggregate and scalar UDFs scale linearly and require only one table scan, making them ideal to process large data sets.

[1]  Paul S. Bradley,et al.  Scaling Clustering Algorithms to Large Databases , 1998, KDD.

[2]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[3]  Surajit Chaudhuri,et al.  An overview of query optimization in relational systems , 1998, PODS.

[4]  Giuseppe Psaila,et al.  An Extension to SQL for Mining Association Rules , 1998, Data Mining and Knowledge Discovery.

[5]  Carlos Ordonez,et al.  Vector and matrix operations programmed with UDFs in a relational DBMS , 2006, CIKM '06.

[6]  Bernhard Mitschang,et al.  On parallel processing of aggregate and scalar functions in object-relational DBMS , 1998, SIGMOD '98.

[7]  Kai-Uwe Sattler,et al.  SQL database primitives for decision tree classifiers , 2001, CIKM '01.

[8]  Surajit Chaudhuri,et al.  On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases , 1998, KDD.

[9]  Ramez Elmasri,et al.  Fundamentals of Database Systems , 1989 .

[10]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[11]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[12]  Michael L. Heytens,et al.  NonStop SQL/MX primitives for knowledge discovery , 1999, KDD '99.

[13]  Carlos Ordonez,et al.  SQLEM: fast clustering in SQL using the EM algorithm , 2000, SIGMOD '00.

[14]  Surajit Chaudhuri,et al.  Integrating data mining with SQL databases: OLE DB for data mining , 2001, Proceedings 17th International Conference on Data Engineering.

[15]  Abhinav Gupta,et al.  Spreadsheets in RDBMS for OLAP , 2003, SIGMOD '03.

[16]  Carlos Ordonez,et al.  Integrating K-means clustering with a relational DBMS using SQL , 2006, IEEE Transactions on Knowledge and Data Engineering.

[17]  Sunita Sarawagi,et al.  Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications , 1998, SIGMOD '98.

[18]  Wolfgang Lehner,et al.  COMBI-Operator: Database Support for Data Mining Applications , 2003, VLDB.

[19]  Johannes Gehrke,et al.  BOAT—optimistic decision tree construction , 1999, SIGMOD '99.

[20]  Ahmad Ghazal,et al.  Outer Join Elimination in the Teradata RDBMS , 2004, DEXA.

[21]  Zoubin Ghahramani,et al.  A Unifying Review of Linear Gaussian Models , 1999, Neural Computation.

[22]  Eric R. Ziegel,et al.  The Elements of Statistical Learning , 2003, Technometrics.