Efficient Distance Computation Using SQL Queries and UDFs

Distance computation is one of the most computationally intensive operations employed by many data mining algorithms. Performing such matrix computations within a DBMS creates many optimization challenges. We propose techniques to efficiently compute Euclidean distance using SQL queries and user-defined functions (UDFs). We concentrate on efficient Euclidean distance computation for the well-known K-means clustering algorithm. We present SQL query optimizations and a scalar UDF to compute Euclidean distance. We experimentally evaluate performance and scalability of our proposed SQL queries and UDF with large data sets on a modern DBMS. We benchmark distance computation on two important data mining techniques: clustering and classification. In general, UDFs are faster than SQL queries because they are executed in main memory. Data set size is the main factor impacting performance, followed by data set dimensionality.

[1]  U. Fayyad,et al.  Scaling EM (Expectation Maximization) Clustering to Large Databases , 1998 .

[2]  Carlo Zaniolo,et al.  ATLAS: A Small but Complete SQL Extension for Data Mining and Data Streams , 2003, VLDB.

[3]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[4]  Surajit Chaudhuri,et al.  Integrating data mining with SQL databases: OLE DB for data mining , 2001, Proceedings 17th International Conference on Data Engineering.

[5]  Carlos Ordonez,et al.  Vector and matrix operations programmed with UDFs in a relational DBMS , 2006, CIKM '06.

[6]  Bernhard Mitschang,et al.  On parallel processing of aggregate and scalar functions in object-relational DBMS , 1998, SIGMOD '98.

[7]  Kai-Uwe Sattler,et al.  SQL database primitives for decision tree classifiers , 2001, CIKM '01.

[8]  Carlos Ordonez Programming the K-means clustering algorithm in SQL , 2004, KDD '04.

[9]  Zoubin Ghahramani,et al.  A Unifying Review of Linear Gaussian Models , 1999, Neural Computation.

[10]  Michael L. Heytens,et al.  NonStop SQL/MX primitives for knowledge discovery , 1999, KDD '99.

[11]  Carlo Zaniolo,et al.  User defined aggregates in object-relational systems , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[12]  Carlos Ordonez,et al.  SQLEM: fast clustering in SQL using the EM algorithm , 2000, SIGMOD '00.

[13]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[14]  Surajit Chaudhuri,et al.  An overview of query optimization in relational systems , 1998, PODS.

[15]  Sunita Sarawagi,et al.  Integrating association rule mining with relational database systems: alternatives and implications , 1998, SIGMOD '98.

[16]  Carlos Ordonez,et al.  Integrating K-means clustering with a relational DBMS using SQL , 2006, IEEE Transactions on Knowledge and Data Engineering.

[17]  Carlos Ordonez Building statistical models and scoring with UDFs , 2007, SIGMOD '07.