Architectures and optimizations for integrating data mining algorithms with database systems
Data mining on large data warehouses is becoming increasingly important. In support of this trend, we consider a spectrum of architectural alternatives for integrating mining with database systems. These alternatives include loose coupling through a SQL cursor interface; encapsulation of the mining algorithm in a stored procedure; caching the data to a file system on the fly and mining it there (Cache-Mine); tight coupling using primarily user-defined functions; and SQL implementations for processing in the DBMS. First, we comprehensively study the option of expressing the association rule mining algorithm in the form of SQL queries. We consider four options in SQL-92 and six options in SQL enhanced with object-relational extensions (SQL-OR). Our evaluation of the different architectural alternatives shows that, from a performance perspective, the Cache-Mine option is superior, although the SQL-OR option comes a close second. Both the Cache-Mine and the SQL-OR approaches incur a higher storage penalty than the loose-coupling approach, which performance-wise is a factor of 3 to 4 worse than Cache-Mine. We also compare these alternatives on the basis of qualitative factors such as automatic parallelization, ease of development, portability, and interoperability.
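To make the idea of expressing association rule mining as SQL queries concrete, the following is a minimal sketch of join-based support counting for 2-itemsets. It is an illustration under assumptions, not one of the specific SQL-92 or SQL-OR formulations studied here: the table names `T` and `F1`, the helper `frequent_pairs`, and the use of SQLite through Python's `sqlite3` module are all hypothetical choices made for the example.

```python
import sqlite3


def frequent_pairs(transactions, min_support):
    """Count 2-itemsets with plain SQL (join, GROUP BY, HAVING).

    transactions: dict mapping a transaction id to an iterable of items.
    min_support:  minimum number of transactions an itemset must appear in.
    Returns a dict {(item1, item2): support} with item1 < item2.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema: one row per (transaction id, item).
        conn.execute("CREATE TABLE T (tid INTEGER, item TEXT)")
        conn.executemany(
            "INSERT INTO T VALUES (?, ?)",
            [(tid, item) for tid, items in transactions.items() for item in set(items)],
        )

        # Pass 1: frequent single items.
        conn.execute("CREATE TABLE F1 (item TEXT)")
        conn.execute(
            "INSERT INTO F1 SELECT item FROM T GROUP BY item HAVING COUNT(*) >= ?",
            (min_support,),
        )

        # Pass 2: self-join T on tid to enumerate item pairs per transaction,
        # keep only pairs whose members are both frequent, then count support.
        rows = conn.execute(
            """
            SELECT a.item, b.item, COUNT(*) AS support
            FROM T AS a
            JOIN T AS b ON a.tid = b.tid AND a.item < b.item
            WHERE a.item IN (SELECT item FROM F1)
              AND b.item IN (SELECT item FROM F1)
            GROUP BY a.item, b.item
            HAVING COUNT(*) >= ?
            """,
            (min_support,),
        ).fetchall()
        return {(i1, i2): support for i1, i2, support in rows}
    finally:
        conn.close()
```

Calling `frequent_pairs({1: ["bread", "milk"], 2: ["bread", "milk", "eggs"]}, 2)` counts pairs entirely inside the database engine, so only the final frequent itemsets cross the SQL interface. Larger itemsets would need further joins, which is where the cost of a pure SQL approach tends to grow.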
We further analyze the SQL-92 approaches with the twin goals of studying how a DBMS without any object-relational extensions can best execute these queries and identifying ways of incorporating the semantics of mining into cost-based query optimizers. We develop cost formulae for the mining queries based on the input data parameters and relational operator costs. We also identify certain optimizations that improve performance. Next, we study generalized association rule and sequential pattern mining and develop SQL formulations for them, thereby demonstrating that more complex mining operations can be handled in the SQL framework.
We develop an incremental association rule mining algorithm that does not need to examine the old data if the frequent itemsets do not change. Even when they do change, access to the old database can be limited to just one scan. We categorize the various kinds of constraints on the items that are useful in the context of interactive mining to facilitate goal-oriented mining. We show how the incremental mining technique can be adapted to handle constraints and certain kinds of constraint relaxation. We also show the applicability of the incremental algorithm to other classes of data mining and decision support problems. Finally, we identify certain primitive operators that are useful for a large class of data mining and decision support applications. Supporting them natively in the DBMS could enable these applications to run faster.
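The incremental idea can be sketched as follows. The abstract does not say how the algorithm decides that the old data can be skipped, so this sketch assumes one plausible mechanism: counts are kept for the frequent itemsets plus a set of tracked infrequent candidates, and only the increment is scanned to update them. The function name `apply_increment`, its parameters, and the "promoted candidate" test are hypothetical and are not taken from the thesis.

```python
from math import ceil


def apply_increment(tracked, old_size, increment, min_frac):
    """Update itemset counts using only the newly added transactions.

    tracked:   dict {frozenset(items): count in the old data}, covering the old
               frequent itemsets and some tracked infrequent candidates (assumed).
    old_size:  number of transactions in the old data.
    increment: list of new transactions, each an iterable of items.
    min_frac:  minimum support as a fraction of all transactions.
    Returns (updated_counts, frequent_itemsets, needs_old_scan).
    """
    old_threshold = ceil(min_frac * old_size)
    new_threshold = ceil(min_frac * (old_size + len(increment)))

    # Scan only the increment; the old data is never touched here.
    updated = dict(tracked)
    for txn in increment:
        items = set(txn)
        for itemset in updated:
            if itemset <= items:
                updated[itemset] += 1

    frequent = {s for s, count in updated.items() if count >= new_threshold}

    # If a previously infrequent candidate became frequent, new candidates may
    # exist whose old-data counts are unknown, so one scan of the old data
    # would be needed (assumed criterion).
    promoted = {s for s in frequent if tracked[s] < old_threshold}
    return updated, frequent, bool(promoted)
```

When `needs_old_scan` is false, every itemset that can be frequent in the combined data already has an exact count, so the result is available without revisiting the old database. When it is true, a caller would generate the new candidates and count them in a single pass over the old data.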