Set-Oriented Data Mining in Relational Databases

Data mining is an important real-life application for businesses, and finding efficient ways of mining large data sets is critical. To benefit from the experience with relational databases, a set-oriented approach to mining data is needed. In such an approach, the data mining operations are expressed in terms of relational or set-oriented operations, so that query optimization technology can be used for efficient processing. In this paper, we describe set-oriented algorithms for mining association rules. Such algorithms require multiple joins and thus may appear to be inherently less efficient than special-purpose algorithms. We develop new algorithms that can be expressed as SQL queries, and discuss optimization of these algorithms. After analytical evaluation, an algorithm named SETM emerges as the algorithm of choice. Algorithm SETM uses only simple database primitives, viz., sorting and merge-scan join. It is simple, fast, and stable over the range of parameter values. It is easily parallelized, and we suggest several additional optimizations. The set-oriented nature of Algorithm SETM makes it possible to develop extensions easily, and its performance makes it feasible to build interactive data mining tools for large databases.
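The abstract does not spell out the steps of Algorithm SETM, so the following is only an illustrative sketch of how frequent item patterns might be counted with nothing but sorting and a merge-scan join on the transaction identifier. The table layout `(trans_id, item)`, the names `merge_join_extend` and `frequent_itemsets`, and the `min_support` parameter are assumptions made for this example and are not taken from the paper.

```python
from itertools import groupby


def merge_join_extend(rk, r1):
    """Merge-scan join of two lists that are both sorted on trans_id.

    rk rows are (trans_id, items_tuple); r1 rows are (trans_id, item).
    A pattern is extended only with items greater than its last item,
    so each item combination is generated once per transaction.
    """
    out = []
    i = j = 0
    while i < len(rk) and j < len(r1):
        tid_k, tid_1 = rk[i][0], r1[j][0]
        if tid_k < tid_1:
            i += 1
        elif tid_k > tid_1:
            j += 1
        else:
            # Find the extent of this trans_id group in both inputs.
            i_end = i
            while i_end < len(rk) and rk[i_end][0] == tid_k:
                i_end += 1
            j_end = j
            while j_end < len(r1) and r1[j_end][0] == tid_k:
                j_end += 1
            for _, items in rk[i:i_end]:
                for _, item in r1[j:j_end]:
                    if item > items[-1]:
                        out.append((tid_k, items + (item,)))
            i, j = i_end, j_end
    return out


def frequent_itemsets(sales, min_support):
    """Return {item_pattern: count} for patterns in >= min_support transactions.

    sales is an iterable of hypothetical (trans_id, item) rows.
    """
    r1 = sorted(set(sales))                      # base relation, sorted on trans_id
    rk = [(tid, (item,)) for tid, item in r1]    # patterns of length 1
    result = {}
    while rk:
        # Sort on the item pattern, then count each run of equal patterns.
        patterns = sorted(items for _, items in rk)
        frequent = {}
        for pattern, run in groupby(patterns):
            count = sum(1 for _ in run)
            if count >= min_support:
                frequent[pattern] = count
        result.update(frequent)
        # Keep supported rows, re-sort on trans_id, and extend by one item.
        rk = sorted(row for row in rk if row[1] in frequent)
        rk = merge_join_extend(rk, r1)
    return result
```

For instance, calling `frequent_itemsets([(1, "A"), (1, "B"), (2, "A"), (2, "B"), (3, "A")], 2)` counts `("A",)` three times and `("B",)` and `("A", "B")` twice each. Every step in the loop is a sort, a grouped count, a filter, or a join on `trans_id`, which is the kind of operation that can also be phrased as an SQL query and handed to a query optimizer.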
