Efficient evaluation of queries with mining predicates

Modern relational database systems are beginning to support ad-hoc queries on data mining models. In this paper, we explore novel techniques for optimizing queries that apply mining models to relational data. For such queries, we use the internal structure of the mining model to automatically derive traditional database predicates. We present algorithms for deriving such predicates for some popular discrete mining models: decision trees, naive Bayes, and clustering. Our experiments on Microsoft SQL Server 2000 demonstrate that these derived predicates can significantly reduce the cost of evaluating such queries.
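To make the idea of a derived predicate concrete, the following is a minimal sketch for the decision-tree case only. It is not the algorithm from the paper: the `Leaf` and `Split` classes, the `derive_predicate` function, and the `age` and `income` columns are all hypothetical names chosen for illustration. The sketch assumes a binary tree with numeric threshold splits and turns a mining predicate of the form "predicted class = label" into an ordinary SQL condition by taking the disjunction of the root-to-leaf paths that end in that label.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    label: str  # class predicted at this leaf


@dataclass
class Split:
    column: str        # hypothetical column name tested at this node
    threshold: float   # rows with column <= threshold go left
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def paths_to_label(node: Node, label: str,
                   conds: Tuple[str, ...] = ()) -> List[List[str]]:
    """Collect the conditions along every root-to-leaf path ending in `label`."""
    if isinstance(node, Leaf):
        return [list(conds)] if node.label == label else []
    go_left = conds + (f"{node.column} <= {node.threshold}",)
    go_right = conds + (f"{node.column} > {node.threshold}",)
    return (paths_to_label(node.left, label, go_left)
            + paths_to_label(node.right, label, go_right))


def derive_predicate(tree: Node, label: str) -> str:
    """Return a SQL condition that holds exactly for rows the tree maps to `label`."""
    paths = paths_to_label(tree, label)
    if not paths:
        return "1 = 0"  # the tree never predicts this label
    disjuncts = []
    for path in paths:
        # A single-leaf tree yields an empty path, i.e. every row qualifies.
        disjuncts.append("(" + " AND ".join(path) + ")" if path else "1 = 1")
    return " OR ".join(disjuncts)


# Illustrative tree (assumed, not taken from the paper's experiments).
tree = Split("age", 30,
             Leaf("low"),
             Split("income", 50000, Leaf("low"), Leaf("high")))

high_predicate = derive_predicate(tree, "high")
low_predicate = derive_predicate(tree, "low")
print(high_predicate)
print(low_predicate)
```

A condition derived this way could be added to the WHERE clause alongside the original mining predicate, which would let the optimizer consider index-based access paths before the model is applied to the remaining rows. For models such as naive Bayes and clustering, the abstract indicates that separate derivation algorithms are needed; this sketch does not cover them.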
