Identifying Relevant Databases for Multidatabase Mining

Many tools and systems for knowledge discovery and data mining have been developed and are available for applications. However, when practitioners face a large collection of databases, an immediate question is where to start mining. In this paper, breaking away from the conventional data mining assumption that many databases are joined into one, we argue that the first step in multidatabase mining is to identify the databases most likely to be relevant to an application; without this step, the mining process can be lengthy, aimless, and ineffective. We therefore propose a relevance measure for identifying relevant databases for mining tasks whose objective is to find patterns or regularities about certain attributes. An efficient implementation for identifying relevant databases is described. Experiments are conducted to validate the measure's performance and to show its promising applications.
