Efficient discovery of interesting statements in databases

The Explora system supports discovery in databases by large-scale search for interesting instances of statistical patterns. In this paper we describe how Explora assesses interestingness and achieves computational efficiency. Both problems arise from the variety of patterns and the immense number of instances that can be generated when relations between variables are studied in subsets of data. First, the user must be protected from a deluge of findings. To restrict the search to the analysis goals, the user can focus each discovery task performed during an interactive and iterative exploration process. Some basic principles of search organization further limit the search effort. One principle is to organize the search hierarchically and to evaluate the statistical or information-theoretic evidence of the general hypotheses first. More special hypotheses can then be eliminated from further search if a more general hypothesis has already been verified. This approach alone has drawbacks, however, and even in moderately sized data it does not prevent large sets of findings. Therefore, in a second evaluation phase, further aspects of interestingness are assessed, and a refinement strategy selects the most interesting of the statistically significant statements. The second problem for discovery systems is efficiency. Each hypothesis evaluation requires many data accesses. We describe strategies that reduce data accesses and speed up computation.
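The hierarchical search principle can be illustrated with a minimal sketch. It is not Explora's implementation: the function names, the record layout (a list of dictionaries with a boolean target field), the z-score used as a stand-in for "statistical evidence", and the threshold of 2.0 are all assumptions made for this example. Hypotheses are conjunctions of attribute=value conditions; general hypotheses (fewer conditions) are evaluated first, and any specialization of an already verified hypothesis is skipped.

```python
from itertools import combinations
from math import sqrt


def z_score(subgroup, p0, target):
    """Deviation of the target share in a subgroup from the overall share p0.

    Assumed evidence measure for this sketch (binomial z-score).
    """
    n = len(subgroup)
    if n == 0 or p0 in (0.0, 1.0):
        return 0.0
    p = sum(1 for r in subgroup if r[target]) / n
    return (p - p0) * sqrt(n) / sqrt(p0 * (1.0 - p0))


def general_to_specific_search(records, attributes, target,
                               threshold=2.0, max_depth=2):
    """Evaluate general hypotheses first; prune specializations of verified ones.

    Returns a list of (conditions, z) pairs, where conditions is a frozenset
    of (attribute, value) pairs.
    """
    p0 = sum(1 for r in records if r[target]) / len(records)
    verified = []   # condition sets already accepted as findings
    findings = []
    for depth in range(1, max_depth + 1):
        for attrs in combinations(attributes, depth):
            # Only value combinations that actually occur in the data.
            combos = sorted({tuple(r[a] for a in attrs) for r in records})
            for values in combos:
                conds = frozenset(zip(attrs, values))
                # Skip if a strictly more general hypothesis was verified.
                if any(general < conds for general in verified):
                    continue
                subgroup = [r for r in records
                            if all(r[a] == v for a, v in conds)]
                z = z_score(subgroup, p0, target)
                if abs(z) >= threshold:
                    verified.append(conds)
                    findings.append((conds, z))
    return findings
```

In this sketch the pruning test is a proper-subset check on condition sets, so a verified one-condition hypothesis removes all of its two-condition specializations from the search. As the abstract notes, pruning of this kind does not by itself keep the set of findings small, which is why a second evaluation phase is needed.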
