Ranking the Interestingness of Summaries from Data Mining Systems

We study data mining where the task is description by summarization, the representation language is generalized relations, the evaluation criteria are based on heuristic measures of interestingness, and the search method is the Multi-Attribute Generalization algorithm for domain generalization graphs. We present and empirically compare four heuristics for ranking the interestingness of generalized relations (or summaries). The heuristics are based on common measures of the diversity of a population: statistical variance, the Simpson index, and the Shannon index. All four heuristics rank less complex summaries (i.e., those with few tuples and/or non-ANY attributes) as most interesting. Highly ranked summaries provide a reasonable starting point for further analysis of discovered knowledge.
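
As a rough illustration of the diversity measures named above, the following sketch computes textbook versions of the variance, the Simpson index, and the Shannon index over the tuple counts of a summary. The function names, the use of tuple counts as input, and the exact normalizations are assumptions made for this example; the abstract does not give the formulas of the four heuristics, so this should not be read as the authors' definitions.

```python
import math
import statistics


def proportions(counts):
    """Convert raw tuple counts into proportions that sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def variance_of_proportions(counts):
    """Population variance of the proportions (assumed normalization).

    The mean proportion is always 1/m for m tuples, so this measures how
    far the summary is from a uniform distribution.
    """
    return statistics.pvariance(proportions(counts))


def simpson_index(counts):
    """Simpson index: the sum of squared proportions."""
    return sum(p * p for p in proportions(counts))


def shannon_index(counts):
    """Shannon index (entropy) in bits; zero proportions contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions(counts) if p > 0)


if __name__ == "__main__":
    # Hypothetical summaries, each given as a list of tuple counts.
    for counts in ([5, 5], [9, 1]):
        print(
            counts,
            variance_of_proportions(counts),
            simpson_index(counts),
            shannon_index(counts),
        )
```

For a two-tuple summary with equal counts, the variance is 0, the Simpson index is 0.5, and the Shannon index is 1 bit; a more skewed summary has a higher variance and Simpson index and a lower Shannon index. How these values are turned into a ranking is left open here, since that depends on the heuristics defined in the paper itself.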
