Heuristic Measures of Interestingness

The tuples in a generalized relation (i.e., a summary generated from a database) are unique and can therefore be considered a population whose structure can be described by some probability distribution. In this paper, we present and empirically compare sixteen heuristic measures that evaluate the structure of a summary and assign it a single real-valued index representing its interestingness relative to other summaries generated from the same database. The heuristics are based upon well-known measures of diversity, dispersion, dominance, and inequality used in several areas of the physical, social, ecological, management, information, and computer sciences. Their use for ranking summaries generated from databases is a new application area. All sixteen heuristics rank less complex summaries (i.e., those with few tuples and/or few non-ANY attributes) as most interesting. We demonstrate that, for sample data sets, the rankings produced by some of the measures are highly correlated.
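As a rough illustration of what a single real-valued index over a summary's structure can look like, the sketch below computes two well-known diversity measures, Simpson's concentration index and Shannon entropy, from the tuple counts of two made-up summaries. The abstract does not name the sixteen measures or give their formulas, so the choice of these two, the function names, and the counts are assumptions for illustration only, not the paper's definitions.

```python
import math

def simpson_concentration(counts):
    """Simpson's concentration index: sum of squared tuple proportions.

    Higher values mean the population is concentrated in fewer tuples.
    """
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def shannon_entropy(counts):
    """Shannon entropy (in bits) of the tuple proportions.

    Lower values mean a less diverse, more concentrated summary.
    """
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical summaries: each list holds the number of database rows
# covered by each tuple of the summary (illustrative values only).
summaries = {
    "coarse": [6, 2],        # few tuples, uneven counts
    "fine": [2, 2, 2, 2],    # more tuples, uniform counts
}

# Rank by concentration, highest first (assumed ranking direction).
ranked = sorted(
    summaries,
    key=lambda name: simpson_concentration(summaries[name]),
    reverse=True,
)

for name in ranked:
    counts = summaries[name]
    print(name, simpson_concentration(counts), shannon_entropy(counts))
```

Under this assumed ranking direction, the summary with fewer tuples is placed first, which matches the behaviour the abstract reports for all sixteen heuristics.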
