Measuring the interestingness of discovered knowledge: A principled approach

When mining a large database, the number of patterns discovered can easily exceed the capabilities of a human user to identify interesting results. To address this problem, various techniques have been suggested to reduce and/or order the patterns prior to presenting them to the user. In this paper, our focus is on ranking summaries generated from a single dataset, where attributes can be generalized in many different ways and to many levels of granularity according to taxonomic hierarchies. We theoretically and empirically evaluate twelve diversity measures used as heuristic measures of interestingness for ranking summaries generated from databases. The twelve diversity measures have previously been utilized in various disciplines, such as information theory, statistics, ecology, and economics. We describe five principles that any measure must satisfy to be considered useful for ranking summaries. Theoretical results show that the proposed principles define a partial order on the ranked summaries in most cases, and in some cases, define a total order. Theoretical results also show that seven of the twelve diversity measures satisfy all five principles. We empirically analyze the rank order of the summaries as determined by each of the twelve measures. These empirical results show that the measures tend to rank the less complex summaries as most interesting. We also analyze the distribution of the index values generated by each of the twelve diversity measures. Empirical results, obtained using synthetic data, show that the distribution of index values generated tends to be highly skewed about the mean, median, and middle index values. Finally, we demonstrate a technique, based upon our principles, for visualizing the relative interestingness of summaries. The objective of this work is to gain some insight into the behaviour that can be expected from our principled approach in practice.
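
To make the idea of ranking summaries by a diversity measure concrete, here is a minimal sketch. The abstract does not name the twelve measures, so the two indices used here (Shannon entropy and Simpson's concentration index) are stand-ins chosen for illustration only. The representation of a summary as a list of tuple counts, the names `shannon_entropy`, `simpson_index`, and `rank_summaries`, the sample data, and the choice to sort in descending order are all assumptions, not the paper's method.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of the distribution given by the counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def simpson_index(counts):
    """Simpson's concentration index: the sum of squared proportions."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def rank_summaries(summaries, measure, descending=True):
    """Score each summary with `measure` and sort by index value.

    Sorting in descending order is an assumption made for this sketch;
    which direction counts as "more interesting" depends on the measure.
    """
    scored = [(name, measure(counts)) for name, counts in summaries.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=descending)

# Hypothetical summaries: each list holds the tuple counts of one summary.
summaries = {
    "A": [50, 50],
    "B": [25, 25, 25, 25],
    "C": [97, 1, 1, 1],
}

for name, score in rank_summaries(summaries, simpson_index):
    print(name, round(score, 4))
```

With this sample data the two indices order the summaries differently, which is the kind of behaviour the principles described above are meant to characterize.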
