Data Mining in Large Databases Using Domain Generalization Graphs

Attribute-oriented generalization summarizes the information in a relational database by repeatedly replacing specific attribute values with more general concepts according to user-defined concept hierarchies. We introduce domain generalization graphs (DGGs) for controlling the generalization of a set of attributes and show how they are constructed. We then present serial and parallel versions of the Multi-Attribute Generalization algorithm for traversing the generalization state space described by joining the DGGs for multiple attributes. Based upon a generate-and-test approach, the algorithm generates all possible summaries consistent with the DGGs. Our experimental results show that significant speedups are possible by partitioning path combinations from the DGGs across multiple processors. We also rank the interestingness of the resulting summaries using measures based upon variance and relative entropy, and our experimental results show that these measures provide an effective basis for analyzing summary data generated from relational databases. Variance appears more useful because it tends to rank the less complex summaries (i.e., those with few attributes and/or tuples) as more interesting.
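
The generate-and-test idea can be illustrated with a minimal serial sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy relation, the two chains of generalization levels (a real DGG may be a lattice rather than a chain), the function names, and the particular variance formula, which here is the sample variance of tuple proportions around the uniform value 1/m.

```python
from collections import Counter
from itertools import product

# Hypothetical toy relation of (city, age) tuples.
ROWS = [
    ("Regina", 23), ("Saskatoon", 37), ("Calgary", 41),
    ("Regina", 68), ("Edmonton", 29), ("Calgary", 35),
]

PROVINCE = {"Regina": "SK", "Saskatoon": "SK", "Calgary": "AB", "Edmonton": "AB"}

# Hypothetical generalization levels per attribute, most specific first.
# Each level maps a raw value to a concept at that level.
CITY_LEVELS = {
    "city": lambda c: c,
    "province": lambda c: PROVINCE[c],
    "ANY": lambda c: "ANY",
}
AGE_LEVELS = {
    "age": lambda a: a,
    "age_group": lambda a: "young" if a < 30 else "middle" if a < 60 else "senior",
    "ANY": lambda a: "ANY",
}

def summarize(rows, city_fn, age_fn):
    """Generalize every tuple, then merge identical tuples and keep a count."""
    return Counter((city_fn(c), age_fn(a)) for c, a in rows)

def variance_score(summary):
    """Assumed measure: sample variance of tuple proportions around 1/m."""
    m = len(summary)
    if m < 2:
        return 0.0
    total = sum(summary.values())
    q = 1.0 / m
    return sum((n / total - q) ** 2 for n in summary.values()) / (m - 1)

def all_summaries(rows):
    """Generate one summary per combination of levels, one attribute each."""
    out = {}
    for (cn, cf), (an, af) in product(CITY_LEVELS.items(), AGE_LEVELS.items()):
        out[(cn, an)] = summarize(rows, cf, af)
    return out

summaries = all_summaries(ROWS)
# Rank level combinations from highest to lowest score.
ranked = sorted(summaries, key=lambda k: variance_score(summaries[k]), reverse=True)
```

Because each level combination is summarized independently, the loop in `all_summaries` is the natural place to split work across processors, which is one plausible reading of partitioning path combinations.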
