From Local Patterns to Global Models: The LeGo Approach to Data Mining

In this paper we present LeGo, a generic framework that utilizes existing local pattern mining techniques for global modeling in a variety of diverse data mining tasks. In the spirit of well-known KDD process models, our work identifies different phases within the data mining step, each of which is formulated in terms of different formal constraints. It starts with a phase of mining patterns that are individually promising. Later phases establish the context given by the global data mining task by selecting groups of diverse and highly informative patterns, which are finally combined into one or more global models that address the overall data mining task(s). The paper discusses the connection to various learning techniques, and illustrates that our framework is broad enough to cover and leverage frequent pattern mining, subgroup discovery, pattern teams, multi-view learning, and several other popular algorithms. The Safarii learning toolbox serves as a proof of concept of the framework's high potential for practical data mining applications. Finally, we point out several challenging open research questions that naturally emerge in a constraint-based local-to-global pattern mining, selection, and combination framework.
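
The three phases described above (local pattern mining, pattern set selection, global modeling) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's or Safarii's implementation: the toy data, the function names, and the specific choices of a minimum-support constraint for the first phase, a greedy joint-entropy criterion for the second, and a majority-class lookup table for the third.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Hypothetical toy data: (transaction, binary class label).
DATA = [
    ({"a", "b"}, 1),
    ({"a", "b", "c"}, 1),
    ({"a", "c"}, 1),
    ({"b", "c"}, 0),
    ({"c"}, 0),
    ({"b"}, 0),
]

def mine_local_patterns(data, min_support=2, max_size=2):
    """Phase 1: enumerate itemsets that satisfy a minimum-support constraint."""
    items = sorted(set().union(*(t for t, _ in data)))
    patterns = []
    for size in range(1, max_size + 1):
        for cand in combinations(items, size):
            support = sum(1 for t, _ in data if set(cand) <= t)
            if support >= min_support:
                patterns.append(frozenset(cand))
    return patterns

def features(pattern_set, transaction):
    """Binary vector: which patterns of the set occur in the transaction."""
    return tuple(int(p <= transaction) for p in pattern_set)

def joint_entropy(pattern_set, data):
    """Entropy of the joint distribution of the pattern indicators."""
    counts = Counter(features(pattern_set, t) for t, _ in data)
    n = len(data)
    return -sum(c / n * log2(c / n) for c in counts.values())

def select_pattern_set(patterns, data, k=2):
    """Phase 2: greedily pick k patterns that maximize joint entropy."""
    chosen, remaining = [], list(patterns)
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda p: joint_entropy(chosen + [p], data))
        chosen.append(best)
        remaining.remove(best)
    return chosen

def build_global_model(pattern_set, data):
    """Phase 3: majority-class lookup table over the pattern feature vectors."""
    votes = {}
    for t, y in data:
        votes.setdefault(features(pattern_set, t), Counter())[y] += 1
    default = Counter(y for _, y in data).most_common(1)[0][0]
    table = {f: c.most_common(1)[0][0] for f, c in votes.items()}
    return lambda t: table.get(features(pattern_set, t), default)

patterns = mine_local_patterns(DATA)
chosen = select_pattern_set(patterns, DATA, k=2)
model = build_global_model(chosen, DATA)
```

Each phase only consumes the output of the previous one, so in a sketch like this any of the three functions could be swapped for a different miner, selection criterion, or model without touching the others.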
