On Interactive Pattern Mining from Relational Databases

In this paper we present ConQueSt, a constraint-based querying system designed to support the intrinsically exploratory (i.e., human-guided, interactive, iterative) nature of pattern discovery. Following the inductive database vision, our framework provides users with an expressive constraint-based query language that allows the discovery process to be driven effectively toward potentially interesting patterns. These constraints are also exploited to reduce the cost of pattern mining computation. We implemented a comprehensive mining system that can access real-world relational databases and extract data from them. After a preprocessing step, mining queries are answered by an efficient pattern mining engine that employs several data and search space reduction techniques. The resulting patterns are then presented to the user and, optionally, stored in the database. New user-defined constraints can easily be added to the system to target the particular application at hand.
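To illustrate how a constraint can reduce the cost of mining, the following is a minimal sketch of levelwise (Apriori-style) frequent itemset mining in which an anti-monotone constraint is used to prune candidates before their support is counted. It is not ConQueSt's actual engine or query language: the function name `mine_constrained`, the `price` attribute, and the budget constraint are all hypothetical, chosen only for illustration, and prices are assumed to be non-negative so that the budget constraint is anti-monotone.

```python
from itertools import combinations


def mine_constrained(transactions, min_support, price, max_total_price):
    """Return {itemset: support} for itemsets that are frequent and within budget.

    Both constraints are anti-monotone (assuming non-negative prices): if an
    itemset violates either one, every superset violates it too, so the
    itemset can be discarded without ever generating its supersets.
    """
    def support(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    def within_budget(itemset):
        return sum(price[i] for i in itemset) <= max_total_price

    def keep(candidates):
        # Check the cheap constraint first; count support only for survivors.
        kept = {}
        for c in candidates:
            if within_budget(c):
                s = support(c)
                if s >= min_support:
                    kept[c] = s
        return kept

    items = {i for t in transactions for i in t}
    level = keep(frozenset([i]) for i in items)
    result = {}
    while level:
        result.update(level)
        candidates = set()
        for a, b in combinations(level, 2):
            c = a | b
            # Join two k-itemsets into a (k+1)-itemset, and keep it only if
            # all of its k-subsets survived the previous level.
            if len(c) == len(a) + 1 and all(c - {x} in level for x in c):
                candidates.add(c)
        level = keep(candidates)
    return result


if __name__ == "__main__":
    baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    prices = {"a": 1, "b": 2, "c": 5}
    for itemset, s in mine_constrained(baskets, 3, prices, 6).items():
        print(sorted(itemset), s)
```

In this toy run the itemset {b, c} is frequent but exceeds the budget, so it is dropped, and {a, b, c} is never generated as a candidate. This is the sense in which a constraint shrinks the search space instead of merely filtering the final result.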
