Skypattern mining: From pattern condensed representations to dynamic constraint satisfaction problems

Data mining is the study of how to extract information from data and express it as useful knowledge. One of its most important subfields, pattern mining, involves searching for and enumerating interesting patterns in data. Various aspects of pattern mining are studied in the theory of computation and statistics. In the last decade, the pattern mining community has witnessed a sharp shift from efficiency-based approaches to methods that can extract more meaningful patterns. Recently, new methods have been studied that adapt results from economic efficiency and multi-criteria decision analysis, such as Pareto efficiency (skylines). Within pattern mining, this novel line of research allows preferences to be expressed easily according to a dominance relation. This approach is useful from a user-preference point of view and tends to promote the use of pattern mining algorithms by non-experts. We present a significant extension of our previous work [1,2] on the discovery of skyline patterns (or skypatterns), based on their theoretical relationships with condensed representations of patterns. We show how these relationships facilitate the computation of skypatterns, and we exploit them to propose a flexible and efficient approach to mining skypatterns using a dynamic constraint satisfaction problem (CSP) framework. We present a unified methodology covering our different approaches towards this goal. This work is supported by an extensive experimental study that illustrates the strengths and weaknesses of each approach.

[1]  Heikki Mannila,et al.  Levelwise Search and Borders of Theories in Knowledge Discovery , 1997, Data Mining and Knowledge Discovery.

[2]  Jilles Vreeken,et al.  Tell me what i need to know: succinctly summarizing data with itemsets , 2011, KDD.

[3]  Bruno Crémilleux,et al.  Extracting and summarizing the frequent emerging graph patterns from a dataset of graphs , 2011, Journal of Intelligent Information Systems.

[4]  Salvatore J. Stolfo,et al.  Using artificial anomalies to detect unknown and known network intrusions , 2003, Knowledge and Information Systems.

[5]  R. S. Laundy,et al.  Multiple Criteria Optimisation: Theory, Computation and Application , 1989 .

[6]  Anton Dries,et al.  Dominance Programming for Itemset Mining , 2013, 2013 IEEE 13th International Conference on Data Mining.

[7]  Edward Omiecinski,et al.  Alternative Interest Measures for Mining Associations in Databases , 2003, IEEE Trans. Knowl. Data Eng..

[8]  Bernhard Seeger,et al.  Progressive skyline computation in database systems , 2005, TODS.

[9]  Igor V. Tetko,et al.  ToxAlerts: A Web Server of Structural Alerts for Toxic Chemicals and Compounds with Potential Adverse Reactions , 2012, J. Chem. Inf. Model..

[10]  Matti Nykänen,et al.  Efficient Discovery of Statistically Significant Association Rules , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[11]  Chid Apte,et al.  Proceedings of the 2009 SIAM International Conference on Data Mining , 2009 .

[12]  Patrice Boizumault,et al.  Mining (Soft-) Skypatterns Using Dynamic CSP , 2014, CPAIOR.

[13]  Christophe Lecoutre,et al.  Constraint Networks: Techniques and Algorithms , 2009 .

[14]  Ronan Bureau,et al.  Automated detection of structural alerts (chemical fragments) in (eco)toxicology , 2013, Computational and structural biotechnology journal.

[15]  Lawrence B. Holder,et al.  Graph-Based Data Mining , 2000, IEEE Intell. Syst..

[16]  Arno J. Knobbe,et al.  Pattern Teams , 2006, PKDD.

[17]  Hendrik Blockeel,et al.  SCCQL: A constraint-based clustering system , 2013 .

[18]  David B. Skillicorn,et al.  Proceedings of the 2006 SIAM International Conference on Data Mining , 2006 .

[19]  Albrecht Zimmermann,et al.  The Chosen Few: On Identifying Valuable Patterns , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[20]  Yannis Manolopoulos,et al.  SkyGraph: an algorithm for important subgraph discovery in relational graphs , 2008, Data Mining and Knowledge Discovery.

[21]  Jean-François Boulicaut,et al.  Free-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency Queries , 2004, Data Mining and Knowledge Discovery.

[22]  Siau-Cheng Khoo,et al.  Mining and Ranking Generators of Sequential Patterns , 2008, SDM.

[23]  Chedy Raïssi,et al.  Mining Dominant Patterns in the Sky , 2011, 2011 IEEE 11th International Conference on Data Mining.

[24]  Jeffrey Xu Yu,et al.  Top-k Correlative Graph Mining , 2009, SDM.

[25]  Bruno Crémilleux,et al.  Mining constraint-based patterns using automatic relaxation , 2009, Intell. Data Anal..

[26]  Jon M. Kleinberg,et al.  Optimizing web traffic via the media scheduling problem , 2009, KDD.

[27]  Ronan Bureau,et al.  Introduction of Jumping Fragments in Combination with QSARs for the Assessment of Classification in Ecotoxicology , 2010, J. Chem. Inf. Model..

[28]  J. Gasteiger,et al.  Chemoinformatics: A Textbook , 2003 .

[29]  C. Hansch,et al.  Comparative QSAR evidence for a free-radical mechanism of phenol-induced toxicity. , 2000, Chemico-biological interactions.

[30]  Salvatore Orlando,et al.  ConQueSt: a Constraint-based Querying System for Exploratory Pattern Discovery , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[31]  Chid Apte,et al.  Proceedings of the 2008 SIAM International Conference on Data Mining , 2008 .

[32]  Jiawei Han,et al.  Exploratory mining and pruning optimizations of constrained associations rules , 1998 .

[33]  Luc De Raedt,et al.  Constraint programming for itemset mining , 2008, KDD.

[34]  Jiawei Han,et al.  TFP: an efficient algorithm for mining top-k frequent closed itemsets , 2005, IEEE Transactions on Knowledge and Data Engineering.

[35]  Jon M. Kleinberg,et al.  Group formation in large social networks: membership, growth, and evolution , 2006, KDD '06.

[36]  Jirí Matousek,et al.  Computing Dominances in E^n , 1991, Inf. Process. Lett..

[38]  Jilles Vreeken,et al.  Item Sets that Compress , 2006, SDM.

[39]  Jinyan Li,et al.  Mining and Ranking Generators of Sequential Patterns , 2008, SDM.

[40]  Patrice Boizumault,et al.  Mining Relevant Sequence Patterns with CP-Based Framework , 2014, 2014 IEEE 26th International Conference on Tools with Artificial Intelligence.

[41]  Donald Kossmann,et al.  The Skyline operator , 2001, Proceedings 17th International Conference on Data Engineering.

[42]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[43]  Bruno Crémilleux,et al.  Adequate condensed representations of patterns , 2008, Data Mining and Knowledge Discovery.

[44]  Oscar Cordón,et al.  MOSubdue: a Pareto dominance-based multiobjective Subdue algorithm for frequent subgraph mining , 2011, Knowledge and Information Systems.

[45]  H. T. Kung,et al.  On the Average Number of Maxima in a Set of Vectors and Applications , 1978, JACM.

[46]  Laks V. S. Lakshmanan,et al.  Exploratory mining and pruning optimizations of constrained associations rules , 1998, SIGMOD '98.

[47]  Emmanuel Coquery,et al.  A SAT-Based Approach for Discovering Frequent, Closed and Maximal Patterns in a Sequence , 2012, ECAI.

[48]  Matthijs van Leeuwen,et al.  Discovering Skylines of Subgroup Sets , 2013, ECML/PKDD.

[49]  Patrice Boizumault,et al.  Computing Skypattern Cubes , 2014, ECAI.

[50]  Tijl De Bie,et al.  An Information-Theoretic Approach to Finding Informative Noisy Tiles in Binary Databases , 2010, SDM.

[51]  Srinivasan Parthasarathy,et al.  Proceedings of the 2010 SIAM International Conference on Data Mining , 2010 .

[52]  Aristides Gionis,et al.  Assessing data mining results via swap randomization , 2007, TKDD.

[53]  Luc De Raedt,et al.  Constraint-Based Pattern Set Mining , 2007, SDM.

[54]  Beng Chin Ooi,et al.  Efficient Progressive Skyline Computation , 2001, VLDB.

[55]  Amedeo Napoli,et al.  The Model of Most Informative Patterns and Its Application to Knowledge Extraction from Graph Databases , 2009, ECML/PKDD.

[56]  Gérard Verfaillie,et al.  Constraint Solving in Uncertain and Dynamic Environments: A Survey , 2005, Constraints.

[57]  Patrice Boizumault,et al.  Constraint Programming for Mining n-ary Patterns , 2010, CP.

[58]  Chid Apte,et al.  Proceedings of the 2007 SIAM International Conference on Data Mining , 2007 .

[59]  Luc De Raedt,et al.  Itemset mining: A constraint programming perspective , 2011, Artif. Intell..

[60]  Nicolas Pasquier,et al.  Efficient Mining of Association Rules Using Closed Itemset Lattices , 1999, Inf. Syst..

[61]  Jean-François Boulicaut,et al.  A Survey on Condensed Representations for Frequent Sets , 2004, Constraint-Based Mining and Inductive Databases.

[62]  Geoffrey I. Webb,et al.  Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining , 2009, J. Mach. Learn. Res..

[63]  Toon Calders,et al.  Non-derivable itemset mining , 2007, Data Mining and Knowledge Discovery.

[64]  Rina Dechter,et al.  Belief Maintenance in Dynamic Constraint Networks , 1988, AAAI.

[65]  Stefano Bistarelli,et al.  Soft constraint based pattern mining , 2007, Data Knowl. Eng..

[66]  Bart Goethals,et al.  Tiling Databases , 2004, Discovery Science.

[67]  Nada Lavrac,et al.  Closed Sets for Labeled Data , 2006, PKDD.

[68]  Nello Cristianini,et al.  MINI: Mining Informative Non-redundant Itemsets , 2007, PKDD.

[69]  Stefan Wrobel,et al.  An Algorithm for Multi-relational Discovery of Subgroups , 1997, PKDD.

[70]  David Cohen,et al.  Principles and Practice of Constraint Programming - CP 2010 - 16th International Conference, CP 2010, St. Andrews, Scotland, UK, September 6-10, 2010. Proceedings , 2010, CP.

[71]  Ronan Bureau,et al.  Emerging Patterns as Structural Alerts for Computational Toxicology , 2013, Contrast Data Mining.

[72]  Mohammed J. Zaki,et al.  Data Mining in Computational Biology , 2009, Encyclopedia of Database Systems.