Mining Dominant Patterns in the Sky

Pattern discovery is at the core of numerous data mining tasks. Although many methods focus on making pattern mining efficient, they still require the user to choose a threshold, and that choice influences the final extraction result. The goal of our study is to make the results of pattern mining useful from a user-preference point of view. To this end, we integrate the idea of skyline queries into the pattern discovery process in order to mine skyline patterns in a threshold-free manner. Because skyline patterns satisfy a formal dominance property, they are not only of global interest but also have semantics that the user can easily understand. In this work, we first establish theoretical relationships between condensed representations of patterns and skyline pattern mining. We also show that it is possible to automatically compute a subset of the measures involved in the user query that allows the patterns to be condensed and thus facilitates the computation of the skyline patterns. This forms the basis for a novel approach to mining skyline patterns. We illustrate the efficiency of our approach on several data sets, including a use case from chemoinformatics, and show that small sets of dominant patterns are produced under various measures.
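To make the dominance property concrete, the following is a minimal sketch of what a skyline of patterns looks like. It is not the approach described above: it enumerates every itemset by brute force and uses no condensed representation. The toy transactions, the choice of measures (frequency and pattern length), and all function names are assumptions made purely for illustration.

```python
from itertools import combinations

# Hypothetical toy transaction database (not taken from the paper).
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]


def frequency(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


def dominates(u, v):
    """u dominates v if u is at least as good on every measure
    and strictly better on at least one (higher is better)."""
    return (all(x >= y for x, y in zip(u, v))
            and any(x > y for x, y in zip(u, v)))


def skyline_patterns(transactions):
    """Brute-force skyline over the measures (frequency, length).

    Every non-empty itemset occurring in the data is scored, then only
    the patterns whose measure vector no other pattern dominates are kept.
    No frequency threshold is needed.
    """
    items = sorted(set().union(*transactions))
    scored = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            pattern = frozenset(combo)
            freq = frequency(pattern, transactions)
            if freq > 0:
                scored[pattern] = (freq, len(pattern))
    return {
        pattern: measures
        for pattern, measures in scored.items()
        if not any(dominates(other, measures) for other in scored.values())
    }


if __name__ == "__main__":
    result = skyline_patterns(TRANSACTIONS)
    for pattern, measures in sorted(result.items(), key=lambda kv: kv[1]):
        print(sorted(pattern), measures)
```

On this toy data, 8 of the 15 occurring itemsets survive, and they cover only four distinct measure vectors: (4, 1), (3, 2), (2, 3) and (1, 4). The itemset {d}, for example, is dropped because {a} is more frequent at the same length. Exhaustive enumeration is exponential in the number of items, which is the cost that exploiting condensed representations is meant to avoid.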
