Condensed Representations for Data Mining

INTRODUCTION

Condensed representations were proposed in (Mannila & Toivonen, 1996) as a useful concept for optimizing typical data mining tasks. They appear as a key concept within the inductive database framework (De Raedt, 2002), and this paper introduces this research domain, its achievements in the context of frequent itemset mining (FIM) from transactional data, and its future trends.

Within the inductive database framework, knowledge discovery processes are considered as querying processes. Inductive databases (IDBs) contain not only data but also patterns. In an IDB, ordinary queries can be used to access and manipulate data, while inductive queries can be used to generate (mine), manipulate, and apply patterns.

To motivate the need for condensed representations, let us start from the simple model proposed in (Mannila & Toivonen, 1997). Many data mining tasks can be abstracted into the computation of a theory. Given a language L of patterns (e.g., itemsets), a database instance r (e.g., a transactional database), and a selection predicate q that specifies whether a given pattern is interesting (e.g., the itemset is frequent in r), a data mining task can be formalized as the computation of Th(L,q,r) = {φ ∈ L | q(φ,r) is true}. This can also be considered as the evaluation of the inductive query q. Notice that it specifies that every pattern satisfying q has to be computed. This completeness assumption is quite common for local pattern discovery tasks but is generally not acceptable for more complex tasks (e.g., accuracy optimization for predictive model mining).

The selection predicate q can be defined as a Boolean expression over primitive constraints (e.g., a minimal frequency constraint used in conjunction with a syntactic constraint that enforces the presence or absence of some sub-patterns). Some of the primitive constraints refer to the "behavior" of a pattern in the data by means of so-called evaluation functions (e.g., frequency).
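
As an illustration only, the definition of Th(L,q,r) can be sketched in Python on a toy transactional database. The database r, the threshold min_freq, the required item, and all function names below are assumptions made for this example and do not come from the original text. The sketch enumerates every itemset of L and tests q on each one, so it is only workable for a handful of items.

```python
from itertools import combinations

# Hypothetical toy transactional database r: each transaction is a set of items.
r = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]


def frequency(itemset, db):
    """Evaluation function: number of transactions that contain the itemset."""
    return sum(1 for t in db if itemset <= t)


def language(db):
    """Pattern language L: all non-empty itemsets over the items seen in db."""
    items = sorted(set().union(*db))
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def q(itemset, db, min_freq=3, required="a"):
    """Selection predicate (assumed for this example): a minimal frequency
    constraint in conjunction with a syntactic constraint (item must occur)."""
    return frequency(itemset, db) >= min_freq and required in itemset


def theory(db, predicate):
    """Th(L, q, r): every pattern of L for which the predicate holds on db."""
    return {phi for phi in language(db) if predicate(phi, db)}


result = theory(r, q)
```

On this toy database, the patterns that are both frequent and contain the item "a" are {a}, {a,b}, and {a,c}, each supported by at least three transactions.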

To support the whole knowledge discovery process, it is important to support the computation of many different but correlated theories. It is well known that a "generate and test" approach, which would enumerate the sentences of L and then test the selection predicate q on each of them, is generally infeasible. Data mining researchers have made a huge effort to make active use of the primitive constraints occurring in q in order to achieve a tractable evaluation of useful mining queries. It is the domain of constraint-based …
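
One well-known way to make active use of a constraint is the levelwise search of (Mannila & Toivonen, 1997). For minimal frequency it exploits anti-monotonicity: every subset of a frequent itemset is itself frequent, so a candidate with an infrequent subset never needs to be counted. The following Python sketch is a simplified illustration under that assumption, not the algorithm of any particular cited system; the toy database and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy transactional database (same shape as a list of item sets).
r = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]


def frequency(itemset, db):
    """Number of transactions that contain the itemset."""
    return sum(1 for t in db if itemset <= t)


def levelwise_frequent(db, min_freq):
    """Levelwise search for frequent itemsets.

    Returns a dict mapping each frequent itemset to its frequency, plus the
    number of candidates whose frequency was actually counted.
    """
    items = sorted(set().union(*db))
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    tested = 0
    k = 1
    while candidates:
        level = set()
        for c in candidates:
            tested += 1
            f = frequency(c, db)
            if f >= min_freq:
                frequent[c] = f
                level.add(c)
        # Build size k+1 candidates and keep only those whose size-k subsets
        # are all frequent (pruning justified by anti-monotonicity).
        k += 1
        unions = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in unions
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent, tested


frequent, tested = levelwise_frequent(r, min_freq=3)
```

On this toy database the search counts 8 candidates, whereas enumerating every non-empty itemset over the four items would count 15. The gap widens quickly as the number of items grows.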

[1] Laks V. S. Lakshmanan, et al. Exploratory mining and pruning optimizations of constrained associations rules, 1998, SIGMOD '98.

[2] Jinyan Li, et al. Efficient mining of emerging patterns: discovering trends and differences, 1999, KDD '99.

[3] Christophe Rigotti, et al. A condensed representation to find frequent patterns, 2001, PODS '01.

[4] Jian Pei, et al. On computing condensed frequent pattern bases, 2002, 2002 IEEE International Conference on Data Mining.

[5] Heikki Mannila, et al. Multiple Uses of Frequent Sets and Condensed Representations (Extended Abstract), 1996, KDD.

[6] Bart Goethals, et al. Advances in frequent itemset mining implementations: report on FIMI'03, 2004, SKDD.

[7] Jean-François Boulicaut, et al. Optimization of association rule mining queries, 2002, Intell. Data Anal.

[8] Roberto J. Bayardo, et al. Efficiently mining long patterns from databases, 1998, SIGMOD '98.

[9] Heikki Mannila, et al. Levelwise Search and Borders of Theories in Knowledge Discovery, 1997, Data Mining and Knowledge Discovery.

[10] Gerd Stumme, et al. Mining frequent patterns with counting inference, 2000, SKDD.

[11] Daniel Kifer, et al. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints, 2002, Data Mining and Knowledge Discovery.

[12] Jean-François Boulicaut, et al. Approximation of Frequency Queries by Means of Free-Sets, 2000, PKDD.

[13] Luc De Raedt, et al. The Levelwise Version Space Algorithm and its Application to Molecular Fragment Finding, 2001, IJCAI.

[14] Jean-François Boulicaut, et al. Frequent Closures as a Concise Representation for Binary Data Mining, 2000, PAKDD.

[15] Luc De Raedt, et al. A perspective on inductive databases, 2002, SKDD.

[16] Nicolas Pasquier, et al. Efficient Mining of Association Rules Using Closed Itemset Lattices, 1999, Inf. Syst.

[17] Tom M. Mitchell, et al. Generalization as Search, 2002.

[18] Jean-François Boulicaut, et al. Modeling KDD Processes within the Inductive Database Framework, 1999, DaWaK.

[19] Toon Calders, et al. Minimal k-Free Representations of Frequent Sets, 2003, PKDD.

[20] Jean-François Boulicaut, et al. Free-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency Queries, 2004, Data Mining and Knowledge Discovery.

[21] Xifeng Yan, et al. CloSpan: Mining Closed Sequential Patterns in Large Datasets, 2003, SDM.

[22] Mohammed J. Zaki, et al. CHARM: An Efficient Algorithm for Closed Itemset Mining, 2002, SDM.

[23] Heikki Mannila, et al. Fast Discovery of Association Rules, 1996, Advances in Knowledge Discovery and Data Mining.

[24] Lotfi Lakhal, et al. Cube Lattices: A Framework for Multidimensional Data Mining, 2003, SDM.

[25] Marzena Kryszkiewicz, et al. Dataless Transitions Between Concise Representations of Frequent Patterns, 2004, Journal of Intelligent Information Systems.

[26] Toon Calders, et al. Mining All Non-derivable Frequent Itemsets, 2002, PKDD.

[27] Heikki Mannila, et al. A database perspective on knowledge discovery, 1996, CACM.