Theoretical Bounds on the Size of Condensed Representations

Recent studies demonstrate the usefulness of condensed representations as a semantic compression technique for frequent itemsets. In inductive databases especially, condensed representations are useful as an intermediate format that supports exploration of the itemset space. In this paper we establish theoretical upper bounds on the maximal size of an itemset in different condensed representations. Central to the development of the bounds are the l-free sets, which form the basis of many well-known representations. We bound the maximal cardinality of an l-free set in terms of the size of the database. More concretely, we compute a lower bound on the size of the database in terms of the size of the l-free set; when the database is smaller than this lower bound, the set cannot be l-free. We present an efficient method for calculating the exact value of the bound, based on combinatorial identities of partial row sums. We also present preliminary results on a statistical approximation of the bound and illustrate the results with some simulations.
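
The abstract does not state the bound itself, so the following is only a minimal sketch of the combinatorial quantity it mentions. It assumes that "partial row sum" means the sum of the first l+1 entries of row k of Pascal's triangle; the function names and the recurrence shown are illustrative choices and are not taken from the paper.

```python
from math import comb


def partial_row_sum(k: int, l: int) -> int:
    """Sum of C(k, 0) + ... + C(k, l): a partial row sum of Pascal's triangle.

    Assumption: this is the kind of quantity the abstract refers to;
    how it enters the paper's bound is not stated in the abstract.
    """
    return sum(comb(k, i) for i in range(l + 1))


def partial_row_sum_rec(k: int, l: int) -> int:
    """Same value via the identity S(n, l) = 2 * S(n - 1, l) - C(n - 1, l).

    Starts from S(0, l) = 1 (valid for l >= 0) and climbs one row at a time,
    so each step needs a single binomial coefficient.
    """
    s = 1
    for n in range(1, k + 1):
        s = 2 * s - comb(n - 1, l)
    return s
```

For example, `partial_row_sum(5, 2)` adds 1 + 5 + 10, and `partial_row_sum_rec(5, 2)` reaches the same value by doubling row by row.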
