The Pattern Ordering Problem

Many pattern discovery methods can quickly find the frequently occurring patterns in large data sets. Such pattern collections summarize the data set well and can also be used to approximate the underlying joint distribution. However, a large collection of patterns is unintuitive and not necessarily easy to use. In this paper we consider the problem of ordering a collection of patterns so that each prefix of the ordering summarizes the data as well as possible. We formulate this problem for general loss functions, show that the problem has an efficient solution, and prove that its natural variant is NP-complete, although the greedy algorithm approximates it within a factor of e/(e-1) ≈ 1.58. We apply the general technique to approximating the frequencies of frequent sets and show that the method gives good empirical results.
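To make the ordering idea concrete, the following is a minimal sketch of a greedy ordering procedure, not the paper's exact algorithm. The function names `greedy_order` and `make_frequency_loss`, the toy frequencies, and the particular loss are all illustrative assumptions: the loss estimates the frequency of each set as the largest frequency among its supersets in the chosen prefix (0 if none is chosen) and sums the absolute errors.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence

Pattern = FrozenSet[str]
Loss = Callable[[Sequence[Pattern]], float]


def greedy_order(patterns: Sequence[Pattern], loss: Loss) -> List[Pattern]:
    """Order patterns so that each step adds the pattern that
    minimizes the loss of the resulting prefix (greedy choice)."""
    remaining = list(patterns)
    order: List[Pattern] = []
    while remaining:
        # Evaluate the loss of the current prefix extended by each candidate.
        best = min(remaining, key=lambda p: loss(order + [p]))
        order.append(best)
        remaining.remove(best)
    return order


def make_frequency_loss(freq: Dict[Pattern, float]) -> Loss:
    """Build an illustrative loss (an assumption, not taken from the paper):
    sum of absolute errors when each frequency is estimated from the prefix."""

    def estimate(x: Pattern, chosen: Sequence[Pattern]) -> float:
        # A chosen superset of x gives a lower bound on the frequency of x;
        # use the best such bound, or 0.0 when no superset has been chosen.
        return max((freq[y] for y in chosen if x <= y), default=0.0)

    def loss(chosen: Sequence[Pattern]) -> float:
        return sum(abs(freq[x] - estimate(x, chosen)) for x in freq)

    return loss


# Toy collection of frequent sets with hypothetical frequencies.
freq: Dict[Pattern, float] = {
    frozenset("a"): 0.8,
    frozenset("b"): 0.6,
    frozenset("c"): 0.5,
    frozenset("ab"): 0.5,
}

frequency_loss = make_frequency_loss(freq)
order = greedy_order(list(freq), frequency_loss)

for k in range(1, len(order) + 1):
    prefix = order[:k]
    print([sorted(p) for p in prefix], round(frequency_loss(prefix), 3))
```

In this toy run the set {a, b} is picked first, because it alone gives nonzero estimates for three of the four sets. Each greedy step calls the loss once per remaining pattern, so the sketch uses a quadratic number of loss evaluations in the size of the collection.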
