Mining Top-K Patterns from Binary Datasets in Presence of Noise

The discovery of patterns in binary datasets has many applications, e.g., in electronic commerce, TCP/IP networking, and Web usage logging. Still, this is a challenging task in several respects: patterns may or may not overlap, the data contain noise, and only the most important patterns should be extracted. In this paper we formalize the problem of discovering the Top-K patterns from binary datasets in the presence of noise as the minimization of a novel cost function. In accordance with the Minimum Description Length principle, the proposed cost function favors succinct pattern sets that may approximately describe the input data. We propose a greedy algorithm for the discovery of Patterns in Noisy Datasets, named PaNDa, and show that it outperforms related techniques on both synthetic and real-world data.
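To make the idea of an MDL-style cost concrete, the following is a minimal sketch and not the cost function or algorithm defined in the paper. It assumes, purely for illustration, that a pattern is a pair (set of item columns, set of transaction rows), that describing a pattern costs the sizes of its two sets, and that every cell where the pattern set disagrees with the data counts as one unit of noise. The names `description_cost` and `greedy_top_k`, the toy matrix, and the candidate list are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a pattern is (item columns, transaction rows).
Pattern = Tuple[Set[int], Set[int]]


def covered_cells(patterns: List[Pattern]) -> Set[Tuple[int, int]]:
    """Return every (row, column) cell claimed by at least one pattern."""
    cells = set()
    for items, rows in patterns:
        for r in rows:
            for c in items:
                cells.add((r, c))
    return cells


def description_cost(data: List[List[int]], patterns: List[Pattern]) -> int:
    """Assumed cost: size of the pattern set plus number of mismatched cells."""
    ones = {(r, c) for r, row in enumerate(data) for c, v in enumerate(row) if v}
    pattern_cost = sum(len(items) + len(rows) for items, rows in patterns)
    # Noise = ones left uncovered plus zeros wrongly covered.
    noise_cost = len(ones ^ covered_cells(patterns))
    return pattern_cost + noise_cost


def greedy_top_k(data: List[List[int]], candidates: List[Pattern], k: int):
    """Generic greedy selection: add the candidate that lowers the cost most."""
    chosen: List[Pattern] = []
    best = description_cost(data, chosen)
    while len(chosen) < k:
        scored = [
            (description_cost(data, chosen + [p]), i)
            for i, p in enumerate(candidates)
            if p not in chosen
        ]
        if not scored:
            break
        cost, i = min(scored)
        if cost >= best:  # no candidate shortens the description; stop early
            break
        chosen.append(candidates[i])
        best = cost
    return chosen, best


# Toy binary dataset: a 3x3 block with one missing cell, plus one isolated 1.
data = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
candidates = [
    ({0, 1, 2}, {0, 1, 2}),  # approximate block, tolerates one wrong cell
    ({0, 1}, {0, 1, 2}),     # exact sub-block
    ({3}, {3}),              # single isolated cell
]
selected, total = greedy_top_k(data, candidates, k=2)
```

Under these assumptions the approximate block is accepted even though it covers a zero, because describing it plus two noisy cells is shorter than listing all nine ones individually, while the isolated cell is cheaper to leave as noise than to encode as a pattern of its own.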
