Paper information - Mining Top-K Patterns from Binary Datasets in Presence of Noise

The discovery of patterns in binary datasets has many applications, e.g. in electronic commerce, TCP/IP networking, and Web usage logging. Still, it is a challenging task in several respects: patterns may or may not overlap, the data may contain noise, and only the most important patterns should be extracted. In this paper we formalize the problem of discovering the Top-K patterns from binary datasets in the presence of noise as the minimization of a novel cost function. In accordance with the Minimum Description Length principle, the proposed cost function favors succinct pattern sets that approximately describe the input data. We propose a greedy algorithm for the discovery of Patterns in Noisy Datasets, named PaNDa, and show that it outperforms related techniques on both synthetic and real-world data.
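The abstract does not spell out the cost function, so the following is only a minimal sketch of an MDL-style trade-off of the kind described, under stated assumptions: a pattern is assumed to be a pair (item set, transaction set), its description length is assumed to be the sum of the two set sizes, and noise is assumed to be the number of cells where the data and the union of the patterns disagree. The names `cost`, `covered_cells`, and the toy dataset are hypothetical and are not taken from the paper.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Cell = Tuple[int, int]                            # (transaction id, item id) holding a 1
Pattern = Tuple[FrozenSet[int], FrozenSet[int]]   # (items, transactions), assumed encoding


def covered_cells(patterns: Iterable[Pattern]) -> Set[Cell]:
    """Cells set to 1 by the union of the patterns (overlaps are allowed)."""
    cells: Set[Cell] = set()
    for items, transactions in patterns:
        for t in transactions:
            for i in items:
                cells.add((t, i))
    return cells


def cost(data: Set[Cell], patterns: Iterable[Pattern]) -> int:
    """Assumed MDL-style cost: size of the pattern descriptions plus size of the noise."""
    patterns = list(patterns)
    pattern_length = sum(len(items) + len(ts) for items, ts in patterns)
    # Noise = cells where data and pattern cover disagree (misses and false positives).
    noise = len(data ^ covered_cells(patterns))
    return pattern_length + noise


# Toy dataset: a 4x4 block of ones with one missing cell, plus one stray one.
data: Set[Cell] = {(t, i) for t in range(4) for i in range(4)}
data.discard((3, 3))   # a missing cell inside the block
data.add((4, 4))       # an isolated noisy cell

approx: Pattern = (frozenset(range(4)), frozenset(range(4)))   # tolerates the missing cell
exact: Pattern = (frozenset(range(4)), frozenset(range(3)))    # only fully covered rows

print(cost(data, []))        # 16: every one is accounted for as noise
print(cost(data, [approx]))  # 10: 8 for the pattern + 2 noisy cells
print(cost(data, [exact]))   # 11: 7 for the pattern + 4 noisy cells
```

Under these assumptions the approximate pattern is cheaper than both the empty pattern set and the exact one, which illustrates how such a cost can favor a succinct pattern that only approximately describes the data. A greedy procedure would repeatedly add or extend the pattern that lowers this cost the most, stopping after K patterns or when no candidate reduces the cost.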
