On mining approximate and exact fault-tolerant frequent itemsets

Robust frequent itemset mining has attracted much attention because many applications need to find frequent patterns in noisy data. In this paper, we focus on a variant of robust frequent itemsets in which a small number of “faults” is allowed in each item and each supporting transaction. This problem is challenging because computing the fault-tolerant support count is NP-hard and the anti-monotone property does not hold when the number of allowable faults is proportional to the size of the itemset. We develop heuristic methods to solve an approximate version of the problem and propose speedup techniques for the exact problem. Experimental results show that our heuristic algorithms are substantially faster than the state-of-the-art exact algorithms, with acceptable error. In addition, the proposed speedup techniques substantially improve the efficiency of the exact algorithms.
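
To make the two difficulties concrete, the following is a minimal brute-force sketch and not the algorithm of this paper. It assumes one common formulation of proportional fault tolerance: a set of transactions supports an itemset if each transaction misses at most floor(row_tol * |itemset|) of its items, and each item is missing from at most floor(col_tol * |transactions|) of the chosen transactions. The function name ft_support, the parameters row_tol and col_tol, and the toy database are all hypothetical. The search enumerates subsets of transactions, so it is exponential in the worst case.

```python
from itertools import combinations
from math import floor


def ft_support(db, itemset, row_tol, col_tol):
    """Brute-force fault-tolerant support count under the ASSUMED definition:
    the size of the largest transaction subset in which every transaction
    misses at most floor(row_tol * |itemset|) items and every item is missing
    from at most floor(col_tol * |subset|) transactions."""
    items = list(itemset)
    max_row_faults = floor(row_tol * len(items))
    # Only transactions within the per-transaction fault budget can qualify.
    cands = [t for t in db
             if sum(1 for i in items if i not in t) <= max_row_faults]
    # Try subset sizes from largest to smallest; the first feasible size wins.
    for k in range(len(cands), 0, -1):
        max_col_faults = floor(col_tol * k)
        for subset in combinations(cands, k):
            if all(sum(1 for t in subset if i not in t) <= max_col_faults
                   for i in items):
                return k
    return 0


# Toy database (hypothetical): each transaction misses exactly one item.
db = [{"a", "c"}, {"b", "c"}, {"a", "b"}]

sup_abc = ft_support(db, {"a", "b", "c"}, 0.4, 0.4)  # one fault allowed per row
sup_ab = ft_support(db, {"a", "b"}, 0.4, 0.4)        # no fault allowed per row
print(sup_abc, sup_ab)
```

Under these assumptions the superset {a, b, c} has a larger fault-tolerant support than its subset {a, b}, because the larger itemset earns a larger fault budget. This is the sense in which the anti-monotone property can fail when the allowance is proportional to the itemset size.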
