Memory-efficient frequent-itemset mining

Efficient discovery of frequent itemsets in large datasets is a key component of many data mining tasks. In-core algorithms, which operate entirely in main memory and avoid expensive disk accesses, are generally among the most efficient algorithms available; this holds in particular for the prefix-tree-based algorithm FP-growth. Unfortunately, their high memory requirements render them inapplicable to large datasets with many distinct items and/or itemsets of high cardinality. To overcome this limitation, we propose two novel data structures, the CFP-tree and the CFP-array, which reduce memory consumption by about an order of magnitude. This allows us to process significantly larger datasets in main memory than was previously possible. Our data structures are based on structural modifications of the prefix tree that increase compressibility, an optimized physical representation, lightweight compression techniques, and intelligent node ordering and indexing. Experiments with both real-world and synthetic datasets show the effectiveness of our approach.
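
As background, the following minimal sketch shows the kind of plain, uncompressed prefix tree that FP-growth-style algorithms build, and why shared prefixes already save space. It is not the CFP-tree or CFP-array, whose layout is not specified here; the names `Node`, `build_prefix_tree`, and `count_nodes` are illustrative assumptions.

```python
from collections import Counter


class Node:
    """One prefix-tree node: an item, a support count, and child links."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> Node


def build_prefix_tree(transactions, min_support):
    """Build an uncompressed prefix tree over the frequent items only."""
    # First pass: count in how many transactions each item occurs.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in freq.items() if c >= min_support}

    root = Node(None)
    # Second pass: insert each transaction, most frequent items first,
    # so that transactions with common items share a path from the root.
    for t in transactions:
        items = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),  # ties broken by item name
        )
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node.children.values())
```

Each node in this sketch carries a parent reference, a child dictionary, and a count, which hints at why the memory footprint of an uncompressed tree grows quickly with the number of distinct items and with itemset cardinality.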
