P-Mine: Parallel itemset mining on large datasets

Itemset mining is a well-known exploratory technique used to discover interesting correlations hidden in a data collection. Since ever-increasing amounts of data are being collected and stored (e.g., business transactions, medical and biological data, context-aware applications), scalable and efficient approaches are needed to analyze these large data collections. This paper proposes a parallel disk-based approach that efficiently supports frequent itemset mining on a multi-core processor. Our parallel strategy is presented in the context of the VLDB-Mine persistent data structure. Different techniques have been proposed to optimize both the data-intensive and the compute-intensive aspects of the mining algorithm. Preliminary experiments, performed on both real and synthetic datasets, show promising results in improving the efficiency and scalability of the mining activity on large datasets.
