Toward terabyte pattern mining: an architecture-conscious solution

We present a strategy for mining frequent item sets from terabyte-scale data sets on cluster systems. The algorithm embraces the holistic notion of architecture-conscious datamining, taking into account the capabilities of the processor, the memory hierarchy and the available network interconnects. Optimizations have been designed for lowering communication costs using compressed data structures and a succinct encoding. Optimizations for improving cache, memory and I/O utilization using pruningand tiling techniques, and smart data placement strategies are also employed. We leverage the extended memory spaceand computational resources of a distributed message-passing clusterto design a scalable solution, where each node can extend its metastructures beyond main memory by leveraging 64-bit architecture support. Our solution strategy is presented in the context of FPGrowth, a well-studied and rather efficient frequent pattern mining algorithm. Results demonstrate that the proposed strategy result in near-linearscaleup on up to 48 nodes.

[1]  Rajeev Motwani,et al.  Scalable Techniques for Mining Causal Structures , 1998, Data Mining and Knowledge Discovery.

[2]  Ramakrishnan Srikant,et al.  Mining sequential patterns , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[3]  Wei Li,et al.  New parallel algorithms for fast discovery of associ-ation rules , 1997 .

[4]  Wei Li,et al.  New algorithms for fast discovery discovery of association rules , 1995 .

[5]  Srinivasan Parthasarathy,et al.  Shared State for Distributed Interactive Data Mining Applications , 2002, Distributed and Parallel Databases.

[6]  Srinivasan Parthasarathy,et al.  Out-of-core frequent pattern mining on a commodity PC , 2006, KDD '06.

[7]  Srinivasan Parthasarathy,et al.  Mining frequent itemsets in distributed and dynamic databases , 2003, Third IEEE International Conference on Data Mining.

[8]  Mohammed J. Zaki Parallel and distributed association mining: a survey , 1999, IEEE Concurr..

[9]  Ashfaq Khokhar,et al.  Frequent Pattern Mining on Message Passing Multiprocessor Systems , 2004, Distributed and Parallel Databases.

[10]  Jinyan Li,et al.  Efficient mining of emerging patterns: discovering trends and differences , 1999, KDD '99.

[11]  Ran Wolff,et al.  A high-performance distributed algorithm for mining association rules , 2004, Knowledge and Information Systems.

[12]  Srinivasan Parthasarathy,et al.  Parallel Algorithms for Discovery of Association Rules , 1997, Data Mining and Knowledge Discovery.

[13]  Srinivasan Parthasarathy,et al.  Parallel Data Mining for Association Rules on Shared-memory Systems , 1998 .

[14]  Ran Wolff,et al.  Communication-efficient distributed mining of association rules , 2001, SIGMOD '01.

[15]  David A. Padua,et al.  A sampling-based framework for parallel data mining , 2005, PPoPP.

[16]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[17]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[18]  Osmar R. Zaïane,et al.  Fast parallel association rule mining without candidacy generation , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[19]  Jiawei Han,et al.  A fast distributed algorithm for mining association rules , 1996, Fourth International Conference on Parallel and Distributed Information Systems.

[20]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[21]  Srinivasan Parthasarathy,et al.  Cache-conscious Frequent Pattern Mining on a Modern Processor , 2005, VLDB.

[22]  Bart Goethals,et al.  Advances in frequent itemset mining implementations: report on FIMI'03 , 2004, SKDD.

[23]  Rakesh Agrawal,et al.  Parallel Mining of Association Rules , 1996, IEEE Trans. Knowl. Data Eng..

[24]  Rajeev Motwani,et al.  Beyond market baskets: generalizing association rules to correlations , 1997, SIGMOD '97.

[25]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[26]  Vipin Kumar,et al.  Scalable parallel data mining for association rules , 1997, SIGMOD '97.

[27]  Hannu Toivonen,et al.  Sampling Large Databases for Association Rules , 1996, VLDB.

[28]  David Wai-Lok Cheung,et al.  Asynchronous parallel algorithm for mining association rules on a shared-memory multi-processors , 1998, SPAA '98.

[29]  Shamkant B. Navathe,et al.  An Efficient Algorithm for Mining Association Rules in Large Databases , 1995, VLDB.

[30]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[31]  A. Akhmetova Discovery of Frequent Episodes in Event Sequences , 2006 .