Scalable APRIORI-Based Frequent Pattern Discovery

Frequent pattern discovery, the task of finding sets of items that frequently occur together in a dataset, has been at the core of the field of data mining for the past sixteen years. In that time, the size of datasets has grown much faster than the ability of existing algorithms to handle them. Consequently, improvements are needed. In this paper we take the classic algorithm for the problem, Apriori, and, by adding a vertical sort, drastically improve its performance characteristics when processing very large datasets. We use the large benchmark dataset webdocs from the FIMI 2004 conference to compare our performance against several state-of-the-art implementations. We demonstrate equal efficiency with lower memory usage at all support thresholds, as well as the ability to mine support thresholds as yet unattempted in the literature. We also indicate how this work can be extended to achieve yet more impressive results.
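
For readers unfamiliar with the baseline, the following is a minimal sketch of the classic level-wise Apriori procedure that the abstract refers to. It is not the implementation described in this paper and does not include the vertical sort; the function name `apriori`, its parameters, and the toy transactions are assumptions made purely for illustration, and `min_support` is taken to be an absolute transaction count.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least
    `min_support` transactions (hypothetical helper, textbook Apriori)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: join frequent (k-1)-itemsets, then prune any
        # candidate that has an infrequent (k-1)-subset (the Apriori property).
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)

        # Support counting: one pass over the data per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Toy data (assumed for illustration only).
toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
result = apriori(toy, min_support=3)
```

The support-counting pass in this naive form compares every candidate against every transaction, which is the kind of cost that becomes a bottleneck on very large datasets such as webdocs.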
