A Binary Decision Diagram to discover low threshold support frequent itemsets

Discovering association rules that identify relationships among sets of items is an important problem in data mining. Finding frequent itemsets is computationally the most expensive step in association rule discovery and has therefore attracted significant research attention [1]. Discovery of frequently occurring subsets of items, called itemsets, is at the core of many data mining methods. Most previous studies adopt Apriori-like algorithms, which iteratively generate candidate itemsets and check their occurrence frequencies in the database. These approaches suffer from the high cost of repeated passes over the analyzed database. In this paper, we propose a new BDD-based (Binary Decision Diagram) data structure called TreeSupBDD. TreeSupBDD extends the idea behind the FP-Tree [9] and ITL-Tree [5] structures, aiming to improve storage compression and to allow frequent pattern mining without an "explicit" candidate itemset generation step. To reduce the cost of repeated database passes, we propose a novel method, called TreeSupBDD-Mine, which reduces the database activity of frequent itemset discovery algorithms. The idea of TreeSupBDD-Mine is to use a binary decision diagram and a tree to represent both the database and the frequent itemsets. The proposed method requires a single scan over the source database, both to create the associated tree and BDD and to check the supports of discovered itemsets. The originality of our work lies in the fact that the proposed algorithm extracts the frequent itemsets directly from the TreeSupBDD. The experiments we carried out showed very encouraging results and performance improvements. We extend the binary decision diagram structure to store transaction groups and propose a new method to discover frequent itemsets.
To study the trade-offs of the new representation of transactions in a binary decision diagram, we compare the performance of our algorithm with the fastest Apriori implementation [2] and the latest extension of FP-Growth [15]. We tested all the algorithms on different benchmark datasets. The performance study shows that the new algorithm significantly reduces the processing time for mining frequent itemsets from dense datasets that contain relatively long patterns, at low support thresholds. We discuss the performance results in detail, along with the strengths and limitations of our algorithm.
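
For readers unfamiliar with the Apriori-like baseline mentioned above, the following is a minimal Python sketch of level-wise candidate generation and support counting. It is not the TreeSupBDD-Mine method and is not taken from the paper. The function name `apriori_sketch`, the toy transactions, and the use of an absolute count for the minimum support are all assumptions made for illustration. It shows why such approaches need one pass over the database per itemset length.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    Illustrative only: min_support is an absolute count (an assumption),
    and the database is rescanned once per candidate length.
    """
    db = [frozenset(t) for t in transactions]
    # Level 1 candidates: every distinct item as a singleton itemset.
    candidates = {frozenset([item]) for t in db for item in t}
    frequent = {}
    k = 1
    while candidates:
        # One full pass over the database for this level.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets; keep only k-itemsets whose
        # (k-1)-subsets are all frequent (the Apriori pruning rule).
        candidates = {
            a | b
            for a in level
            for b in level
            if len(a | b) == k
            and all(frozenset(s) in level for s in combinations(a | b, k - 1))
        }
    return frequent


# Hypothetical toy database of five transactions.
toy_db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori_sketch(toy_db, min_support=3)
```

On this toy input the sketch scans the database three times (for candidates of length 1, 2, and 3). This repeated scanning is the cost that single-scan structures such as the one proposed here aim to avoid.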

[1]  Mohammed J. Zaki.  Scalable Algorithms for Association Mining , 2000, IEEE Trans. Knowl. Data Eng.

[2]  Shamkant B. Navathe,et al.  An Efficient Algorithm for Mining Association Rules in Large Databases , 1995, VLDB.

[3]  Randal E. Bryant,et al.  Graph-Based Algorithms for Boolean Function Manipulation , 1986, IEEE Transactions on Computers.

[4]  Carole E. Chaski,et al.  Empirical evaluations of language-based author identification techniques , 2001.

[5]  Hans van Halteren,et al.  Linguistic Profiling for Authorship Recognition and Verification , 2004, ACL.

[6]  Shlomo Argamon,et al.  Style mining of electronic messages for multiple authorship discrimination: first results , 2003, KDD '03.

[7]  Kenneth McGarry,et al.  A survey of interestingness measures for knowledge discovery , 2005, The Knowledge Engineering Review.

[8]  Ke Wang,et al.  Top Down FP-Growth for Association Rule Mining , 2002, PAKDD.

[9]  Frans Coenen,et al.  Algorithms for computing association rules using a partial-support tree , 2000, Knowl. Based Syst.

[10]  Stefanos Gritzalis,et al.  Effective identification of source code authors using byte-level information , 2006, ICSE.

[11]  Gösta Grahne,et al.  Fast algorithms for frequent itemset mining using FP-trees , 2005, IEEE Transactions on Knowledge and Data Engineering.

[12]  Raj P. Gopalan,et al.  TreeITL-Mine: Mining Frequent Itemsets Using Pattern Growth, Tid Intersection, and Prefix Tree , 2002, Australian Joint Conference on Artificial Intelligence.

[13]  Jian Pei,et al.  Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach , 2006, Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06).

[14]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD 2000.

[15]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[16]  Ansaf Salleb.  Recherche de motifs fréquents pour l'extraction de règles d'association et de caractérisation , 2003.

[17]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[18]  William John Teahan,et al.  A repetition based measure for verification of text collections and for text categorization , 2003, SIGIR.

[19]  Raj P. Gopalan,et al.  ITL-MINE: Mining Frequent Itemsets More Efficiently , 2002, FSKD.

[20]  Sanguthevar Rajasekaran,et al.  A transaction mapping algorithm for frequent itemsets mining , 2006.

[21]  Salvatore Orlando,et al.  Enhancing the Apriori Algorithm for Frequent Set Counting , 2001, DaWaK.

[22]  N. Cercone,et al.  Automatic detection and rating of dementia of Alzheimer type through lexical analysis of spontaneous speech , 2005, IEEE International Conference Mechatronics and Automation, 2005.

[23]  Christian Borgelt,et al.  Efficient Implementations of Apriori and Eclat , 2003.