Indexing Evolving Databases for Itemset Mining

Research in data mining initially focused on defining efficient algorithms for the computationally intensive knowledge extraction task (i.e., itemset mining). The data to be analyzed was possibly extracted from the DBMS and stored in binary files. Approaches that mine flat-file data require large amounts of memory and do not scale efficiently to large databases. Better memory management can be achieved by integrating the data mining algorithm into the kernel of the database management system. Furthermore, most data mining algorithms deal with “static” datasets (i.e., datasets that do not change over time). This chapter presents a novel index, called I-Forest, to support data mining activities on evolving databases, whose content is periodically updated through the insertion (or deletion) of data blocks. I-Forest is a covering index that represents transactional blocks in a succinct form and supports different kinds of analysis. Time and support constraints (e.g., “analyze frequent quarterly data”) may be enforced during the extraction phase. The I-Forest index has been implemented in the PostgreSQL open source DBMS and exploits its physical-level access methods. Experiments on both sparse and dense data distributions show the efficiency of the proposed approach, which is always comparable with, and for low support thresholds faster than, the Prefix-Tree algorithm accessing static data on flat file.
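To make the notion of time and support constraints over data blocks concrete, the following is a minimal sketch of the kind of query described above. It is not the I-Forest index and does not use PostgreSQL: it is a brute-force count over in-memory blocks. The block labels, the sample transactions, and the function name `frequent_itemsets` are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical evolving database: each block is (period label, transactions).
# Blocks can be appended or dropped as the database evolves.
blocks = [
    ("2004-Q1", [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]),
    ("2004-Q2", [{"a", "b"}, {"b", "c"}]),
    ("2004-Q3", [{"c"}, {"a", "c"}]),
]


def frequent_itemsets(blocks, periods, min_support):
    """Return itemsets whose relative support in the selected blocks
    is at least min_support (a fraction in [0, 1]).

    Brute-force enumeration, exponential in transaction length;
    suitable only for tiny illustrative inputs.
    """
    # Time constraint: keep only transactions from the requested blocks.
    selected = [t for label, txs in blocks if label in periods for t in txs]

    # Count every non-empty subset of every selected transaction.
    counts = Counter()
    for t in selected:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[combo] += 1

    # Support constraint: keep itemsets that meet the threshold.
    threshold = min_support * len(selected)
    return {s: c for s, c in counts.items() if c >= threshold}


# Analyze the first two quarters with a 60% support threshold.
result = frequent_itemsets(blocks, {"2004-Q1", "2004-Q2"}, 0.6)
```

An index such as the one the abstract describes would presumably avoid rescanning and re-enumerating the raw transactions for every query; the sketch only shows what the constraints select, not how to evaluate them efficiently.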
