Discovery of Useful Patterns from Tree-Structured Documents with Label-Projected Database

Because its tree structure is highly flexible, XML can capture most kinds of data and serves as a substrate in which almost any other data structure can be represented. With the continuous growth of XML tree data in electronic environments, discovering useful knowledge from such data has become a major research area in the information retrieval community. The most common approach to this task is to extract frequently occurring subtree patterns from a set of trees. However, because the number of frequent subtrees grows exponentially with the size of the trees, a more practical and scalable alternative is required: the discovery of maximal frequent subtrees. The maximal frequent subtrees retain all the useful information, yet there are far fewer of them than frequent subtrees. Handling maximal frequent subtrees is an interesting challenge and is the core of this paper. As far as we know, this is one of the first studies to discover maximal frequent subtrees directly, without generating any candidate sets and without a separate step for pruning useless subtrees. To this end, we define and use a new type of projected database, the label-projected database, to represent XML tree data efficiently. It significantly improves the entire process of mining maximal frequent subtree patterns. We study the performance and scalability of the proposed approach through experiments on synthetic datasets.
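
To make the difference between frequent and maximal frequent patterns concrete, the following is a minimal sketch under strong simplifying assumptions. Each tree is flattened to its set of (parent label, child label) edges, and a pattern counts as frequent if it is contained in at least `min_support` trees. The names `TREES`, `frequent_patterns`, and `maximal_patterns` are hypothetical. The brute-force enumeration is precisely the candidate-generation style that the paper aims to avoid; it is used here only to show how few maximal patterns remain. It is not the label-projected method itself.

```python
from itertools import combinations

# Toy database (an assumption for illustration): each "tree" is flattened to
# its set of (parent_label, child_label) edges. All edges share the root
# "book", so every edge subset below is itself a small rooted tree.
TREES = [
    {("book", "title"), ("book", "author"), ("book", "price")},
    {("book", "title"), ("book", "author"), ("book", "year")},
    {("book", "title"), ("book", "author"), ("book", "price")},
]


def frequent_patterns(trees, min_support):
    """Enumerate every non-empty edge set contained in >= min_support trees."""
    edges = sorted(set().union(*trees))
    found = []
    for size in range(1, len(edges) + 1):
        for combo in combinations(edges, size):
            pattern = frozenset(combo)
            # Support = number of trees that contain every edge of the pattern.
            support = sum(1 for tree in trees if pattern <= tree)
            if support >= min_support:
                found.append(pattern)
    return found


def maximal_patterns(patterns):
    """Keep patterns that are not a proper subset of another frequent pattern."""
    return [p for p in patterns if not any(p < q for q in patterns)]


frequent = frequent_patterns(TREES, min_support=2)
maximal = maximal_patterns(frequent)
print(len(frequent), "frequent patterns;", len(maximal), "maximal")
```

In this toy database every frequent pattern is a subset of the single maximal one, which is why the maximal set can stand in for the full frequent set while being much smaller.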
