Discovering Minimal Infrequent Structures from XML Documents

More and more data (documents) are wrapped in XML format. Mining these documents involves mining the corresponding XML structures. However, the semi-structured (tree structured) XML makes it somewhat difficult for traditional data mining algorithms to work properly. Recently, several new algorithms were proposed to mine XML documents. These algorithms mainly focus on mining frequent tree structures from XML documents. However, none of them was designed for mining infrequent structures which are also important in many applications, such as query processing and identification of exceptional cases. In this paper, we consider the problem of identifying infrequent tree structures from XML documents. Intuitively, if a tree structure is infrequent, all tree structures that contain this subtree is also infrequent. So, we propose to consider the minimal infrequent structure (MIS), which is an infrequent structure while all proper subtrees of it are frequent. We also derive a level-wise mining algorithm that makes use of the SG-tree (signature tree) and some effective pruning techniques to efficiently discover all MIS. We validate the efficiency and feasibility of our methods through experiments on both synthetic and real data.

[1]  Hannu Toivonen,et al.  Sampling Large Databases for Association Rules , 1996, VLDB.

[2]  Ke Wang,et al.  Discovering Structural Association of Semistructured Data , 2000, IEEE Trans. Knowl. Data Eng..

[3]  R. Jain,et al.  Records , 1973, Tempo.

[4]  Mohammed J. Zaki Efficiently mining frequent trees in a forest , 2002, KDD.

[5]  George Karypis,et al.  Frequent subgraph discovery , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[6]  Jiawei Han,et al.  Maintenance of discovered association rules in large databases: an incremental updating technique , 1996, Proceedings of the Twelfth International Conference on Data Engineering.

[7]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[8]  Hannu Toivonen,et al.  Finding Frequent Substructures in Chemical Compounds , 1998, KDD.

[9]  Nikos Mamoulis,et al.  Similarity search in sets and categorical data using the signature tree , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[10]  Mong-Li Lee,et al.  Efficient Mining of XML Query Patterns for Caching , 2003, VLDB.

[11]  Roberto J. Bayardo,et al.  Efficiently mining long patterns from databases , 1998, SIGMOD '98.

[12]  Hiroki Arimura,et al.  Efficient Substructure Discovery from Large Semi-Structured Data , 2001, IEICE Trans. Inf. Syst..

[13]  Eiichi Tanaka,et al.  The Tree-to-Tree Editing Problem , 1988, Int. J. Pattern Recognit. Artif. Intell..