TRIPS and TIDES: new algorithms for tree mining

Research in data mining has progressed from mining frequent itemsets to mining more general, structured patterns such as trees and graphs. In this paper, we address the problem of frequent subtree mining, which has proven useful in a wide range of applications, including bioinformatics, XML processing, computational linguistics, and web usage mining. We propose novel algorithms to mine frequent subtrees from a database of rooted trees. We evaluate two popular sequential encodings of trees as the basis for systematically generating and evaluating candidate patterns. The proposed approach is generic and can be used to mine embedded or induced subtrees that are labeled, unlabeled, ordered, unordered, or edge-labeled. Our algorithms are highly cache-conscious because they rely on compact, simple array-based data structures; L1 and L2 cache hit rates above 99% are typically observed. Experimental evaluation shows that our algorithms achieve speedups of up to several orders of magnitude on real datasets compared with state-of-the-art tree mining algorithms.
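To make the idea of a sequential tree encoding concrete, the following minimal sketch flattens a small rooted, ordered, labeled tree into two array-based forms: a depth-first sequence of (label, depth) pairs, and a Prüfer-style sequence built from a postorder numbering. The abstract does not name the two encodings it uses, so the choice of these two is an assumption; the `(label, children)` tuple representation and the function names are hypothetical and chosen only for illustration.

```python
# Assumed node representation (not from the paper): (label, [child, child, ...])

def depth_first_sequence(tree):
    """Return the preorder list of (label, depth) pairs for the tree."""
    out = []

    def visit(node, depth):
        label, children = node
        out.append((label, depth))
        for child in children:
            visit(child, depth + 1)

    visit(tree, 0)
    return out


def prufer_sequence(tree):
    """Number nodes 1..n in postorder and return (numbers, labels).

    Entry i (0-based) holds the postorder number and the label of the
    parent of node i + 1. With a postorder numbering, repeatedly deleting
    the smallest-numbered leaf removes nodes in the order 1, 2, ..., n - 1,
    so the sequence is simply the parents of those nodes.
    """
    parent_of = {}  # postorder number -> (parent number, parent label)
    counter = 0

    def visit(node):
        nonlocal counter
        label, children = node
        child_ids = [visit(child) for child in children]
        counter += 1  # this node's postorder number
        for cid in child_ids:
            parent_of[cid] = (counter, label)
        return counter

    root_id = visit(tree)
    numbers = [parent_of[i][0] for i in range(1, root_id)]
    labels = [parent_of[i][1] for i in range(1, root_id)]
    return numbers, labels


# Example: A has children B and E; B has children C and D.
example = ("A", [("B", [("C", []), ("D", [])]), ("E", [])])
print(depth_first_sequence(example))
print(prufer_sequence(example))
```

Both encodings store the tree as flat sequences that can be held in contiguous arrays, which is the kind of layout the abstract credits for cache-friendly behavior; how candidates are generated and counted over such sequences is not covered by this sketch.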
