An Efficient Algorithm for Mining String Databases Under Constraints

We study the problem of mining substring patterns from string databases. Patterns are selected by a conjunction of monotonic and anti-monotonic predicates. We introduce a novel algorithm for discovering such patterns, based on the previously introduced version space tree data structure. The algorithm requires only one database scan, which makes it highly scalable and applicable in distributed environments, where the data are not necessarily stored in local memory or on a local disk. We compare it experimentally with a previously introduced algorithm in the same setting.
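To make the constraint setting concrete, the following is a minimal brute-force sketch of selecting substrings by a conjunction of one anti-monotonic and one monotonic predicate. It is not the version space tree algorithm described above, and it scans the data many times. The abstract does not name specific predicates, so the choice of a minimum-frequency threshold on one database and a maximum-frequency threshold on another is an assumption, as are all names (`substrings`, `frequency`, `mine_substrings`, `pos`, `neg`).

```python
from typing import Iterable, List, Set


def substrings(s: str) -> Set[str]:
    """Return all non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def frequency(pattern: str, database: Iterable[str]) -> int:
    """Count the database strings that contain pattern as a substring."""
    return sum(1 for s in database if pattern in s)


def mine_substrings(pos: List[str], neg: List[str],
                    min_pos: int, max_neg: int) -> Set[str]:
    """Brute-force illustration (hypothetical helper, assumes min_pos >= 1).

    Anti-monotonic predicate: frequency(p, pos) >= min_pos.
        If p satisfies it, every substring of p does too.
    Monotonic predicate: frequency(p, neg) <= max_neg.
        If p satisfies it, every superstring of p does too.
    """
    # With min_pos >= 1, any solution must occur in pos, so the
    # substrings of pos are a complete candidate set.
    candidates: Set[str] = set()
    for s in pos:
        candidates |= substrings(s)
    return {
        p for p in candidates
        if frequency(p, pos) >= min_pos and frequency(p, neg) <= max_neg
    }


# Toy data, invented for illustration only.
pos = ["abcd", "bcde", "xbcy"]
neg = ["ab", "cd"]
result = mine_substrings(pos, neg, min_pos=2, max_neg=0)
print(sorted(result))  # ['bc', 'bcd']
```

In this toy run, `b`, `c`, `d`, and `cd` are frequent in `pos` but also occur in `neg`, so the monotonic predicate rejects them. Only `bc` and `bcd` satisfy both predicates.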
