Pattern matching with wildcards and gap-length constraints based on a centrality-degree graph

Pattern matching with wildcards is a challenging problem in many domains, such as bioinformatics and information retrieval. This paper focuses on the variant with gap-length constraints and the one-off condition, under which each character of the sequence can be used at most once across all occurrences of a pattern. Finding an optimal solution to this problem is difficult. We propose WON-Net, a graph structure (a network with a weighted centralization measure based on each node's centrality-degree; see Definition 4.1), to obtain all candidate matching solutions. We then design the WOW algorithm (pattern matching with wildcards based on WON-Net), which uses the weighted centralization measure based on nodes' centrality-degrees. We also propose an adjustment mechanism that balances solution quality against running time, and define a new variant of WOW, called WOW-δ. Theoretical analysis and experiments demonstrate that WOW and WOW-δ are more effective than their peers. In addition, the algorithms show an advantage in running time through parallel processing.
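To make the problem statement concrete, the following is a minimal sketch of matching under gap-length constraints and the one-off condition. It is not the WOW algorithm and does not build a WON-Net; it is a simple leftmost-first greedy baseline that gives no optimality guarantee. The function names (`extend`, `greedy_one_off`) and the pattern representation (a string of characters plus a list of `(min, max)` wildcard counts between consecutive characters) are assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical representation: (min_wildcards, max_wildcards) between
# two consecutive pattern characters.
Gap = Tuple[int, int]


def extend(seq: str, chars: str, gaps: List[Gap],
           used: Set[int], occ: List[int]) -> Optional[List[int]]:
    """Depth-first search that completes the partial occurrence `occ`.

    Returns the list of matched positions, or None if no completion exists
    that avoids positions already in `used`.
    """
    j = len(occ)                      # index of the next pattern character
    if j == len(chars):
        return occ
    lo, hi = gaps[j - 1]              # gap between chars[j-1] and chars[j]
    prev = occ[-1]
    last = min(prev + 1 + hi, len(seq) - 1)
    for pos in range(prev + 1 + lo, last + 1):
        if pos not in used and seq[pos] == chars[j]:
            found = extend(seq, chars, gaps, used, occ + [pos])
            if found is not None:
                return found
    return None


def greedy_one_off(seq: str, chars: str, gaps: List[Gap]) -> List[List[int]]:
    """Collect occurrences left to right, never reusing a sequence position."""
    used: Set[int] = set()            # enforces the one-off condition
    occurrences: List[List[int]] = []
    for start, c in enumerate(seq):
        if c != chars[0] or start in used:
            continue
        occ = extend(seq, chars, gaps, used, [start])
        if occ is not None:
            used.update(occ)
            occurrences.append(occ)
    return occurrences


# Pattern "a", then 0 to 1 wildcards, then "b", matched against "aabb".
print(greedy_one_off("aabb", "ab", [(0, 1)]))
```

Because this sketch commits to the first occurrence it finds from each starting position, it can miss larger sets of occurrences; selecting among all candidate solutions is the harder task that the abstract attributes to WON-Net and WOW.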
