A Novel Algorithm for Pattern Matching Based on Modified Push-Down Automata

In this paper we propose a new algorithm called MEPda (Motif Extraction algorithm based on Push-down automata) to solve the problem of finding patterns that contain loops. Such loop-patterns, or loop-motifs, are well known and widely used in many domains, especially mathematics and bioinformatics. MEPda is designed to find these patterns by using a pushdown automaton as the matching mechanism, together with a counter that verifies the accepted length of each loop in an optimistic manner. The results obtained with MEPda show high accuracy and a much reduced runtime for finding patterns containing loops, compared with five baselines: a pushdown-automaton-based algorithm without a counter, a regular-expression-based algorithm, the Aho-Corasick algorithm, the Knuth-Morris-Pratt (KMP) algorithm, and the MoTeX algorithm.
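
To make the stack-plus-counter idea concrete, the following Python fragment is a minimal sketch and not the MEPda algorithm itself, whose details are not given here. It assumes a simplified motif of the form prefix, then a loop unit repeated between `min_reps` and `max_reps` times, then suffix. The function name `find_loop_motif` and all of its parameters are hypothetical, and the loop is only a loose illustration of the approach rather than a formal pushdown automaton.

```python
def find_loop_motif(text, prefix, unit, suffix, min_reps, max_reps):
    """Return (start, end) spans matching prefix + unit*k + suffix,
    with min_reps <= k <= max_reps. Hypothetical helper for illustration."""
    if not unit:
        raise ValueError("loop unit must be non-empty")
    matches = []
    for start in range(len(text)):
        if not text.startswith(prefix, start):
            continue
        pos = start + len(prefix)
        stack = []   # one marker pushed per loop iteration consumed
        count = 0    # counter tracking the current loop length
        # Consume as many loop units as the upper bound allows.
        while count < max_reps and text.startswith(unit, pos):
            stack.append(unit)
            count += 1
            pos += len(unit)
        # Accept only while the counter is within bounds; pop to back off.
        while count >= min_reps:
            if text.startswith(suffix, pos):
                matches.append((start, pos + len(suffix)))
                break
            if count == 0:
                break
            stack.pop()
            count -= 1
            pos -= len(unit)
    return matches


if __name__ == "__main__":
    # Assumed example motif: "A", then "CG" repeated 2 to 4 times, then "T".
    print(find_loop_motif("TTACGCGCGTAA", "A", "CG", "T", 2, 4))
```

In this sketch the counter rejects a candidate as soon as the loop length falls outside the allowed range, so the suffix is never tested for loops that are too short.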
