Parallel Processing of Multiple Pattern Matching Algorithms for Biological Sequences: Methods and Performance Results

Multiple pattern matching is the computationally intensive kernel of many applications, including information retrieval, intrusion detection systems, web and spam filters, and virus scanners. It is particularly important in genomics, where the algorithms are frequently used to locate nucleotide or amino acid sequence patterns in biological sequence databases. For example, when proteomics data is used for genome annotation in a process called proteogenomic mapping (Jaffe et al., 2004), a set of peptide identifications obtained using mass spectrometry is matched against a target genome translated in all six reading frames.

Given a sequence database (or text) $T = t_1 t_2 \dots t_n$ of length $n$ and a finite set of $r$ patterns $P = \{p^1, p^2, \dots, p^r\}$, where each $p^i$ is a string $p^i = p^i_1 p^i_2 \dots p^i_m$ of length $m$ over a finite character set $\Sigma$, the multiple pattern matching problem is to locate all occurrences of any of the patterns in the sequence database. The naive solution to this problem is to perform $r$ separate searches with one of the sequential algorithms (Navarro & Raffinot, 2002). While frequently used in the past, this technique is inefficient when a large pattern set is involved. The aim of all multiple pattern matching algorithms is to locate the occurrences of all patterns with a single pass of the sequence database. These algorithms are based on single-pattern matching algorithms, with some of their functions generalized to process multiple patterns simultaneously during the preprocessing phase, generally with the use of trie structures or hashing.

Multiple pattern matching is widely used in computational biology for a variety of pattern matching tasks. Brudno and Morgenstern used a simplified version of the Aho-Corasick algorithm to identify anchor points in their CHAOS algorithm for fast alignment of large genomic sequences (Brudno & Morgenstern, 2002; Brudno et al., 2004). Hyyro et al. demonstrated that Aho-Corasick outperforms other algorithms for locating unique oligonucleotides in the yeast genome (Hyyro et al., 2005). The SITEBLAST algorithm (Michael et al., 2005) employs the Aho-Corasick algorithm to retrieve all motif anchors for a local alignment procedure for genomic sequences that makes use of prior knowledge. Buhler
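To make the contrast between the naive solution and a single-pass, trie-based approach concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the implementation evaluated in this work: the function names (`naive_multi_search`, `build_automaton`, `single_pass_search`) are hypothetical, the automaton follows the textbook Aho-Corasick construction (a trie with failure links), and, unlike the problem definition above, the patterns are not required to share a common length m.

```python
from collections import deque


def naive_multi_search(text, patterns):
    """Baseline: scan the text once per pattern (r separate searches)."""
    hits = []
    for p in patterns:
        start = text.find(p)
        while start != -1:
            hits.append((start, p))
            start = text.find(p, start + 1)  # allow overlapping matches
    return sorted(hits)


def build_automaton(patterns):
    """Preprocessing: build a trie of the patterns and add failure links."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised on reaching each state
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(p)

    # Breadth-first traversal: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target (proper suffixes).
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def single_pass_search(text, patterns):
    """Search phase: report every occurrence in one pass over the text."""
    goto, fail, out = build_automaton(patterns)
    hits, state = [], 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            hits.append((i - len(p) + 1, p))
    return sorted(hits)
```

Both functions return the same sorted list of `(position, pattern)` pairs; the difference is that the naive version reads the text r times, whereas the automaton version reads each text character once after a preprocessing step whose cost depends only on the pattern set.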

[1]  R. Nigel Horspool,et al.  Practical fast searching in strings , 1980, Softw. Pract. Exp..

[2]  Jacob D. Jaffe,et al.  Proteogenomic mapping as a complementary method to perform genome annotation , 2004, Proteomics.

[3]  Richard M. Karp,et al.  Efficient Randomized Pattern-Matching Algorithms , 1987, IBM J. Res. Dev..

[4]  Udi Manber,et al.  A Fast Algorithm for Multi-Pattern Searching , 1999 .

[5]  Al Geist,et al.  PVM (Parallel Virtual Machine) , 2011, Encyclopedia of Parallel Computing.

[6]  Martin Vingron,et al.  SITEBLAST-rapid and sensitive local alignment of genomic sequences employing motif anchors , 2005, Bioinform..

[7]  Yanggon Kim,et al.  A Fast Multiple String-Pattern Matching Algorithm , 1999 .

[8]  Mireille Régnier,et al.  Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules , 2007, Algorithms for Molecular Biology.

[9]  Rajkumar Buyya,et al.  High Performance Cluster Computing , 1999 .

[10]  Gonzalo Navarro,et al.  A Bit-Parallel Approach to Suffix Automata: Fast Extended String Matching , 1998, CPM.

[11]  Jeremy Buhler,et al.  Designing seeds for similarity search in genomic DNA , 2003, RECOMB '03.

[12]  Michael Brudno,et al.  The CHAOS/DIALIGN WWW server for multiple alignment of genomic sequences , 2004, Nucleic Acids Res..

[13]  Konstantinos G. Margaritis,et al.  Parallel implementations for string matching problem on a cluster of distributed workstations , 2002, Neural Parallel Sci. Comput..

[14]  Bradford Nichols,et al.  Pthreads programming , 1996 .

[15]  Konstantinos G. Margaritis,et al.  Performance evaluation of load balancing strategies for approximate string matching application on an MPI cluster of heterogeneous workstations , 2003, Future Gener. Comput. Syst..

[16]  Hermann A. Maurer.  Proceedings of the 6th Colloquium on Automata, Languages and Programming , 1979 .

[17]  U. Manber,et al.  Approximate Multiple String Search , 1996 .

[18]  Udi Manber,et al.  Approximate Multiple Strings Search , 1996, CPM.

[19]  Rajkumar Buyya,et al.  High Performance Cluster Computing: Programming and Applications , 1999 .

[20]  Wojciech Plandowski,et al.  Speeding up two string-matching algorithms , 2005, Algorithmica.

[21]  Jorma Tarhio,et al.  Multipattern string matching with q-grams , 2007, ACM J. Exp. Algorithmics.

[22]  Maxime Crochemore,et al.  Factor Oracle: A New Structure for Pattern Matching , 1999, SOFSEM.

[23]  Xin-Min Tian,et al.  Intel OpenMP C++/Fortran Compiler for Hyper-Threading Technology: Implementation and Performance , 2002 .

[24]  Michael Brudno,et al.  Fast and sensitive alignment of large genomic sequences , 2002, Proceedings. IEEE Computer Society Bioinformatics Conference.

[25]  Beate Commentz-Walter,et al.  A String Matching Algorithm Fast on the Average , 1979, ICALP.

[26]  Wojciech Plandowski,et al.  Fast Practical Multi-Pattern Matching , 1999, Inf. Process. Lett..

[27]  Alejandro Duran,et al.  Is the Schedule Clause Really Necessary in OpenMP? , 2003, WOMPAT.

[28]  Jack Dongarra,et al.  PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing , 1995 .

[29]  Peter Willett,et al.  Efficiency of text scanning in bibliographic databases using microprocessor-based, multiprocessor networks , 1988, J. Inf. Sci..

[30]  Leena Salmela,et al.  Improved Algorithms for String Searching Problems , 2009 .

[31]  Weng-Fai Wong,et al.  Generating hardware from OpenMP programs , 2006, 2006 IEEE International Conference on Field Programmable Technology.

[32]  Jack Dongarra,et al.  MPI: The Complete Reference , 1996 .

[33]  Gonzalo Navarro,et al.  Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences , 2002 .

[34]  D. Marr,et al.  Hyper-Threading Technology Architecture and Microarchitecture , 2002 .

[35]  Konstantinos G. Margaritis,et al.  Parallel Implementation of Exact Two Dimensional Pattern Matching Algorithms using MPI and OpenMP , 2009 .

[36]  Alfred V. Aho,et al.  Efficient string matching , 1975, Commun. ACM.

[37]  Wojciech Rytter,et al.  Text Algorithms , 1994 .

[38]  Martti Juhola,et al.  On exact string matching of unique oligonucleotides , 2005, Comput. Biol. Medicine.

[39]  Wei Zhang,et al.  MDH: A High Speed Multi-phase Dynamic Hash String Matching Algorithm for Large-Scale Pattern Set , 2007, ICICS.

[40]  Konstantinos G. Margaritis,et al.  Experimental Results on Multiple Pattern Matching Algorithms for Biological Sequences , 2011, BIOINFORMATICS.

[41]  Gaston H. Gonnet,et al.  A new approach to text searching , 1989, SIGIR '89.