Fast Motif Search in Protein Sequence Databases

Regular expression pattern matching is widely used in computational biology. Searching a database of sequences for a motif (a simple regular expression) or its variations is an important interactive process that requires fast motif-matching algorithms. In this paper, we explore and evaluate various suffix-tree-based representations of the sequence database for two types of query problems for a given regular expression: 1) find the first match, and 2) find all matches. Answering Problem 1 quickly makes interactive motif exploration more effective. We propose a framework in which Problem 1 can be solved faster than with existing solutions without slowing down the solution of Problem 2. We apply several heuristics, both at the level of suffix tree creation, where they result in modified tree representations, and at the level of regular expression matching, where we search subtrees in a predefined order by simulating a deterministic finite automaton built from the given regular expression. The focus of our work is a method for faster retrieval of matches to PROSITE motifs (restricted regular expressions) from a protein sequence database. We show empirically the effectiveness of our solution on several real protein data sets.
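The general idea of walking a suffix structure while simulating an automaton can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: it uses an uncompressed suffix trie rather than a suffix tree, supports only fixed-length PROSITE elements (`x`, `[..]`, `{..}` and a single repeat count `(n)`), applies none of the tree-construction or subtree-ordering heuristics, and treats "first match" as the first hit the traversal reaches rather than the leftmost one. The names `parse_prosite`, `build_suffix_trie` and `search` are hypothetical.

```python
import re

AMINO = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def parse_prosite(pattern):
    """Expand a restricted PROSITE pattern into one allowed-residue set per position.

    Assumption: only fixed-length elements are handled; ranges such as x(2,4)
    and the anchors '<' and '>' are rejected.
    """
    positions = []
    for elem in pattern.rstrip(".").split("-"):
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\))?", elem)
        if m is None:
            raise ValueError(f"unsupported element: {elem!r}")
        core, repeat = m.group(1), int(m.group(2) or 1)
        if core == "x":
            allowed = set(AMINO)               # any residue
        elif core.startswith("["):
            allowed = set(core[1:-1])          # one of the listed residues
        elif core.startswith("{"):
            allowed = AMINO - set(core[1:-1])  # any residue except those listed
        else:
            allowed = {core}                   # a single literal residue
        positions.extend([allowed] * repeat)
    return positions


def build_suffix_trie(sequences):
    """Insert every suffix of every sequence into a plain (uncompressed) trie.

    Each node records the (sequence index, start offset) of all suffixes that
    pass through it. This takes quadratic space, so it only suits small inputs.
    """
    root = {"children": {}, "hits": []}
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "hits": []})
                node["hits"].append((seq_id, start))
    return root


def search(root, positions, first_only=False):
    """Walk the trie while tracking the automaton state.

    For a fixed-length motif the automaton is a linear chain, so the state is
    just the number of positions matched so far. Returns (seq_id, start) pairs.
    """
    results = []
    stack = [(root, 0)]
    while stack:
        node, state = stack.pop()
        if state == len(positions):            # accepting state reached
            if first_only:
                return node["hits"][:1]        # stop at the first hit found
            results.extend(node["hits"])
            continue
        for ch, child in node["children"].items():
            if ch in positions[state]:         # follow only edges the motif allows
                stack.append((child, state + 1))
    return results
```

For example, `search(build_suffix_trie(["MKCAAC", "GGCDECHH"]), parse_prosite("C-x(2)-C"))` reports one occurrence in each sequence, both starting at offset 2; passing `first_only=True` returns as soon as any accepting node is reached, which is the property that makes the first-match query cheaper than enumerating all matches.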
