Fast profile matching algorithms - A survey

Position-specific scoring matrices are a popular choice for modelling signals or motifs in biological sequences, both DNA and protein. Much effort has been dedicated to defining suitable scores and thresholds that increase the specificity of the model and the sensitivity of the search. It is quite surprising that, until very recently, little attention has been paid to the actual process of finding the matches of the matrices in a set of sequences once the score and the threshold have been fixed. In fact, most profile matching tools still rely on a simple sliding window approach to scan the input sequences. This can be very time-consuming when searching for hits of a large set of scoring matrices in a sequence database. In this paper we survey the approaches that have been proposed to speed up profile matching, based on statistical significance, multipattern matching, filtering, indexing data structures, matrix partitioning, the Fast Fourier Transform and data compression. These approaches improve the expected search time of profile matching and so lead to faster tools in practice.
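For orientation, the simple sliding window approach mentioned above can be sketched as follows. This is a minimal illustration rather than the implementation of any particular tool: the function name `scan_pssm`, the representation of the matrix as a list of per-column dictionaries, the toy matrix and the handling of unknown symbols are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one dict per matrix column, mapping a
# symbol to its score at that position.
Matrix = List[Dict[str, float]]


def scan_pssm(sequence: str, matrix: Matrix, threshold: float) -> List[Tuple[int, float]]:
    """Naive sliding-window scan: score every window of length m.

    Returns (start, score) for each window whose score reaches the
    threshold. Runs in O(n * m) time for a sequence of length n and a
    matrix of length m.
    """
    m = len(matrix)
    hits = []
    for start in range(len(sequence) - m + 1):
        score = 0.0
        for j in range(m):
            # Assumption: a symbol missing from a column scores -infinity,
            # so a window containing it can never be reported as a hit.
            score += matrix[j].get(sequence[start + j], float("-inf"))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy 3-column DNA matrix that favours the word "ACG" (made-up scores).
toy_matrix = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]

hits = scan_pssm("TTACGACGT", toy_matrix, threshold=6)
print(hits)
```

Every window is scored in full, regardless of how early it becomes clear that the threshold cannot be reached. Repeating this for each matrix in a large collection is what makes the baseline expensive, and it is the cost that the surveyed techniques aim to reduce.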
