Finding Significant Matches of Position Weight Matrices in Linear Time

Position weight matrices are an important method for modeling signals or motifs in biological sequences, both in DNA and protein contexts. In this paper, we present fast algorithms for the problem of finding significant matches of such matrices. Our algorithms are of the online type, and they generalize classical multipattern matching, filtering, and superalphabet techniques of combinatorial string matching to the problem of weight matrix matching. Several variants of the algorithms are developed, including multiple-matrix extensions that search for several matrices in one scan through the sequence database. An experimental performance evaluation compares the new techniques against each other as well as against some other online and index-based algorithms proposed in the literature. Compared to the brute-force O(mn) approach, where m is the matrix length and n is the sequence length, our solutions can be faster by a factor that is proportional to m. Our multiple-matrix filtration algorithm performed best in the experiments. On a current PC, this algorithm finds significant matches (p = 0.0001) of the 123 JASPAR matrices in the human genome in about 18 minutes.
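
For orientation, the brute-force O(mn) baseline mentioned above can be sketched as follows. This is a minimal illustration of the baseline only, not one of the faster algorithms of the paper. The function name `brute_force_pwm_scan`, the representation of a matrix as a list of per-position score dictionaries, and the toy scores are assumptions made for this example; in practice the score threshold would be derived from the chosen p-value, a step that is omitted here.

```python
from typing import Dict, List, Tuple


def brute_force_pwm_scan(
    seq: str,
    matrix: List[Dict[str, float]],
    threshold: float,
) -> List[Tuple[int, float]]:
    """Score every length-m window of seq against the matrix.

    matrix[j][c] is the score of symbol c at matrix position j
    (hypothetical layout chosen for this sketch). Returns the
    (start, score) pairs whose score reaches the threshold.
    Running time is O(mn) for matrix length m and sequence length n.
    """
    m = len(matrix)
    hits: List[Tuple[int, float]] = []
    for i in range(len(seq) - m + 1):
        # Sum the position-specific scores over the window seq[i:i+m].
        score = 0.0
        for j in range(m):
            score += matrix[j][seq[i + j]]
        if score >= threshold:
            hits.append((i, score))
    return hits


if __name__ == "__main__":
    # Toy matrix of length 3 that favours the word "ACG" (made-up scores).
    toy = [
        {"A": 2, "C": -1, "G": -1, "T": -1},
        {"A": -1, "C": 2, "G": -1, "T": -1},
        {"A": -1, "C": -1, "G": 2, "T": -1},
    ]
    print(brute_force_pwm_scan("TTACGACGA", toy, threshold=6))
```

Every window costs m score lookups regardless of how poorly it starts, which is the overhead that filtering and superalphabet techniques aim to reduce.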
