Algorithmic complexity of protein identification: combinatorics of weighted strings

Abstract. We investigate a problem that arises in computational biology: given a constant-size alphabet A with a weight function μ : A → N, find an efficient data structure and query algorithm for the following problem. For a string σ over A and a weight M ∈ N, decide whether σ contains a substring of weight M, where the weight of a string is the sum of the weights of its letters (ONE-STRING MASS FINDING PROBLEM). If the answer is yes, we may additionally require a witness, i.e., indices i ⩽ j such that the substring beginning at position i and ending at position j has weight M. We allow preprocessing of the string and measure efficiency in two parameters: the storage space required for the preprocessed data and the running time of the query algorithm for a given M. We are interested in data structures and algorithms that require subquadratic storage space and sublinear query time, where the input size is the length n of the input string σ. Among others, we present two non-trivial efficient algorithms: LOOKUP solves the problem with O(n) storage space and O(n / log n) query time; INTERVAL solves the problem for binary alphabets with O(n) storage space and O(log n) query time. We introduce other variants of the problem and sketch how our algorithms may be extended to these variants. Finally, we discuss combinatorial properties of weighted strings.
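To make the problem statement concrete, the following minimal Python sketch shows the two trivial extremes that the space and time bounds above are measured against: a linear-time scan with no preprocessing, and a precomputed set of all substring weights that may need quadratic space. It is not the LOOKUP or INTERVAL algorithm of the paper. The function names, the 0-based inclusive witness indices, and the assumption that every letter weight is strictly positive are illustrative choices, not taken from the source.

```python
from typing import Dict, Optional, Set, Tuple


def find_weight_linear(sigma: str, mu: Dict[str, int], M: int) -> Optional[Tuple[int, int]]:
    """Sliding-window scan: O(n) query time, no preprocessing.

    Returns a witness (i, j), 0-based and inclusive, such that
    sigma[i..j] has weight M, or None if no such substring exists.
    Assumes all weights in mu are strictly positive (an assumption
    of this sketch), so the window weight grows with its length.
    """
    if M <= 0:
        return None
    left = 0
    total = 0
    for right, letter in enumerate(sigma):
        total += mu[letter]
        # Shrink the window from the left while it is too heavy.
        while total > M:
            total -= mu[sigma[left]]
            left += 1
        if total == M:
            return (left, right)
    return None


def all_substring_weights(sigma: str, mu: Dict[str, int]) -> Set[int]:
    """Preprocessing for constant-time membership queries.

    Stores every substring weight via prefix sums; this can take
    space quadratic in n, which is what the paper aims to avoid.
    """
    prefix = [0]
    for letter in sigma:
        prefix.append(prefix[-1] + mu[letter])
    return {
        prefix[j] - prefix[i]
        for i in range(len(prefix))
        for j in range(i + 1, len(prefix))
    }


if __name__ == "__main__":
    mu = {"a": 2, "b": 3}          # hypothetical weight function
    sigma = "abbab"                # letter weights: 2, 3, 3, 2, 3
    print(find_weight_linear(sigma, mu, 6))   # (1, 2), the substring "bb"
    print(find_weight_linear(sigma, mu, 7))   # None
    print(sorted(all_substring_weights(sigma, mu)))
```

The scan answers a query in O(n) time with O(1) extra space, while the precomputed set answers membership quickly but yields no witness unless indices are stored alongside each weight. The algorithms described in the abstract sit between these two extremes.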
