Algorithmic Complexity of Protein Identification: Searching in Weighted Strings

We investigate a problem that arises in computational biology: given a constant-size alphabet A with a weight function µ: A → ℕ, find an efficient data structure and query algorithm for the following problem: for a string σ over A and a weight M ∈ ℕ, decide whether σ contains a substring with weight M (One-String Mass Finding Problem). If the answer is yes, we may in addition require a witness, i.e., indices i ≤ j such that the substring beginning at position i and ending at position j has weight M. We allow preprocessing of the string and measure efficiency in two parameters: the storage space required for the preprocessed data, and the running time of the query algorithm for a given M. We are interested in data structures and algorithms requiring subquadratic storage space and sublinear query time, where the input size is the length n of the input string. Among other results, we present two non-trivial efficient algorithms: Lookup solves the problem with O(n) space and time; Interval solves the problem for binary alphabets with O(n) storage space in O(log n) query time. Finally, we introduce other variants of the problem and sketch how our algorithms may be extended to these variants.
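To make the problem statement concrete, the following is a minimal sketch of a simple baseline: a sliding-window scan that answers one query in linear time with no preprocessing. It is not the Lookup or Interval algorithm from the paper. The function name `find_substring_with_weight`, the dictionary `mu`, and the 0-based inclusive indices are assumptions made for illustration, and the sketch assumes non-negative integer weights.

```python
from typing import Dict, Optional, Tuple


def find_substring_with_weight(
    sigma: str, mu: Dict[str, int], M: int
) -> Optional[Tuple[int, int]]:
    """Return 0-based inclusive indices (i, j) of a non-empty substring of
    sigma whose total weight is M, or None if no such substring exists.

    Baseline sketch only: assumes all weights in mu are non-negative.
    """
    i = 0          # left end of the current window
    total = 0      # weight of sigma[i..j]
    for j, ch in enumerate(sigma):
        total += mu[ch]
        # Shrink from the left while the window is too heavy.
        while total > M and i <= j:
            total -= mu[sigma[i]]
            i += 1
        # A witness must be a non-empty window.
        if total == M and i <= j:
            return (i, j)
    return None


# Hypothetical example: two-letter alphabet with weights 1 and 3.
mu = {"a": 1, "b": 3}
print(find_substring_with_weight("abba", mu, 6))  # witness for weight 6
print(find_substring_with_weight("abba", mu, 5))  # no substring has weight 5
```

Because this scan takes time linear in n per query, it serves only as a reference point for the sublinear query times the abstract aims for.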
