Paper information - Algorithmic complexity of protein identification: combinatorics of weighted strings

Abstract: We investigate a problem which arises in computational biology: Given a constant-size alphabet A with a weight function μ : A → N, find an efficient data structure and query algorithm solving the following problem: For a string σ over A and a weight M ∈ N, decide whether σ contains a substring with weight M, where the weight of a string is the sum of the weights of its letters (One-String Mass Finding Problem). If the answer is yes, then we may in addition require a witness, i.e., indices i ≤ j such that the substring beginning at position i and ending at position j has weight M. We allow preprocessing of the string and measure efficiency in two parameters: storage space required for the preprocessed data and running time of the query algorithm for given M. We are interested in data structures and algorithms requiring subquadratic storage space and sublinear query time, where we measure the input size as the length n of the input string σ. Among others, we present two non-trivial efficient algorithms: Lookup solves the problem with O(n) storage space and O(n / log n) time; Interval solves the problem for binary alphabets with O(n) storage space in O(log n) query time. We introduce other variants of the problem and sketch how our algorithms may be extended for these variants. Finally, we discuss combinatorial properties of weighted strings.
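To make the problem statement concrete, here is a minimal sketch of a naive baseline: a linear scan with a sliding window that uses no preprocessing and O(n) query time. It is not the Lookup or Interval algorithm from the paper, whose details the abstract does not give. The function name `find_substring_with_weight`, the dictionary representation of μ, and the 0-based indexing are assumptions made for illustration; the sketch also assumes all letter weights and M are positive integers.

```python
def find_substring_with_weight(sigma, mu, M):
    """Naive baseline (not the paper's Lookup/Interval algorithms).

    Returns a witness (i, j), 0-based and inclusive, such that the
    substring sigma[i..j] has weight M, or None if no such substring
    exists. Assumes every weight in mu and M itself are positive.
    """
    left = 0      # left end of the current window
    window = 0    # weight of sigma[left..right]
    for right, letter in enumerate(sigma):
        window += mu[letter]
        # Positive weights make the window weight monotone, so we can
        # shrink from the left whenever the window is too heavy.
        while window > M and left <= right:
            window -= mu[sigma[left]]
            left += 1
        if window == M and left <= right:
            return (left, right)
    return None


# Hypothetical example: binary alphabet with mu(a) = 1, mu(b) = 3.
mu = {"a": 1, "b": 3}
print(find_substring_with_weight("abba", mu, 6))  # witness for "bb"
print(find_substring_with_weight("abba", mu, 5))  # no substring of weight 5
```

This scan answers each query in time linear in n, which is the bound the sublinear-query-time data structures described in the abstract aim to beat.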
