Statistics of cleavage fragments in random weighted strings

Peptide mass fingerprinting is an important technique for identifying a protein from the masses of its fragments, which are obtained by mass spectrometry after enzymatic fragmentation: an experimental mass fingerprint is compared with, or aligned to, several reference fingerprints obtained from protein databases by in-silico digestion. Recently, much attention has been given to the questions of how to score such an alignment of mass spectra and how to evaluate its significance; results have been developed mostly from a combinatorial perspective. In particular, existing methods generally do not capture (or capture only at the price of a combinatorial explosion) the fact that the same amino acid can have different masses because of, e.g., isotopic distributions or variable chemical modifications. We offer several new contributions to the field. We introduce the notion of a probabilistically weighted alphabet, in which each character can have different masses according to a specified probability distribution, and the notion of a random weighted string as a fundamental model for a random protein. We then develop a general computational framework, which we call weighted HMMs, for various length and mass statistics of cleavage fragments of random proteins. We obtain general formulas for the length distribution of a fragment, the number of fragments, the joint length-mass distribution, and fragment mass occurrence probabilities, as well as special results for so-called standard cleavage schemes (e.g., for trypsin). We also discuss how to implement the probability computations efficiently. Computational results are provided, together with a comparison to proteins from the SwissProt database.
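The notions of a probabilistically weighted alphabet, a random weighted string, and cleavage fragments can be illustrated with a small simulation. The following is a minimal sketch, not the weighted HMM framework itself: the four-character alphabet, its masses and probabilities, the i.i.d. character model, and the simplified trypsin-like rule (cleave after K or R, ignoring any exceptions) are all assumptions made for illustration, as are the names `WEIGHTED_ALPHABET`, `CHAR_PROBS`, `random_weighted_string`, and `cleave`.

```python
import random

# Hypothetical toy weighted alphabet: character -> list of (mass, probability).
# The masses and probabilities are illustrative only, not real isotopic data.
WEIGHTED_ALPHABET = {
    "A": [(71.0, 0.9), (72.0, 0.1)],
    "G": [(57.0, 1.0)],
    "K": [(128.0, 0.8), (129.0, 0.2)],
    "R": [(156.0, 1.0)],
}

# Hypothetical character frequencies; characters are drawn independently.
CHAR_PROBS = {"A": 0.4, "G": 0.4, "K": 0.1, "R": 0.1}

# Simplified trypsin-like rule: cleave after every K or R (an assumption).
CLEAVE_AFTER = {"K", "R"}


def random_weighted_string(n, rng):
    """Draw n (character, mass) pairs: first the character, then its mass."""
    chars = rng.choices(list(CHAR_PROBS), weights=list(CHAR_PROBS.values()), k=n)
    result = []
    for c in chars:
        masses, probs = zip(*WEIGHTED_ALPHABET[c])
        mass = rng.choices(masses, weights=probs, k=1)[0]
        result.append((c, mass))
    return result


def cleave(weighted_string):
    """Split after each cleavage character; return (length, mass) per fragment."""
    fragments = []
    length, mass = 0, 0.0
    for c, m in weighted_string:
        length += 1
        mass += m
        if c in CLEAVE_AFTER:
            fragments.append((length, mass))
            length, mass = 0, 0.0
    if length > 0:
        # Trailing fragment that does not end in a cleavage character.
        fragments.append((length, mass))
    return fragments


if __name__ == "__main__":
    rng = random.Random(0)
    protein = random_weighted_string(200, rng)
    for length, mass in cleave(protein)[:5]:
        print(length, mass)
```

Repeating the simulation over many random strings gives empirical estimates of the fragment length distribution, the number of fragments, and the joint length-mass distribution, which is the kind of statistic the framework computes exactly rather than by sampling.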
