Sequence Database Compression for Peptide Identification from Tandem Mass Spectra

The identification of peptides from tandem mass spectra is an important part of many high-throughput proteomics pipelines. In the high-throughput setting, spectra are typically identified by search engines that match tandem mass spectra against putative peptides drawn from amino-acid sequence databases. The effectiveness of these search engines depends heavily on the completeness of the sequence database used, but suitably complete databases are large, and search times are typically proportional to the size of the database. We demonstrate that the peptide content of an amino-acid sequence database can be represented by a reformulated amino-acid sequence database containing fewer amino-acid symbols than the original. In some cases, where the original database contains many redundant peptides, we have been able to reduce the amino-acid sequence database to almost half of its original size. We develop a lower bound on the achievable compression and demonstrate empirically that, regardless of the peptide redundancy of the original database, we can compress the sequence to within 15-25% of this lower bound. We believe this may provide a principled way to combine amino-acid sequence data from many sources without unduly bloating the resulting sequence database with redundant peptide sequences.
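To make the idea of redundant peptide content concrete, the following minimal sketch counts the distinct fixed-length substrings (k-mers) in a toy database and compares the total symbol count with a simple counting bound. Everything here is an assumption made for illustration: the use of a fixed length k as a stand-in for "peptide", the function names, and the toy sequences do not come from the paper, and the bound shown (a single string containing n distinct k-mers needs at least n + k - 1 symbols) is a generic counting argument, not necessarily the lower bound the paper develops.

```python
def distinct_kmers(sequences, k):
    """Return the set of distinct length-k substrings across all sequences."""
    kmers = set()
    for seq in sequences:
        # Sequences shorter than k contribute nothing (empty range).
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


def naive_lower_bound(sequences, k):
    """Counting bound (illustrative assumption, not the paper's bound):
    a single string with n distinct k-mers has at least n + k - 1 symbols."""
    n = len(distinct_kmers(sequences, k))
    return n + k - 1 if n else 0


# Hypothetical toy database with heavily overlapping entries.
toy_db = ["MKTAYIAK", "KTAYIAKQ", "MKTAYIAK"]
k = 4

total_symbols = sum(len(s) for s in toy_db)
bound = naive_lower_bound(toy_db, k)

print(f"original symbols: {total_symbols}")
print(f"distinct {k}-mers: {len(distinct_kmers(toy_db, k))}")
print(f"counting lower bound: {bound}")
```

In this toy case the single sequence MKTAYIAKQ contains every 4-mer of the original entries, so the redundant database can be represented with far fewer symbols while preserving its k-mer content.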
