Gapped Spectral Dictionaries and Their Applications for Database Searches of Tandem Mass Spectra

Generating all plausible de novo interpretations of a peptide tandem mass (MS/MS) spectrum (the Spectral Dictionary) and quickly matching them against a database is a recently emerged alternative approach to peptide identification. However, the size of a Spectral Dictionary grows quickly with peptide length, making its generation impractical for long peptides. We introduce Gapped Spectral Dictionaries (all plausible de novo interpretations with gaps), which can be easily generated for any peptide length and thus address this limitation of the Spectral Dictionary approach. We show that Gapped Spectral Dictionaries are small, which opens the possibility of using them to speed up MS/MS searches. Our MS-GappedDictionary algorithm (based on Gapped Spectral Dictionaries) enables proteogenomics applications, such as searches in the six-frame translation of the human genome, that are prohibitively time-consuming with existing approaches. MS-GappedDictionary generates gapped peptides that occupy a niche between accurate but short peptide sequence tags and long but inaccurate full-length peptide reconstructions. We show that, contrary to conventional wisdom, some high-quality spectra do not have good peptide sequence tags, and we introduce gapped tags that have advantages over conventional peptide sequence tags in MS/MS database searches.
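
To make the notion of a gapped peptide concrete, the sketch below represents one as a list of masses, where each entry is the mass of one residue or the summed mass of several consecutive residues (a gap), and scans a protein sequence for substrings consistent with it. This is an illustration only, not the MS-GappedDictionary algorithm: it assumes nominal integer residue masses, ignores modifications and mass tolerance, and the names `collapse`, `match_at`, and `find_matches` are hypothetical.

```python
# Nominal (integer) monoisotopic residue masses; a simplifying assumption.
MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def collapse(peptide, group_sizes):
    """Turn a peptide into a gapped peptide (hypothetical helper).

    group_sizes lists how many consecutive residues are merged into each
    element; a size of 1 keeps a residue, a larger size creates a gap.
    """
    if sum(group_sizes) != len(peptide):
        raise ValueError("group sizes must cover the whole peptide")
    gapped, pos = [], 0
    for size in group_sizes:
        gapped.append(sum(MASS[aa] for aa in peptide[pos:pos + size]))
        pos += size
    return gapped


def match_at(protein, start, gapped):
    """Return the end index if `gapped` matches at `start`, else None."""
    pos = start
    for target in gapped:
        total = 0
        # Residue masses are positive, so extending greedily is exact.
        while total < target:
            if pos == len(protein) or protein[pos] not in MASS:
                return None
            total += MASS[protein[pos]]
            pos += 1
        if total != target:
            return None
    return pos


def find_matches(protein, gapped):
    """List all substrings of `protein` consistent with the gapped peptide."""
    hits = []
    for start in range(len(protein)):
        end = match_at(protein, start, gapped)
        if end is not None:
            hits.append(protein[start:end])
    return hits


# PEPTIDE with the pairs "EP" and "ID" collapsed into gaps.
gapped = collapse("PEPTIDE", [1, 2, 1, 2, 1])
matches = find_matches("MKPEPTIDERAG", gapped)
```

Here `gapped` is `[97, 226, 101, 228, 129]` and `matches` is `["PEPTIDE"]`. Because a gap records only a total mass, the same gapped peptide also matches `PEPTLDE`, which shows how a gapped representation trades sequence detail for robustness.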
