Gapped Spectral Dictionaries and Their Applications for Database Searches of Tandem Mass Spectra

Generating all plausible de novo interpretations of a peptide tandem mass (MS/MS) spectrum (the Spectral Dictionary) and quickly matching them against a database is a recently introduced alternative approach to peptide identification. However, the size of a Spectral Dictionary grows quickly with peptide length, making its generation impractical for long peptides. We introduce Gapped Spectral Dictionaries (all plausible de novo interpretations with gaps), which can be easily generated for any peptide length, thus addressing this shortcoming of the Spectral Dictionary approach. We show that Gapped Spectral Dictionaries are small, which opens the possibility of using them to speed up MS/MS database searches. Our MS-GappedDictionary algorithm (based on Gapped Spectral Dictionaries) enables proteogenomics applications that are prohibitively time-consuming with existing approaches. We further introduce gapped tags, which have advantages over conventional peptide sequence tags in filtration-based MS/MS database searches.
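To make the notion of an interpretation "with gaps" concrete, the sketch below shows one way a gapped peptide could be matched against a protein sequence. It is an illustration only, not the MS-GappedDictionary algorithm: the representation (a list mixing single-letter residues with integer gap masses), the use of nominal integer residue masses, and the function names `match_at` and `find_matches` are all assumptions introduced here.

```python
# Nominal (integer) monoisotopic residue masses in daltons.
# Using integer masses is a simplifying assumption for this sketch.
NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def match_at(gapped, protein, start):
    """Return the end index if `gapped` matches `protein` at `start`, else None.

    `gapped` is a list whose elements are either a one-letter residue
    (must match exactly) or an int gap mass (must equal the total mass
    of one or more consecutive residues). Hypothetical representation.
    """
    pos = start
    for element in gapped:
        if isinstance(element, str):
            # Literal residue: the next database character must be identical.
            if pos >= len(protein) or protein[pos] != element:
                return None
            pos += 1
        else:
            # Gap: consume residues until their masses sum to the gap mass.
            remaining = element
            while remaining > 0:
                if pos >= len(protein) or protein[pos] not in NOMINAL_MASS:
                    return None
                remaining -= NOMINAL_MASS[protein[pos]]
                pos += 1
            if remaining != 0:
                return None  # overshot: no run of residues has this mass
    return pos


def find_matches(gapped, protein):
    """Return every substring of `protein` that the gapped peptide matches."""
    hits = []
    for start in range(len(protein)):
        end = match_at(gapped, protein, start)
        if end is not None:
            hits.append(protein[start:end])
    return hits


# The gap 226 stands for "EP" (129 + 97) and 244 for "DE" (115 + 129).
gapped_peptide = ["P", 226, "T", "I", 244]
hits = find_matches(gapped_peptide, "MKPEPTIDEK")
print(hits)
```

Because all residue masses are positive, each gap is consumed deterministically from a given start position, so this naive scan needs no backtracking. A practical search would index the database rather than scan every start position.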
