Overcoming Species Boundaries in Peptide Identification with Bayesian Information Criterion-driven Error-tolerant Peptide Search (BICEPS)

Currently, the reliable identification of peptides and proteins is only feasible when thoroughly annotated sequence databases are available. Although sequencing capacities continue to grow, many organisms still lack the reliable, fully annotated reference genomes required for proteomic analyses. Standard database search algorithms fail to identify peptides that are not exactly contained in a protein database. De novo searches are generally hindered by their limited reliability, and current error-tolerant search strategies are limited by global, heuristic tradeoffs between database and spectral information. We propose a Bayesian information criterion-driven error-tolerant peptide search (BICEPS) and offer an open source implementation that uses this statistical criterion to automatically balance the information from each individual spectrum against that from the database, while limiting the run time. We show that BICEPS, using only the database of a remotely related organism, performs as well as current database search algorithms do when applied to sequenced organisms. For instance, we use a chicken database instead of a human database, corresponding to an evolutionary distance of more than 300 million years (International Chicken Genome Sequencing Consortium (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432, 695–716). We demonstrate the successful application to cross-species proteomics with a 33% increase in the number of identified proteins for a sample of the filarial nematode Litomosoides sigmodontis.

[1]  G. Petit,et al.  Litomosoides sigmodontis in mice: reappraisal of an old model for filarial research. , 2000, Parasitology today.

[2]  Alexey I Nesvizhskii,et al.  Analysis and validation of proteomic data generated by tandem mass spectrometry , 2007, Nature Methods.

[3]  Charles Darwin,et al.  Experiments , 1800, The Medical and physical journal.

[4]  Daniel A. Schaeffer,et al.  Error‐tolerant EST database searches by tandem mass spectrometry and multiTag software , 2005, Proteomics.

[5]  R. Beavis,et al.  A method for reducing the time required to match protein sequences with tandem mass spectra. , 2003, Rapid communications in mass spectrometry : RCM.

[6]  A. Shevchenko,et al.  Protein identification pipeline for the homology-driven proteomics. , 2008, Journal of proteomics.

[7]  Bin Ma,et al.  SPIDER: software for protein identification from sequence tags with de novo sequencing error. , 2004, Proceedings. IEEE Computational Systems Bioinformatics Conference.

[8]  Robertson Craig,et al.  TANDEM: matching proteins with tandem mass spectra. , 2004, Bioinformatics.

[9]  J. Yates,et al.  Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. , 1995, Analytical chemistry.

[10]  R. J. Beynon,et al.  Cross Species Proteomics , 2010, Proteome Bioinformatics.

[11]  Richard D. Smith,et al.  De novo sequencing of unique sequence tags for discovery of post-translational modifications of proteins. , 2008, Analytical chemistry.

[12]  D. Ruppert The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2004 .

[13]  M. O. Dayhoff,et al.  22 A Model of Evolutionary Change in Proteins , 1978 .

[14]  Mark L. Blaxter,et al.  A molecular evolutionary framework for the phylum Nematoda , 1998, Nature.

[15]  A. Shevchenko,et al.  MultiTag: multiple error-tolerant sequence tag search for the sequence-similarity identification of proteins by mass spectrometry. , 2003, Analytical chemistry.

[16]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[17]  David L Tabb,et al.  DirecTag: accurate sequence tags from peptide MS/MS through statistical scoring. , 2008, Journal of proteome research.

[18]  Tatiana Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[19]  Joachim M. Buhmann,et al.  PepSplice: cache-efficient search algorithms for comprehensive identification of tandem mass spectra , 2007, Bioinform..

[20]  John M. Asara,et al.  Protein Sequences from Mastodon and Tyrannosaurus Rex Revealed by Mass Spectrometry , 2007, Science.

[21]  S. Salzberg,et al.  Genome Assembly Has a Major Impact on Gene Content: A Comparison of Annotation in Two Bos Taurus Assemblies , 2011, PloS one.

[22]  P. Pevzner,et al.  Automated de novo protein sequencing of monoclonal antibodies , 2008, Nature Biotechnology.

[23]  P. Pevzner,et al.  Sequence similarity‐driven proteomics in organisms with unknown genomes by LC‐MS/MS and automated de novo sequencing , 2007, Proteomics.

[24]  L. Cantley,et al.  Biomolecular Characterization and Protein Sequences of the Campanian Hadrosaur B. canadensis , 2009, Science.

[25]  W. Miller,et al.  Comment on "Protein Sequences from Mastodon and Tyrannosaurus rex Revealed by Mass Spectrometry" , 2008, Science.

[26]  Patrice Waridel,et al.  Rapid validation of protein identifications with the borderline statistical confidence via de novo sequencing and MS BLAST searches. , 2006, Journal of proteome research.

[27]  D. Tabb,et al.  TagRecon: high-throughput mutation identification through sequence tagging. , 2010, Journal of proteome research.

[28]  P. Pevzner,et al.  PepNovo: de novo peptide sequencing via probabilistic network modeling. , 2005, Analytical chemistry.

[29]  J. Buhmann,et al.  A workflow to increase the detection rate of proteins from unsequenced organisms in high‐throughput proteomics experiments , 2007, Proteomics.

[30]  M. Wilm,et al.  Error-tolerant identification of peptides in sequence databases by peptide sequence tags. , 1994, Analytical chemistry.

[31]  W. Pao,et al.  A Bioinformatics Workflow for Variant Peptide Detection in Shotgun Proteomics , 2011, Molecular & Cellular Proteomics.

[32]  D. N. Perkins,et al.  Probability‐based protein identification by searching sequence databases using mass spectrometry data , 1999, Electrophoresis.

[33]  Christodoulos A. Floudas,et al.  A hybrid method for peptide identification using integer linear optimization, local database search, and quadrupole time-of-flight or OrbiTrap tandem mass spectrometry. , 2008, Journal of proteome research.

[34]  Judith A J Steen,et al.  When less can yield more – Computational preprocessing of MS/MS spectra for peptide identification , 2009, Proteomics.

[35]  J. Yates,et al.  An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database , 1994, Journal of the American Society for Mass Spectrometry.

[36]  A. Shevchenko,et al.  The Power and the Limitations of Cross-Species Protein Identification by Mass Spectrometry-driven Sequence Similarity Searches , 2004, Molecular & Cellular Proteomics.

[37]  Bin Ma,et al.  Automated protein (re)sequencing with MS/MS and a homologous database yields almost full coverage and accuracy , 2009, Bioinform..

[38]  P. Pevzner,et al.  Spectral Profiles, a Novel Representation of Tandem Mass Spectra and Their Applications for De Novo Peptide Sequencing and Identification , 2022 .

[39]  Colin N. Dewey,et al.  Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution , 2004, Nature.

[40]  Hanno Steen,et al.  Estimating the confidence of peptide identifications without decoy databases. , 2010, Analytical chemistry.

[41]  P. Bork,et al.  Charting the proteomes of organisms with unsequenced genomes by MALDI-quadrupole time-of-flight mass spectrometry and BLAST homology searching. , 2001, Analytical chemistry.

[42]  Leo C. McHugh,et al.  Computational Methods for Protein Identification from Mass Spectrometry Data , 2008, PLoS Comput. Biol..

[43]  Bo Yan,et al.  Peptide sequence tag-based blind identification of post-translational modifications with point process model , 2006, ISMB.

[44]  Ronald J Moore,et al.  Proteome-wide identification of proteins and their modifications with decreased ambiguities and improved false discovery rates using unique sequence tags. , 2008, Analytical chemistry.

[45]  David Goldberg,et al.  Lookup peaks: a hybrid of de novo sequencing and database search for protein identification by tandem mass spectrometry. , 2007, Analytical chemistry.

[46]  Andrew S. Greene,et al.  DeNovoID: a web-based tool for identifying peptides from sequence and mass tags deduced from de novo peptide sequencing by mass spectroscopy , 2005, Nucleic Acids Res..

[47]  J. Yates,et al.  GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. , 2003, Analytical chemistry.

[48]  P. Pevzner,et al.  Comment on "Protein Sequences from Mastodon and Tyrannosaurus rex Revealed by Mass Spectrometry" , 2008, Science.

[49]  Bin Ma,et al.  SPIDER: software for protein identification from sequence tags with de novo sequencing error , 2004, Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004..

[50]  Sean L Seymour,et al.  The Paragon Algorithm, a Next Generation Search Engine That Uses Sequence Temperature Values and Feature Probabilities to Identify Peptides from Tandem Mass Spectra , 2007, Molecular & Cellular Proteomics.

[51]  D. Creasy,et al.  Error tolerant searching of uninterpreted tandem mass spectrometry data , 2002, Proteomics.

[52]  A. Shevchenko,et al.  Tools for exploring the proteomosphere. , 2009, Journal of proteomics.

[53]  B. Searle,et al.  High-throughput identification of proteins and unanticipated sequence modifications using a mass-based alignment algorithm for MS/MS de novo sequencing results. , 2004, Analytical chemistry.

[54]  J. Yates,et al.  A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. , 2003, Analytical chemistry.

[55]  Gerald J Wyckoff,et al.  Virtual polymorphism: finding divergent peptide matches in mass spectrometry data. , 2007, Analytical chemistry.