PepExplorer: A Similarity-driven Tool for Analyzing de Novo Sequencing Results

Peptide spectrum matching is the current gold standard for protein identification in mass-spectrometry-based proteomics. It compares experimental mass spectra against theoretical spectra generated from a protein sequence database, so proteins whose sequences are absent from the database cannot be identified unless their sequences are at least partly conserved. The alternative approach, de novo sequencing, infers a peptide sequence directly from a mass spectrum, but interpreting the long lists of peptide sequences produced by large-scale experiments is not trivial. With this as motivation, PepExplorer was developed to use rigorous pattern recognition to assemble a list of homologous proteins from de novo sequencing data coupled to sequence alignment, allowing biological interpretation of the data. PepExplorer can read the output of various widely adopted de novo sequencing tools and converge to a list of proteins with a global false-discovery rate. To this end, it employs a radial basis function neural network that considers precursor charge states, de novo sequencing scores, peptide lengths, and alignment scores to select similar protein candidates from a target-decoy database, usually obtained from phylogenetically related species. Alignments are performed using a modified Smith–Waterman algorithm tailored for the task at hand. We verified the effectiveness of our approach using a reference set of identifications generated by ProLuCID when searching Pyrococcus furiosus mass spectra against the corresponding NCBI RefSeq database. We then modified the sequence database by swapping amino acids until ProLuCID was no longer capable of identifying any proteins. By searching the mass spectra using PepExplorer on the modified database, we were able to recover most of the identifications at a 1% false-discovery rate. Finally, we employed PepExplorer to carry out a comprehensive proteomic assessment of Bothrops jararaca plasma, a known biological source of natural inhibitors of snake toxins. PepExplorer is integrated into the PatternLab for Proteomics environment, which makes available various tools for downstream data analysis, including resources for quantitative and differential proteomics.
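
The two building blocks named in the abstract, local alignment of de novo peptides against database proteins and a target-decoy false-discovery rate, can be illustrated with a minimal sketch. This is not PepExplorer's implementation: it uses the textbook Smith–Waterman recurrence with a linear gap penalty rather than the modified variant the abstract mentions, the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions, the function names are hypothetical, and the decoys/targets ratio is one common way to estimate the false-discovery rate that may differ from the estimator the tool actually uses.

```python
from typing import List, Tuple


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b.

    Textbook Smith-Waterman with a linear gap penalty; the scoring
    values are illustrative assumptions, not PepExplorer's.
    """
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere (local alignment).
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def decoy_fdr(hits: List[Tuple[float, bool]], threshold: float) -> float:
    """Estimate the FDR as decoys / targets among hits scoring >= threshold.

    Each hit is (score, is_decoy). Returns 0.0 when no target passes.
    """
    targets = sum(1 for s, is_decoy in hits if s >= threshold and not is_decoy)
    decoys = sum(1 for s, is_decoy in hits if s >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    # Hypothetical de novo peptide aligned against a hypothetical protein fragment.
    print(smith_waterman_score("XXPEPYY", "ZZPEPWW"))
    # Hypothetical scored hits: (score, is_decoy).
    hits = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
    print(decoy_fdr(hits, threshold=8.0))
```

In a pipeline of this kind, one would lower the score threshold until the estimated false-discovery rate reaches the desired level (1% in the abstract's evaluation). The sketch uses a single alignment score for that ranking, whereas the abstract describes a classifier that combines several features.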
