A workflow to increase the detection rate of proteins from unsequenced organisms in high‐throughput proteomics experiments

We present and evaluate a strategy for the mass spectrometric identification of proteins from organisms for which no genome sequence information is available; the strategy incorporates cross‐species information from sequenced organisms. The method combines spectrum quality scoring, de novo sequencing and error‐tolerant BLAST searches, and it is designed to decrease input data complexity. Spectral quality scoring reduces the number of investigated mass spectra without a loss of information. Stringent quality‐based selection and the combination of different de novo sequencing methods substantially increase the catalog of significant peptide alignments. The de novo sequences that pass a reliability filter are then submitted to error‐tolerant BLAST searches, and the resulting MS‐BLAST hits are validated by a sampling technique. With the described workflow, we identified up to 20% more groups of homologous proteins in proteome analyses of organisms with unsequenced genomes than with state‐of‐the‐art database searches against an Arabidopsis thaliana database. We consider this data analysis workflow an excellent screening method for identifying proteins that evade detection in proteomics experiments as a result of database constraints.
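
The filtering stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names, the function names, the score ranges and the thresholds are all hypothetical, and the abstract does not specify how the quality and reliability scores are computed. The sketch only shows the order of operations, that is, discarding low‐quality spectra first, then pooling de novo sequences from several methods and keeping those that pass a reliability filter as queries for a later error‐tolerant homology search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spectrum:
    spectrum_id: str
    quality: float  # hypothetical spectral quality score, assumed in [0, 1]


@dataclass(frozen=True)
class DeNovoCall:
    spectrum_id: str
    tool: str           # placeholder label for a de novo sequencing method
    peptide: str
    reliability: float  # hypothetical reliability score, assumed in [0, 1]


def select_spectra(spectra, min_quality):
    """Keep only spectra whose quality score reaches the threshold."""
    return [s for s in spectra if s.quality >= min_quality]


def reliable_queries(calls, kept_ids, min_reliability):
    """Pool de novo calls from all tools for the kept spectra.

    Returns a dict mapping spectrum id to the distinct peptide
    sequences that pass the reliability filter, in input order.
    """
    queries = {}
    for call in calls:
        if call.spectrum_id not in kept_ids:
            continue  # spectrum was removed by the quality filter
        if call.reliability < min_reliability:
            continue  # sequence did not pass the reliability filter
        peptides = queries.setdefault(call.spectrum_id, [])
        if call.peptide not in peptides:
            peptides.append(call.peptide)
    return queries
```

In such a sketch, the resulting peptide lists would be passed to the homology search step; the search itself and the sampling‐based validation of its hits are deliberately left out because the abstract gives no detail on them.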
