Preliminary Search Engine for Open Protein Identification

Protein identification is the most fundamental problem in proteomics, and database searching of tandem mass spectra is one of the most widely used identification techniques. However, the improved sensitivity of mass spectrometers, the rapid expansion of protein databases, and more complex analyses, such as post-translational modification and non-specific enzymatic digestion, severely challenge the scale and speed of current restricted protein identification search engines. In this paper, we propose an open protein identification method that relaxes the enzyme specificity constraint, and we present a distributed design that supports large protein databases with non-specific digestion analysis. The design is based on pFind, a practical tandem mass spectra search engine developed in China. On two larger, commonly used protein databases, ipi.HUMAN and uniprot-sprot, we obtained nearly linear speedup on a 20-blade cluster. Further analysis suggests that real-time identification can be expected to some extent.
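To illustrate why relaxing the enzyme constraint strains a search engine, the following minimal sketch compares the number of candidate peptides produced by a specific (tryptic) digestion with the number produced by a non-specific digestion of the same sequence. It is not pFind's implementation: the function names, the toy sequence, the length bounds, and the simplified trypsin rule (cleave after K or R unless the next residue is P, no missed cleavages) are all assumptions chosen for illustration.

```python
def tryptic_peptides(protein, min_len=6, max_len=30):
    """Fully tryptic peptides with no missed cleavages (simplified rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        # Assumed rule: cleave after K or R unless the next residue is P.
        cleave = aa in "KR" and not is_last and protein[i + 1] != "P"
        if cleave or is_last:
            peptide = protein[start:i + 1]
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
            start = i + 1
    return peptides


def nonspecific_peptides(protein, min_len=6, max_len=30):
    """Every substring within the length bounds: no enzyme constraint at all."""
    n = len(protein)
    return [
        protein[s:s + length]
        for s in range(n)
        for length in range(min_len, max_len + 1)
        if s + length <= n
    ]


if __name__ == "__main__":
    toy = "GASKLLLLLLRPEEEK"  # hypothetical 16-residue sequence
    print(len(tryptic_peptides(toy, 4, 30)), "tryptic candidates")
    print(len(nonspecific_peptides(toy, 4, 6)), "non-specific candidates")
```

For a protein of length n, the specific digestion yields a number of candidates roughly proportional to the number of cleavage sites, whereas the non-specific digestion yields on the order of n times the width of the allowed length range. That growth in candidates per protein is what makes distributing the database search attractive.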
