Protein-Level Integration Strategy of Multiengine MS Spectra Search Results for Higher Confidence and Sequence Coverage.

Multiple search engines based on different models have been developed to search MS/MS spectra against a reference database, and they give different results for the same data set. How to integrate these results efficiently with minimal compromise on false discoveries remains an open question because an independent, reliable, and highly sensitive standard has been lacking. We took advantage of translating mRNA sequencing (RNC-seq) results as a standard to evaluate strategies for integrating protein identifications from various search engines. We used seven mainstream search engines (Andromeda, Mascot, OMSSA, X!Tandem, pFind, InsPecT, and ProVerB) to search the same label-free MS data sets of the human cell lines Hep3B, MHCCLM3, and MHCC97H from the Chinese C-HPP Consortium for Chromosomes 1, 8, and 20. As expected, taking the union of the seven engines increased false identifications, whereas taking the intersection of the seven engines markedly decreased identification power. We found that accepting proteins identified by at least two of the seven engines maximized protein identification power while minimizing the ratio of suspicious to translation-supported identifications (STR), an index based on RNC-seq. Furthermore, this strategy also significantly improved the peptide coverage of protein amino acid sequences. In summary, we demonstrated a simple strategy that significantly improves the performance of shotgun mass spectrometry by integrating multiple search engines at the protein level, maximizing the utilization of existing MS spectra without additional experimental work.
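The protein-level integration rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `integrate` and `str_index`, the toy protein identifiers, and the reading of "suspicious" as "identified but absent from the RNC-seq translated set" are all assumptions made for illustration.

```python
from collections import Counter


def integrate(engine_hits, min_engines=2):
    """Return proteins reported by at least `min_engines` engines.

    engine_hits: dict mapping engine name -> set of protein IDs (hypothetical input format).
    min_engines=1 gives the union; min_engines=len(engine_hits) gives the intersection.
    """
    votes = Counter()
    for proteins in engine_hits.values():
        votes.update(set(proteins))  # each engine votes once per protein
    return {protein for protein, n in votes.items() if n >= min_engines}


def str_index(identified, translated):
    """Ratio of suspicious to translation-supported identifications.

    Assumption: a protein is 'suspicious' when it is identified by MS
    but has no support in the RNC-seq translated set.
    """
    supported = identified & translated
    suspicious = identified - translated
    if not supported:
        return float("inf") if suspicious else 0.0
    return len(suspicious) / len(supported)


# Toy data (invented, three engines only) to show the trade-off.
engine_hits = {
    "Andromeda": {"P1", "P2", "P3"},
    "Mascot": {"P1", "P2", "P4"},
    "OMSSA": {"P1", "P5"},
}
translated = {"P1", "P2", "P3"}  # proteins with RNC-seq evidence

union = integrate(engine_hits, min_engines=1)
at_least_two = integrate(engine_hits, min_engines=2)
intersection = integrate(engine_hits, min_engines=len(engine_hits))

for label, proteins in [("union", union),
                        ("at least two", at_least_two),
                        ("intersection", intersection)]:
    print(label, sorted(proteins), round(str_index(proteins, translated), 3))
```

In this toy case the union keeps the most proteins but admits the unsupported ones, the intersection keeps only one protein, and the at-least-two rule sits between them, which mirrors the qualitative trend the abstract reports without reproducing any of its numbers.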
