Systematic Comparison of False-Discovery-Rate-Controlling Strategies for Proteogenomic Search Using Spike-in Experiments.

Proteogenomic searches are useful for identifying novel peptides from tandem mass spectra. Separate and multistage approaches are usually adopted to control the false discovery rate (FDR) accurately in proteogenomic searches. However, their performance on novel peptide identification has not been thoroughly evaluated, mainly because it is difficult to confirm that the identified novel peptides actually exist. We simulated a proteogenomic search using a controlled, spike-in proteomic data set. After confirming that the results of the simulated proteogenomic search were similar to those of a real proteogenomic search using a human cell line data set, we evaluated the performance of six FDR control methods on novel peptide identification: global, separate, and multistage FDR estimation, each coupled to either a target-decoy search or a mixture model-based method. The multistage approach showed the highest accuracy for FDR estimation. However, global and separate FDR estimation with the mixture model-based method showed higher sensitivity than the other methods at the same true FDR. Furthermore, the mixture model-based method performed equally well when applied with no decoy sequences or with a reduced set of decoy sequences. Considering the different prior probabilities of novel and known protein identification, we recommend using mixture model-based methods with separate FDR estimation for sensitive and reliable identification of novel peptides from proteogenomic searches.
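The difference between global and separate FDR estimation under a target-decoy search can be illustrated with a minimal sketch. This is not the implementation used in the study: the `PSM` record, the function names, the toy scores, and the simple decoys/targets estimator are all assumptions made for illustration, and the multistage and mixture model-based variants are not shown.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    """Hypothetical peptide-spectrum match record (illustrative only)."""
    score: float     # search score; higher means a better match
    is_decoy: bool   # True if the match is to a decoy sequence
    is_novel: bool   # True if the match is to the novel (non-reference) part of the database


def accepted_at_fdr(psms, fdr_threshold):
    """Accept target PSMs in the longest score-ranked prefix whose
    estimated FDR (#decoys / #targets) does not exceed the threshold."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    decoys = targets = 0
    best_cut = 0
    for i, psm in enumerate(ranked, start=1):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= fdr_threshold:
            best_cut = i
    return [p for p in ranked[:best_cut] if not p.is_decoy]


def global_fdr(psms, fdr_threshold):
    """Global estimation: known and novel matches share one score cutoff."""
    return accepted_at_fdr(psms, fdr_threshold)


def separate_fdr(psms, fdr_threshold):
    """Separate estimation: known and novel matches are filtered independently."""
    known = [p for p in psms if not p.is_novel]
    novel = [p for p in psms if p.is_novel]
    return accepted_at_fdr(known, fdr_threshold) + accepted_at_fdr(novel, fdr_threshold)


# Toy data (made up): high-scoring known targets, lower-scoring novel matches
# that are interleaved with decoys.
toy = [
    PSM(10.0, False, False), PSM(9.0, False, False),
    PSM(8.0, False, False), PSM(7.0, False, False),
    PSM(1.0, True, False),
    PSM(6.0, False, True), PSM(5.5, True, True),
    PSM(5.0, False, True), PSM(4.0, True, True),
]

novel_global = [p for p in global_fdr(toy, 0.25) if p.is_novel]
novel_separate = [p for p in separate_fdr(toy, 0.25) if p.is_novel]
print(len(novel_global), len(novel_separate))
```

In this toy case the global cutoff is carried by the many high-scoring known matches, so it admits both novel matches, whereas the separate estimate judges the novel matches against their own decoys and admits only one. This is the kind of behavior that motivates estimating the FDR for novel peptides separately.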
