Streaming FDR Calculation for Protein Identification

Protein identification is a key step in metaproteomics research. Migrating this task to a fast data streaming architecture would increase its horizontal scalability and performance. A protein database search involves two steps: the pairwise matching of experimental spectra against protein sequences, which creates peptide-spectrum matches (PSMs), and the statistical validation of those PSMs. Peptide-spectrum matching is inherently parallelizable because each match is independent. However, measurement errors and artifacts inevitably produce false positive matches, so statistical validation is required. State-of-the-art validation uses the target-decoy method, which estimates the false discovery rate (FDR) by searching against a shuffled version of the original protein database. In contrast to the database search itself, target-decoy validation is not parallelizable, because the FDR approximation requires all experimental data at once. In a fast data architecture, the target-decoy approach is therefore no longer feasible, and a novel approach is required to avoid false discovery of PSMs on streaming, single-pass experimental data. The recently proposed nokoi classifier is a promising candidate for this purpose. In this paper, we present a general nokoi pipeline for creating such a decoy-free classifier, which reaches over 95% accuracy on general metaproteomics data.
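
To illustrate why target-decoy validation needs the complete set of PSMs, the following minimal sketch ranks all matches by score and estimates the FDR at each rank as the ratio of decoy hits to target hits. The function names, the `(score, is_decoy)` tuple layout, and the simple decoys/targets estimator are assumptions made for illustration; they are not taken from the paper, and other estimators are in use.

```python
from typing import List, Optional, Tuple

# A PSM is represented here as (score, is_decoy); this layout is a
# hypothetical simplification, not a format defined in the paper.
PSM = Tuple[float, bool]


def target_decoy_curve(psms: List[PSM]) -> List[Tuple[float, float]]:
    """Return (score, estimated FDR) pairs, best score first.

    The global sort is the step that needs every PSM at once, which is
    what makes this validation awkward in a single-pass stream.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    curve = []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Assumed estimator: decoy hits divided by target hits so far.
        fdr = decoys / targets if targets else 1.0
        curve.append((score, fdr))
    return curve


def score_cutoff(psms: List[PSM], max_fdr: float) -> Optional[float]:
    """Lowest score at which the estimated FDR is still <= max_fdr."""
    cutoff = None
    for score, fdr in target_decoy_curve(psms):
        if fdr <= max_fdr:
            cutoff = score
    return cutoff
```

Because the cutoff depends on the ranking of the whole dataset, a PSM that arrives later in a stream can change the estimate for all earlier ones; a decoy-free classifier avoids this by scoring each PSM independently.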
