WOMBAT-P: Benchmarking Label-Free Proteomics Data Analysis Workflows

Proteomics research encompasses a wide array of experimental designs, producing datasets that vary in structure and properties. This diversity has led to a considerable variety of software solutions for data analysis, each combining multiple tools with different algorithms for operations such as peptide-spectrum matching, protein inference, quantification, statistical analysis, and visualization. Computational workflows chain these algorithms into an end-to-end analysis, from raw data to the detection of differentially regulated proteins. We introduce WOMBAT-P, a versatile platform for the automatic benchmarking and comparison of bottom-up label-free proteomics workflows. By standardizing software parameterization and workflow outputs, WOMBAT-P enables an objective comparison of four commonly used data analysis workflows. WOMBAT-P also streamlines the processing of public data based on the provided metadata, with optional specification of 30 parameters. It accepts the Sample and Data Relationship Format for Proteomics (SDRF-Proteomics) as file input, so that annotated local datasets or datasets deposited in ProteomeXchange can be processed directly. This feature offers a shortcut for data analysis and facilitates comparison of the different workflow outputs. Examining an experimental ground-truth dataset and a realistic biological dataset, we find substantial differences between the workflows and a low overlap between the proteins they identify and quantify. WOMBAT-P enables rapid execution and direct comparison of the four workflows on the same dataset using a wide range of benchmarking metrics, and it provides insight into the capabilities of the different software solutions. These metrics help researchers select the most suitable workflow for their specific dataset. The modular architecture of WOMBAT-P promotes extensibility and customization, making it well suited for testing newly developed software tools within a realistic data analysis context.
