ProteomicsML: An Online Platform for Community-Curated Data sets and Tutorials for Machine Learning in Proteomics

Data set acquisition and curation are often the most difficult and time-consuming parts of a machine learning endeavor. This is especially true for proteomics data sets acquired by liquid chromatography (LC) coupled to mass spectrometry (MS), because of the high levels of data reduction that occur between raw data and machine learning-ready data. Because predictive proteomics is an emerging field, labs that predict peptide behavior in LC-MS setups often use unique and complex data processing pipelines to maximize performance, at the cost of accessibility and reproducibility. For this reason, we introduce ProteomicsML, an online resource for proteomics-based data sets and tutorials across most of the currently explored physicochemical peptide properties. This community-driven resource makes it simple to access data in easy-to-process formats and contains easy-to-follow tutorials that allow new users to interact with even the most advanced algorithms in the field. ProteomicsML provides data sets that are useful for comparing state-of-the-art machine learning algorithms, as well as introductory material for teachers and newcomers to the field alike. The platform is freely available at https://www.proteomicsml.org/, and we welcome the entire proteomics community to contribute to the project at https://github.com/ProteomicsML/ProteomicsML.
