Scalable Data Analysis in Proteomics and Metabolomics Using BioContainers and Workflows Engines

Recent improvements in mass spectrometry instruments and new analytical methods are increasing the intersection between proteomics and big data science. In addition, bioinformatics analysis is becoming increasingly complex, involving multiple algorithms and tools. A wide variety of methods and software tools have been developed for computational proteomics and metabolomics in recent years, and this trend is likely to continue. However, most computational proteomics and metabolomics tools are designed as single-tiered software applications in which the analytics tasks cannot be distributed, which limits the scalability and reproducibility of the data analysis. In this paper, the key steps of metabolomics and proteomics data processing, including the main tools and software used to perform the data analysis, are summarized. The combination of software containers with workflow environments for large-scale metabolomics and proteomics analysis is discussed. Finally, a new approach for reproducible and large-scale data analysis based on BioContainers and two of the most popular workflow environments, Galaxy and Nextflow, is introduced to the proteomics and metabolomics communities.
