Interoperable and scalable data analysis with microservices: applications in metabolomics

Abstract

Motivation: Developing a robust and performant data analysis workflow that integrates all necessary components while still scaling over multiple compute nodes is a challenging task. We introduce a generic method based on the microservice architecture, in which software tools are encapsulated as Docker containers that can be connected into scientific workflows and executed using the Kubernetes container orchestrator.

Results: We developed a Virtual Research Environment (VRE) that facilitates rapid integration of new tools and the development of scalable and interoperable workflows for metabolomics data analysis. The environment can be launched on demand on cloud resources and desktop computers. IT-expertise requirements on the user side are kept to a minimum, and workflows can be re-used effortlessly by any novice user. We validated the method in the field of metabolomics on two mass spectrometry studies, one nuclear magnetic resonance spectroscopy study and one fluxomics study. We showed that the method scales dynamically with increasing availability of computational resources. We demonstrated that the method facilitates interoperability by integrating the major software suites into a turn-key workflow encompassing all steps of mass-spectrometry-based metabolomics, including preprocessing, statistics and identification. The microservice approach is a generic methodology that can serve any scientific discipline and opens the way for new types of large-scale integrative science.

Availability and implementation: The PhenoMeNal consortium maintains a web portal (https://portal.phenomenal-h2020.eu) providing a GUI for launching the Virtual Research Environment. The GitHub repository https://github.com/phnmnl/ hosts the source code of all projects.

Supplementary information: Supplementary data are available at Bioinformatics online.
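To make the architecture described in the abstract concrete, the following is a minimal sketch of how containerized tools might be expressed as Kubernetes Jobs, one per workflow step. It is not the PhenoMeNal implementation: the image names, commands, the `/data` mount path and the `shared-data` volume claim are all hypothetical placeholders, and a real workflow engine would also handle ordering and submission to the cluster.

```python
import json


def make_job(step_name, image, command, data_claim="shared-data"):
    """Return a Kubernetes batch/v1 Job manifest (as a dict) for one step.

    The claim name and the /data mount path are assumptions for this sketch;
    each step reads and writes files on the same shared volume.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": step_name},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": step_name,
                            "image": image,
                            "command": command,
                            "volumeMounts": [
                                {"name": "data", "mountPath": "/data"}
                            ],
                        }
                    ],
                    "volumes": [
                        {
                            "name": "data",
                            "persistentVolumeClaim": {"claimName": data_claim},
                        }
                    ],
                }
            }
        },
    }


# Hypothetical three-step workflow: images and commands are placeholders,
# chosen only to mirror the preprocessing/statistics/identification steps.
STEPS = [
    ("preprocess", "example.org/tools/preprocess:1.0",
     ["preprocess", "--in", "/data/raw", "--out", "/data/features"]),
    ("statistics", "example.org/tools/statistics:1.0",
     ["statistics", "--in", "/data/features", "--out", "/data/stats"]),
    ("identification", "example.org/tools/identify:1.0",
     ["identify", "--in", "/data/stats", "--out", "/data/ids"]),
]


def build_workflow(steps):
    """Build one Job manifest per step, in execution order."""
    return [make_job(name, image, command) for name, image, command in steps]


if __name__ == "__main__":
    # Print the manifests as JSON, which Kubernetes accepts alongside YAML.
    for job in build_workflow(STEPS):
        print(json.dumps(job, indent=2))
```

Because each tool is isolated in its own container and exchanges data only through the shared volume, a step can be swapped or scaled without touching the others, which is the property the abstract attributes to the microservice approach.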
