Fast and Efficient XML Data Access for Next-Generation Mass Spectrometry

Motivation: In mass spectrometry-based proteomics, XML formats such as mzML and mzXML provide an open and standardized way to store and exchange the raw data (spectra and chromatograms) of mass spectrometric experiments. These file formats are being used by a multitude of open-source and cross-platform tools which allow the proteomics community to access algorithms in a vendor-independent fashion and perform transparent and reproducible data analysis. Recent improvements in mass spectrometry instrumentation have increased the data size produced in a single LC-MS/MS measurement and put substantial strain on open-source tools, particularly those that are not equipped to deal with XML data files that reach dozens of gigabytes in size.

Results: Here we present a fast and versatile parsing library for mass spectrometric XML formats available in C++ and Python, based on the mature OpenMS software framework. Our library implements an API for obtaining spectra and chromatograms under memory constraints using random access or sequential access functions, allowing users to process datasets that are much larger than system memory. For fast access to the raw data structures, small XML files can also be completely loaded into memory. In addition, we have improved the parsing speed of the core mzML module by over 4-fold (compared to OpenMS 1.11), making our library suitable for a wide variety of algorithms that need fast access to dozens of gigabytes of raw mass spectrometric data.

Availability: Our C++ and Python implementations are available for the Linux, Mac, and Windows operating systems. All proposed modifications to the OpenMS code have been merged into the OpenMS mainline codebase and are available to the community at https://github.com/OpenMS/OpenMS.
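
The idea behind sequential access under memory constraints can be illustrated with a minimal sketch. This is not the OpenMS or pyOpenMS API; it uses only the Python standard library, and the names `SAMPLE`, `local_name`, `iter_spectra`, and `count_peaks` are hypothetical. The embedded document is a heavily simplified, mzML-like fragment (not schema-valid mzML) that carries only the `id` and `defaultArrayLength` attributes. The sketch visits one spectrum element at a time and releases its contents before moving on, so the full file never needs to be held as a populated tree.

```python
import io
import xml.etree.ElementTree as ET

# Tiny, simplified mzML-like document (hypothetical; not schema-valid mzML).
SAMPLE = b"""<?xml version="1.0"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="demo">
    <spectrumList count="2">
      <spectrum index="0" id="scan=1" defaultArrayLength="3"/>
      <spectrum index="1" id="scan=2" defaultArrayLength="5"/>
    </spectrumList>
  </run>
</mzML>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_spectra(source):
    """Yield (id, number of data points) for each spectrum, one at a time."""
    # iterparse reads the stream incrementally; an "end" event fires once
    # an element has been parsed completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "spectrum":
            yield elem.get("id"), int(elem.get("defaultArrayLength"))
            # Drop the element's attributes and children once the caller is
            # done with it. An empty shell stays attached to the parent.
            elem.clear()


def count_peaks(source):
    """Sum data points over all spectra without keeping them in memory."""
    return sum(n for _id, n in iter_spectra(source))
```

Because `iter_spectra` is a generator, a caller such as `count_peaks` consumes spectra one by one. The same function accepts a file path or an open binary file object in place of the in-memory `io.BytesIO` buffer used for demonstration.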
