Fast, axis-agnostic, dynamically summarized storage and retrieval for mass spectrometry data

Mass spectrometry (MS), a popular technique for elucidating the molecular contents of experimental samples, creates data sets composed of millions of three-dimensional (m/z, retention time, intensity) data points that correspond to the types and quantities of analyzed molecules. Open and commercial MS data formats are arranged by retention time (RT), creating latency when accessing data across multiple m/z values. Existing MS storage and retrieval methods have been developed to overcome the limitations of retention time-based data formats, but they do not provide certain features, such as dynamic summarization and the storage and retrieval of point metadata (for example, signal cluster membership). The absence of these features precludes efficient viewing applications and certain data-processing approaches. This manuscript describes MzTree, a spatial database designed to provide real-time storage and retrieval of dynamically summarized standard and augmented MS data with fast performance in both m/z and RT directions. Performance is reported on real data with comparisons against related published retrieval systems.
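To make the idea of a summarized, axis-agnostic query concrete, the following is a minimal Python sketch and not MzTree's actual implementation. It scans every point linearly instead of using a spatial index, and it assumes that "summarization" means keeping only the most intense points in the requested window; the abstract does not specify the summarization strategy. The names `Point` and `query_summarized` are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical point layout: (m/z, retention time, intensity).
Point = Tuple[float, float, float]


def query_summarized(
    points: Iterable[Point],
    mz_min: float,
    mz_max: float,
    rt_min: float,
    rt_max: float,
    max_points: int,
) -> List[Point]:
    """Return at most max_points points inside the (m/z, RT) window.

    Assumption: the summary keeps the highest-intensity points.
    This is a linear scan, so it only illustrates the query
    semantics, not the performance of a spatial index.
    """
    in_window = (
        p for p in points
        if mz_min <= p[0] <= mz_max and rt_min <= p[1] <= rt_max
    )
    # nlargest returns the selected points sorted by descending intensity.
    return heapq.nlargest(max_points, in_window, key=lambda p: p[2])


if __name__ == "__main__":
    sample = [
        (100.0, 1.0, 5.0),
        (100.5, 2.0, 50.0),
        (200.0, 1.5, 500.0),
        (101.0, 3.0, 20.0),
        (100.2, 10.0, 999.0),
    ]
    print(query_summarized(sample, 100.0, 101.0, 0.0, 5.0, max_points=2))
```

Because the window is bounded on both axes at once, the same call serves a narrow m/z slice over a long RT span or the reverse. A spatial index such as an R-tree would answer the same query without visiting every point.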
