pymzML v2.0: introducing a highly compressed and seekable gzip format

Motivation: In the new release of pymzML (v2.0), we have optimized the speed of this established tool for mass spectrometry data analysis to keep pace with the increasing amounts of data produced in mass spectrometry. To this end, we integrated faster libraries for numerical calculations, improved the data retrieval algorithms and optimized the source code. Importantly, to adapt to rapidly growing file sizes, we developed a generalizable compression scheme for very fast random access and applied this concept to mzML files to retrieve spectral data.

Results: pymzML performs on par with established C programs in terms of processing time. In addition, it offers the versatility of a scripting language while adding unprecedentedly fast random access to compressed files. We designed the compression scheme in such a general way that it can be applied to any field where fast random access to large data blocks in compressed files is desired.

Availability and implementation: pymzML is freely available at https://github.com/pymzML/pymzML under the GPL license. pymzML requires Python 3.4+ and optionally numpy. Documentation is available at http://pymzml.readthedocs.io.
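The general idea of random access into a compressed file can be illustrated with a minimal sketch. It is not the pymzML v2.0 file format or API: it assumes that each data block is written as an independent gzip member and that a separate index maps a block key to its byte offset and length. The function names `write_indexed_gzip` and `read_record`, as well as the placeholder spectrum payloads, are hypothetical and use only the Python standard library.

```python
import gzip
import io


def write_indexed_gzip(records, out):
    """Write each (key, payload) pair as its own gzip member.

    Returns an index {key: (offset, length)} into the compressed stream.
    Hypothetical helper for illustration; not part of pymzML.
    """
    index = {}
    for key, payload in records:
        start = out.tell()
        out.write(gzip.compress(payload))
        index[key] = (start, out.tell() - start)
    return index


def read_record(stream, index, key):
    """Seek to one gzip member and decompress only that member."""
    offset, length = index[key]
    stream.seek(offset)
    return gzip.decompress(stream.read(length))


# Placeholder payloads standing in for spectrum elements of an mzML file.
records = [
    (i, f"<spectrum id='{i}'>...</spectrum>".encode())
    for i in range(1, 1001)
]

buf = io.BytesIO()
index = write_indexed_gzip(records, buf)

# Random access: only the member holding spectrum 500 is decompressed.
spectrum_500 = read_record(buf, index, 500)

# Concatenated members still form a valid gzip stream that can be read whole.
whole = gzip.decompress(buf.getvalue())
```

In this sketch the trade-off is between block size and compression ratio: smaller blocks give finer-grained access but compress less well, because each gzip member starts with an empty dictionary.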
