A common open representation of mass spectrometry data and its application to proteomics research

A broad range of mass spectrometers are used in mass spectrometry (MS)-based proteomics research. Each type of instrument possesses a unique design, data system and performance specifications, resulting in strengths and weaknesses for different types of experiments. Unfortunately, the native binary data formats produced by each type of mass spectrometer also differ and are usually proprietary. The diverse, nontransparent nature of the data structure complicates the integration of new instruments into preexisting infrastructure, impedes the analysis, exchange, comparison and publication of results from different experiments and laboratories, and prevents the bioinformatics community from accessing data sets required for software development. Here, we introduce the 'mzXML' format, an open, generic XML (extensible markup language) representation of MS data. We have also developed an accompanying suite of supporting programs. We expect that this format will facilitate data management, interpretation and dissemination in proteomics research.

[1]  Jason E. Stewart,et al.  Design and implementation of microarray gene expression markup language (MAGE-ML) , 2002, Genome Biology.

[2]  R. Aebersold,et al.  Proteomics: the first decade and beyond , 2003, Nature Genetics.

[3]  Patrick G. A. Pedrioli,et al.  A tool to visualize and evaluate data obtained by liquid chromatography-electrospray ionization-mass spectrometry. , 2004, Analytical chemistry.

[4]  S. Gygi,et al.  Quantitative analysis of complex protein mixtures using isotope-coded affinity tags , 1999, Nature Biotechnology.

[5]  D. Goodlett,et al.  Shotgun collision‐induced dissociation of peptides using a time of flight mass analyzer , 2003, Proteomics.

[6]  J. Yates,et al.  An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database , 1994, Journal of the American Society for Mass Spectrometry.

[7]  Guanghui Wang,et al.  Global protein identification and quantification technology using two-dimensional liquid chromatography nanospray mass spectrometry. , 2003, Analytical chemistry.

[8]  M. Baldwin Protein Identification by Mass Spectrometry , 2004, Molecular & Cellular Proteomics.

[9]  Alexey I Nesvizhskii,et al.  Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. , 2002, Analytical chemistry.

[10]  R. Aebersold,et al.  Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry , 2001, Nature Biotechnology.

[11]  Rong Wang,et al.  The need for a public proteomics repository , 2004, Nature Biotechnology.

[12]  D. N. Perkins,et al.  Probability‐based protein identification by searching sequence databases using mass spectrometry data , 1999, Electrophoresis.

[13]  R. Aebersold,et al.  A statistical model for identifying proteins by tandem mass spectrometry. , 2003, Analytical chemistry.

[14]  R. Aebersold,et al.  ProbID: A probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data , 2002, Proteomics.

[15]  P. Shannon,et al.  Cytoscape: a software environment for integrated models of biomolecular interaction networks. , 2003, Genome research.

[16]  R. Aebersold,et al.  Automated statistical analysis of protein abundance ratios from data generated by stable-isotope dilution and tandem mass spectrometry. , 2003, Analytical chemistry.