Mass spectrometry searches using MASST

To the Editor: We introduce a web-enabled mass spectrometry (MS) search engine, named Mass Spectrometry Search Tool (MASST; https://masst.ucsd.edu). By enabling searches of all small-molecule tandem MS (MS/MS) data in public metabolomics repositories, we posit that MASST will unlock these resources for clinical, environmental and natural product applications.

Introduced in 1990, the Basic Local Alignment Search Tool (BLAST), a tool for discovering related protein or gene sequences, enabled researchers to query entire public sequence data repositories through a web interface (WebBLAST; https://blast.ncbi.nlm.nih.gov/Blast.cgi)1. WebBLAST is one of the most widely cited and used bioinformatics tools because it permits any researcher to answer simple questions, such as ‘is a protein or DNA sequence common or rare?’. In the early days of public gene and protein databases, metadata, which include descriptions of sample, population or technical details, were limited. No deposition standards existed, except for the Short Read Archive and European Nucleotide Archive, which include experimental details for sequencing, instrumental details and sample description, such as the source of a sample. The current status of much of the MS data in the public domain is reminiscent of the DNA databanks of the 1990s.

To increase usage and unlock the potential of openly available MS resources, we set out to build an infrastructure that enables a WebBLAST-like search for MS. Algorithms developed for MS data, including molecular networking2 and fragmentation trees3, enable similarity searches against reference libraries of known molecules, whereas powerful metabolomics analysis software infrastructures, such as MS-DIAL4, MetaboAnalyst5, XCMS Online6 and HMDB7, focus on annotation of MS/MS spectra or on finding statistical relationships between molecular features.
However, none of the existing tools enables searching a single MS/MS spectrum against public repository data, including data on unknown molecules, for identical or analogous MS/MS spectra. Consequently, it is not currently possible to find specific MS/MS spectra of interest, including unannotated spectra or structural analogs, in public repositories of metabolomics and natural product MS data.

Deposition of untargeted MS data in the public domain is experiencing rapid growth. In March 2017, 910 metabolomics datasets were available8; by January 2019, there were >2,000 downloadable metabolomics datasets (about half of these datasets contain MS/MS data)9. Despite the availability of metabolomics and natural product data, including environmental and clinical MS datasets, public small-molecule MS data are hardly reused10. Now that a large number of small-molecule untargeted MS datasets are publicly available (~1,100 untargeted datasets and ~110,000,000 spectra in ~150,000 files as of December 11, 2018), we felt that the time was right to develop MASST, to enable reuse of these MS data.

MASST comprises a web-based system to search the public data repository part of the GNPS/MassIVE knowledge base11 and an analysis infrastructure for a single MS/MS spectrum. The developments required for MASST searches included converting deposited public data to a uniform open format12 (irrespective of instrument type and original data format), the ability to trace the file from which each MS/MS spectrum originated, and a reporting system that shows all identical or similar MS/MS spectra found in public data along with their associated metadata.
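To make the idea of a single-spectrum search concrete, the following is a minimal sketch, not the MASST implementation. It assumes a greedy cosine score over fragment peaks, a small in-memory collection, and illustrative tolerances and thresholds. The names `Spectrum`, `cosine_score` and `search`, the default parameter values and the dataset labels are all hypothetical. Each stored spectrum carries its dataset and file of origin, so every hit can be traced back to its source.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


@dataclass
class Spectrum:
    dataset: str        # hypothetical dataset label
    filename: str       # file the spectrum originated from
    precursor_mz: float
    peaks: List[Peak]


def cosine_score(a: List[Peak], b: List[Peak], tol: float = 0.02) -> Tuple[float, int]:
    """Greedy cosine similarity; returns (score, number of matched peaks)."""
    norm_a = sqrt(sum(i * i for _, i in a))
    norm_b = sqrt(sum(i * i for _, i in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0
    # All peak pairs within the fragment tolerance, largest intensity product first.
    pairs = sorted(
        ((ia * ib, x, y)
         for x, (ma, ia) in enumerate(a)
         for y, (mb, ib) in enumerate(b)
         if abs(ma - mb) <= tol),
        reverse=True,
    )
    used_a, used_b = set(), set()
    total = 0.0
    for product, x, y in pairs:
        if x in used_a or y in used_b:
            continue  # each peak may be matched only once
        used_a.add(x)
        used_b.add(y)
        total += product
    return total / (norm_a * norm_b), len(used_a)


def search(query: Spectrum, collection: List[Spectrum],
           precursor_tol: float = 2.0, min_score: float = 0.7,
           min_matched: int = 2) -> List[Dict]:
    """Return hits sorted by score, each with its dataset and file of origin."""
    hits = []
    for s in collection:
        if abs(s.precursor_mz - query.precursor_mz) > precursor_tol:
            continue
        score, matched = cosine_score(query.peaks, s.peaks)
        if score >= min_score and matched >= min_matched:
            hits.append({"score": score, "matched_peaks": matched,
                         "dataset": s.dataset, "filename": s.filename})
    return sorted(hits, key=lambda h: h["score"], reverse=True)
```

This sketch only compares spectra with similar precursor masses. A search for structural analogs would also need to tolerate a precursor mass difference, which is not shown here.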
MASST development has been possible for two main reasons: first, adoption of universal, non-vendor-specific MS data formats has increased, which means that multiple publicly available datasets have been converted to the same data format13, and second, the ability to connect all public data in GNPS/MassIVE and to link each MS/MS spectrum to its metadata entries has only recently been developed.

A MASST report also includes matches to any reference spectra in public MS/MS spectral libraries, if the matches are within the user-specified search parameters. Libraries include GNPS user-contributed spectra11, GNPS libraries11, all three MassBanks14 (https://massbank.eu/MassBank/, https://mona.fiehnlab.ucdavis.edu/), ReSpect15, MIADB/Beniddir16, Sumner/Bruker, CASMI17, PNNL lipids18, Sirenas/Gates, EMBL MCF and several other libraries, listed at https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp. Visualization of the MASST matches uses a mirror view (Fig. 1).

MASST can search against various repositories, including GNPS/MassIVE11, Metabolomics Workbench19, MetaboLights20 or the non-redundant (nr) MS/MS library of all unique MS/MS spectra from all three repositories combined. MASST searching using multiple repositories was enabled by converting data uploaded to the Metabolomics Workbench and MetaboLights repositories to the same open MS format in the GNPS/MassIVE data storage environment. Instructions on how to upload to GNPS/MassIVE can be found at https://ccms-ucsd.github.io/GNPSDocumentation/datasets/. All public data in GNPS/MassIVE become MASST-searchable. MASST searches output results according to user-defined search parameters. The report returns the origin of the matched MS/MS spectrum with respect to the dataset and file information and any metadata associated with the file (Fig. 1).
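As an illustration of how such a report could be assembled, the sketch below filters a list of hits by a user-defined score threshold and attaches any file-level metadata to each remaining hit. It is a simplified assumption about the data flow, not the MASST reporting code. The functions `build_report` and `summarize`, the dictionary keys, the metadata field `sample_type` and the dataset labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

FileKey = Tuple[str, str]  # (dataset label, file name)


def build_report(hits: List[Dict], file_metadata: Dict[FileKey, Dict],
                 min_score: float = 0.7) -> List[Dict]:
    """Keep hits at or above min_score and attach file-level metadata, if any."""
    rows = []
    for hit in hits:
        if hit["score"] < min_score:
            continue
        key = (hit["dataset"], hit["filename"])
        # Files without curated metadata get an empty record rather than being dropped.
        rows.append({**hit, "metadata": file_metadata.get(key, {})})
    rows.sort(key=lambda r: r["score"], reverse=True)
    return rows


def summarize(rows: List[Dict], field: str) -> Counter:
    """Count report rows per value of one metadata field."""
    return Counter(r["metadata"].get(field, "unknown") for r in rows)


if __name__ == "__main__":
    hits = [
        {"score": 0.90, "dataset": "DATASET_A", "filename": "a.mzML"},
        {"score": 0.75, "dataset": "DATASET_B", "filename": "b.mzML"},
        {"score": 0.50, "dataset": "DATASET_C", "filename": "c.mzML"},
    ]
    metadata = {("DATASET_A", "a.mzML"): {"sample_type": "feces"}}
    report = build_report(hits, metadata)
    for row in report:
        print(row["dataset"], row["filename"], row["score"], row["metadata"])
    print(summarize(report, "sample_type"))
```

Keeping hits that lack metadata, and counting them as "unknown", reflects the situation described in the text, in which many public files have little or no associated sample information.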
Datasets and files can be tagged with sample or spectral information by the community of MASST users, and this information then becomes part of the metadata reported back in future MASST searches. We also curated ~34,000 additional MS files with ~340,000 tags, mostly from human-associated samples, but also from microbes, food and indoor and outdoor environments, to provide a good foundation for MASST searches.

Metadata can be associated with MS/MS spectra in the GNPS/MassIVE upload portal at the dataset level, file level or single annotated spectrum level. Examples of metadata include instrument type, phylogeny (according to the National Center for Biotechnology Information (NCBI) taxonomy) and keywords at the dataset level; phylogeny, sample type, age, sex, body site (defined using the Uberon anatomy ontology21) and disease22 at the file level; and source, biological activity and structural class information at the single annotated spectrum level. In addition, GNPS/MassIVE is compatible with metadata formats from other software tools (e.g., QIIME 2 and Qiita), which are used to analyze microbiome data and have a controlled vocabulary that can be imported23,24. Any sample information uploaded to GNPS/MassIVE from another repository (e.g., from MetaboLights and Metabolomics Workbench) is also included in the MASST report.

At present, only limited metadata are available at the dataset and file level, but the metadata in the public domain can provide insights into the types of MS/MS signals being analyzed (Box 1 contains examples of usage). Although the amount and quality of metadata are increasing25, datasets do not always have detailed metadata. To alleviate this problem, re-annotation of metadata as knowledge increases, while retaining provenance of all changes, is possible in

[1]  Kristian Fog Nielsen,et al.  Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking , 2016, Nature Biotechnology.

[2]  Christoph Steinbeck,et al.  MetaboLights—an open-access general-purpose repository for metabolomics studies and associated meta-data , 2012, Nucleic Acids Res..

[3]  Mingxun Wang,et al.  Qiita: rapid, web-enabled microbiome meta-analysis , 2018, Nature Methods.

[4]  David S. Wishart,et al.  HMDB 4.0: the human metabolome database for 2018 , 2017, Nucleic Acids Res..

[5]  Nuno Bandeira,et al.  Mass spectral molecular networking of living microbial colonies , 2012, Proceedings of the National Academy of Sciences.

[6]  Florian Rasche,et al.  Computing fragmentation trees from tandem mass spectrometry data. , 2011, Analytical chemistry.

[7]  Matej Oresic,et al.  Data standards can boost metabolomics research, and if there is a will, there is a way , 2015, Metabolomics.

[8]  Yutaka Yamada,et al.  RIKEN tandem mass spectral database (ReSpect) for phytochemicals: a plant-specific MS/MS-based data resource and database. , 2012, Phytochemistry.

[9]  Eoin Fahy,et al.  Metabolomics Workbench: An international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools , 2015, Nucleic Acids Res..

[10]  Masanori Arita,et al.  Identifying epimetabolites by integrating metabolome databases with mass spectrometry cheminformatics , 2017, Nature Methods.

[11]  Robert Burke,et al.  ProteoWizard: open source software for rapid proteomics tools development , 2008, Bioinform..

[12]  R. Knight,et al.  Global chemical analysis of biology by mass spectrometry , 2017.

[13]  Francesco Asnicar,et al.  Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2 , 2019, Nature Biotechnology.

[14]  Robert Petryszak,et al.  Discovering and linking public omics data sets using the Omics Discovery Index , 2017, Nature Biotechnology.

[15]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[16]  Emma L. Schymanski,et al.  The Critical Assessment of Small Molecule Identification (CASMI): Challenges and Solutions , 2013, Metabolites.

[17]  Pierre Champy,et al.  Theionbrunonines A and B: Dimeric Vobasine Alkaloids Tethered by a Thioether Bridge from Mostuea brunonis. , 2018, Organic letters.

[18]  Judith A J Steen,et al.  MGFp: an open Mascot Generic Format parser library implementation. , 2010, Journal of proteome research.

[19]  M. Hirai,et al.  MassBank: a public repository for sharing mass spectral data for life sciences. , 2010, Journal of mass spectrometry : JMS.

[20]  Nigel W. Hardy,et al.  Proposed minimum reporting standards for chemical analysis , 2007, Metabolomics.

[21]  Ngoc Hung Nguyen,et al.  Repository-scale Co- and Re-analysis of Tandem Mass Spectrometry Data , 2019, bioRxiv.

[22]  David S. Wishart,et al.  MetaboAnalyst 4.0: towards more transparent and integrative metabolomics analysis , 2018, Nucleic Acids Res..

[23]  Thomas O. Metz,et al.  LIQUID: an‐open source software for identifying lipids in LC‐MS/MS‐based lipidomics data , 2017, Bioinform..

[24]  G. Siuzdak,et al.  XCMS Online: a web-based platform to process untargeted metabolomic data. , 2012, Analytical chemistry.

[25]  Michelle Giglio,et al.  Human Disease Ontology 2018 update: classification, content and workflow expansion , 2018, Nucleic Acids Res..

[26]  S. Lewis,et al.  Uberon, an integrative multi-species anatomy ontology , 2012, Genome Biology.