Crowd-Sourced Chemistry: Considerations for Building a Standardized Database to Improve Omic Analyses

Mass spectrometry (MS) is used across multiple omics disciplines to generate large collections of data. This data advances biomedical research by providing global profiles of a given system. One of the main barriers to generating these profiles is the inability to accurately annotate omics data, especially for small molecules. Because existing large databases remain incomplete, research groups devote considerable effort to building personal libraries to annotate their own data. These personal libraries impede scientific progress because the data they contain is often redundant with, and/or incompatible with, other databases. To overcome these redundancies and incompatibilities, we propose that communal, crowd-sourced databases be curated in a standardized fashion. A small number of groups have shown that this model is feasible and successful. While the needs of a specific field will dictate the functionality of a communal database, we discuss some features to consider during database development. Special emphasis is placed on the standardization of terminology, documentation, format, reference materials, and quality assurance practices. These standardization procedures give a field higher confidence in the quality of the data within a given database. We also discuss the three conceptual pillars of database design, as well as how crowd-sourcing is practiced. Generating open-source databases requires up-front effort, but the result is a well-curated, high-quality data set that all can use. Such a resource fosters collaboration and scientific advancement.
