Data model, dictionaries, and desiderata for biomolecular simulation data indexing and sharing

Background: Few environments have been developed or deployed to widely share biomolecular simulation data or to enable collaborative networks to facilitate data exploration and reuse. As the amount and complexity of data generated by these simulations are increasing dramatically and the methods are being more widely applied, the need for new tools to manage and share these data has become obvious. In this paper we present the results of a process aimed at assessing the needs of the community for data representation standards to guide the implementation of future repositories for biomolecular simulations.

Results: We introduce a list of common data elements, inspired by previous work, and updated according to feedback from the community collected through a survey and personal interviews. These data elements integrate the concepts for multiple types of computational methods, including quantum chemistry and molecular dynamics. The identified core data elements were organized into a logical model to guide the design of new databases and application programming interfaces. Finally, a set of dictionaries was implemented to be used via SQL queries or locally via a Java API built upon the Apache Lucene text-search engine.

Conclusions: The model and its associated dictionaries provide a simple yet rich representation of the concepts related to biomolecular simulations, which should guide future developments of repositories and more complex terminologies and ontologies. The model remains extensible through the decomposition of virtual experiments into tasks and parameter sets, and via the use of extended attributes. The benefits of a common logical model for biomolecular simulations were illustrated through various use cases, including data storage, indexing, and presentation. All the models and dictionaries introduced in this paper are available for download at http://ibiomes.chpc.utah.edu/mediawiki/index.php/Downloads.
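
The extensibility mechanism described above (virtual experiments decomposed into tasks and parameter sets, plus extended attributes) might be pictured with the following minimal sketch. It is an illustration only and not the published model: the class names, field names, and example values are all assumptions chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ParameterSet:
    """Named group of method-specific settings (hypothetical structure)."""
    name: str
    values: Dict[str, str] = field(default_factory=dict)


@dataclass
class Task:
    """One step of a virtual experiment, e.g. a minimization or an MD run."""
    method: str
    parameter_sets: List[ParameterSet] = field(default_factory=list)


@dataclass
class Experiment:
    """A virtual experiment decomposed into an ordered list of tasks."""
    title: str
    tasks: List[Task] = field(default_factory=list)
    # Extended attributes hold anything the core elements do not cover.
    extended_attributes: Dict[str, str] = field(default_factory=dict)

    def methods(self) -> List[str]:
        """Return the distinct methods used, in order of first appearance."""
        seen: List[str] = []
        for task in self.tasks:
            if task.method not in seen:
                seen.append(task.method)
        return seen


def build_example() -> Experiment:
    """Build an illustrative experiment; every value here is made up."""
    qm = Task("quantum chemistry",
              [ParameterSet("basis", {"basis_set": "6-31G*"})])
    md = Task("molecular dynamics",
              [ParameterSet("thermostat", {"temperature_K": "300"})])
    return Experiment(
        title="example experiment",
        tasks=[qm, md, Task("molecular dynamics")],
        extended_attributes={"lab_notebook_id": "hypothetical-001"},
    )
```

Under these assumptions, a new method or an unanticipated descriptor is accommodated by adding a task, a parameter set, or an extended attribute rather than by changing the core classes.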
