MEMOPS: Data modelling and automatic code generation

Summary: In recent years the amount of biological data has exploded to the point where much useful information can only be extracted by complex computational analyses. Such analyses are greatly facilitated by metadata standards, both for comparing data that originate from different sources and for exchanging data in standard forms, e.g. when running processes on a distributed computing infrastructure. However, standards thrive on stability, whereas science is constantly moving, with new methods being developed and old ones modified. Maintaining both the metadata standards and all the code required to make them useful is therefore a non-trivial problem. Memops is a framework that uses an abstract definition of the metadata (described in UML) to generate internal data structures and subroutine libraries for data access (application programming interfaces, or APIs, currently in Python, C and Java) and data storage (in XML files or databases). For the individual project, these libraries obviate the need to write code for input parsing, validity checking or output. Memops also ensures that the code is always internally consistent, greatly reducing the need for code reorganisation. Across a scientific domain, a Memops-supported data model makes it easier to support complex standards that can capture all the data produced in a scientific area, share them among all programs in a complex software pipeline, and carry them forward to deposition in an archive. The principles behind the Memops generation code are presented, along with example applications in Nuclear Magnetic Resonance (NMR) spectroscopy and structural biology.
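The general idea of generating a validating data-access API and XML output from an abstract model can be illustrated with a minimal Python sketch. This is not the Memops generator or its API: the names ClassSpec, AttributeSpec, generate_class_source and build_class, and the example Atom class, are hypothetical and chosen only for illustration, and a small in-code description stands in for a UML model.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class AttributeSpec:
    """Hypothetical abstract description of one attribute."""
    name: str
    type_name: str          # name of a Python builtin type, e.g. "str"
    required: bool = True


@dataclass(frozen=True)
class ClassSpec:
    """Hypothetical abstract description of one model class."""
    name: str
    attributes: tuple


def generate_class_source(spec):
    """Return Python source for a class with validating setters and XML output."""
    names = [a.name for a in spec.attributes]
    out = [
        f"class {spec.name}:",
        f"    _attribute_names = {names!r}",
        "    def __init__(self, **values):",
        "        unknown = sorted(set(values) - set(self._attribute_names))",
        "        if unknown:",
        "            raise AttributeError('unknown attributes: %s' % unknown)",
        "        for name in self._attribute_names:",
        "            setattr(self, name, values.get(name))",
    ]
    for a in spec.attributes:
        out += [
            "    @property",
            f"    def {a.name}(self):",
            f"        return self._{a.name}",
            f"    @{a.name}.setter",
            f"    def {a.name}(self, value):",
            "        if value is None:",
        ]
        if a.required:
            out.append(f"            raise ValueError('{spec.name}.{a.name} is required')")
        else:
            out.append("            pass")
        out += [
            f"        elif not isinstance(value, {a.type_name}):",
            f"            raise TypeError('{spec.name}.{a.name} must be {a.type_name}')",
            f"        self._{a.name} = value",
        ]
    out += [
        "    def to_xml(self):",
        f"        element = ET.Element({spec.name!r})",
        "        for name in self._attribute_names:",
        "            value = getattr(self, name)",
        "            if value is not None:",
        "                element.set(name, str(value))",
        "        return ET.tostring(element, encoding='unicode')",
    ]
    return "\n".join(out) + "\n"


def build_class(spec):
    """Execute the generated source and return the resulting class."""
    namespace = {"ET": ET}
    exec(generate_class_source(spec), namespace)
    return namespace[spec.name]


# Illustrative model: an atom with an optional chemical shift.
ATOM_SPEC = ClassSpec("Atom", (
    AttributeSpec("name", "str"),
    AttributeSpec("serial", "int"),
    AttributeSpec("shift", "float", required=False),
))
Atom = build_class(ATOM_SPEC)
atom = Atom(name="CA", serial=1)
xml_text = atom.to_xml()
```

In this sketch, changing the model means editing only ATOM_SPEC; the accessors, validity checks and XML output are regenerated together and so cannot drift out of step with one another.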
