Standards-based curation of a decade-old digital repository dataset of molecular information

Background: We report the curation of 158,122 molecular geometries derived from the NCI set of reference molecules, together with associated properties computed using the MOPAC semi-empirical quantum mechanical method. The data were originally deposited in 2005 into the Cambridge DSpace repository as a data collection.

Results: The curation procedures included annotating the original data using new MOPAC methods, updating the syntax of the CML documents used to express the data to ensure schema conformance, and adding new metadata describing the entries, together with an XML schema transformation that maps the metadata schema to that used by the DataCite organisation. We adopted a granularity model in which a DataCite persistent identifier (DOI) is created for each individual molecule, enabling data discovery and data metrics at this level using DataCite tools.

Conclusions: We recommend that the future research data management (RDM) of the scientific and chemical data components associated with journal articles (the "supporting information") should be conducted in a manner that facilitates automatic periodic curation.

Graphical abstract: Standards and metadata-based curation of a decade-old digital repository dataset of molecular information.
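The metadata mapping described in the Results can be illustrated with a minimal sketch. It is not the authors' transformation: the abstract describes an XML schema transformation, whereas this example builds a DataCite-style record directly in Python. The input dictionary, its keys, the function name `to_datacite`, the DOI and all field values are hypothetical placeholders. The kernel-4 namespace is also an assumption, since the abstract does not state which DataCite schema version was used.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of a recent DataCite metadata kernel; the schema
# version used in the original work is not stated in the abstract.
DATACITE_NS = "http://datacite.org/schema/kernel-4"


def to_datacite(entry: dict) -> ET.Element:
    """Map a hypothetical per-molecule metadata dict to a DataCite-style record."""
    ET.register_namespace("", DATACITE_NS)

    def q(tag: str) -> str:
        # Qualify a tag name with the DataCite namespace.
        return f"{{{DATACITE_NS}}}{tag}"

    resource = ET.Element(q("resource"))

    # One persistent identifier per molecule (the granularity model).
    identifier = ET.SubElement(resource, q("identifier"), identifierType="DOI")
    identifier.text = entry["doi"]

    creators = ET.SubElement(resource, q("creators"))
    for name in entry["creators"]:
        creator = ET.SubElement(creators, q("creator"))
        ET.SubElement(creator, q("creatorName")).text = name

    titles = ET.SubElement(resource, q("titles"))
    ET.SubElement(titles, q("title")).text = entry["title"]

    ET.SubElement(resource, q("publisher")).text = entry["publisher"]
    ET.SubElement(resource, q("publicationYear")).text = str(entry["year"])

    resource_type = ET.SubElement(
        resource, q("resourceType"), resourceTypeGeneral="Dataset"
    )
    resource_type.text = entry["resource_type"]
    return resource


# Placeholder values for illustration only; none come from the dataset.
example = {
    "doi": "10.5072/example-molecule-0001",
    "creators": ["Example, Author"],
    "title": "Example molecule entry",
    "publisher": "Example Repository",
    "year": 2015,
    "resource_type": "Molecular geometry",
}

record = to_datacite(example)
xml_text = ET.tostring(record, encoding="unicode")
```

Because each molecule yields its own record, discovery and metrics tools that operate on DataCite records can address entries individually rather than the collection as a whole.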
