Machines first, humans second: on the importance of algorithmic interpretation of open chemistry data

Abstract

The current rise in the use of open lab notebook techniques means that an increasing number of scientists make chemical information freely and openly available to the entire community as a series of micropublications, released shortly after the conclusion of each experiment. We propose that this trend be accompanied by a thorough examination of data sharing priorities. We argue that the most significant immediate beneficiaries of open data are in fact chemical algorithms, which can absorb vast quantities of data and use them to present concise insights to working chemists on a scale that traditional publication methods could not achieve. Making this goal practically achievable will require a paradigm shift in the way individual scientists translate their data into digital form, since most contemporary methods of data entry are designed for presentation to humans rather than consumption by machine learning algorithms. We discuss some of the complex issues involved in fixing current methods, as well as some of the immediate benefits that can be gained when open data is published correctly using unambiguous machine-readable formats.

Graphical Abstract

Lab notebook entries must target both visualisation by scientists and use by machine learning algorithms.
