Chemical information matters: an e-Research perspective on information and data sharing in the chemical sciences.

A number of organisations have recently called for open access to scientific information, and especially to the data obtained from publicly funded research; the Royal Society report and the European Commission press release are particularly notable examples. It has long been accepted that building research on the foundations laid by other scientists is both effective and efficient. Regrettably, some disciplines, chemistry being one, have been slow to recognise the value of sharing and have thus been reluctant to curate their data and information in preparation for exchange. The very significant increases in both the volume and the complexity of the datasets produced have encouraged the expansion of e-Research and stimulated the development of methodologies for managing, organising, and analysing "big data". We review the evolution of cheminformatics, the amalgam of chemistry, computer science, and information technology, and assess the wider e-Science and e-Research perspective. Chemical information does matter, as do matters of communicating data and collaborating with data. For chemistry, unique identifiers, structure representations, and property descriptors are essential to the activities of sharing and exchange. Open science entails the sharing of more than mere facts: for example, the publication of negative outcomes can facilitate better understanding of which synthetic routes to choose, an aspiration of the Dial-a-Molecule Grand Challenge. The protagonists of open notebook science go even further and exchange their thoughts and plans. We consider the concepts of preservation, curation, provenance, discovery, and access in the context of the research lifecycle, and then focus on the role of metadata, particularly the ontologies on which the emerging chemical Semantic Web will depend. Among our conclusions, we present our choice of the "grand challenges" for the preservation and sharing of chemical information.
