A Survey on Big Data and Machine Learning for Chemistry

Herein we review aspects of leading-edge research and innovation in chemistry that exploit big data and machine learning (ML), two computer science fields that combine to yield machine intelligence. ML can accelerate the solution of intricate chemical problems and even solve problems that would otherwise be intractable. However, the potential benefits of ML come at the cost of big data production: in order to learn, the algorithms demand large volumes of data of various kinds and from different sources, ranging from materials properties to sensor data. In this survey, we propose a roadmap for future developments, with emphasis on materials discovery and on chemical sensing within the context of the Internet of Things (IoT), both prominent research fields for ML in the context of big data. In addition to providing an overview of recent advances, we elaborate upon the conceptual and practical limitations of big data and ML applied to chemistry, outlining processes, discussing pitfalls, and reviewing cases of success and failure.
