Building and deploying a cyberinfrastructure for the data-driven design of chemical systems and the exploration of chemical space

Abstract The use of modern data science has recently emerged as a promising new path to tackling the complex challenges involved in the creation of next-generation chemistry and materials. However, despite the appeal of this potentially transformative development, the chemistry community has yet to incorporate it as a central tool in everyday work. Our research program is designed to enable and advance this emerging research approach. It is centred around the creation of a software ecosystem that brings together physics-based modelling, high-throughput in silico screening and data analytics (i.e. the use of machine learning and informatics for the validation, mining and modelling of chemical data). This cyberinfrastructure is devised to offer a comprehensive set of data science techniques and tools, and its general-purpose scope is meant to make it as versatile and widely applicable as possible. It also emphasises user-friendliness to make it accessible to the community at large. It thus provides the means for the large-scale exploration of chemical space and for a better understanding of the hidden mechanisms that determine the properties of complex chemical systems. Such insights can dramatically accelerate, streamline and ultimately transform the way chemical research is conducted. Aside from serving as a production-level tool, our cyberinfrastructure is also designed to facilitate and assess methodological innovation. Both the software and the method development work are driven by concrete molecular design problems, which also allow us to assess the efficacy of the overall cyberinfrastructure.
