Data-Driven Strategies for Accelerated Materials Design

Conspectus

The ongoing transformation of the natural sciences by machine learning and artificial intelligence has sparked significant interest in the materials science community in recent years. The intrinsically high dimensionality of the space of realizable materials makes traditional approaches ineffective for large-scale exploration. Modern data science and machine learning tools, developed for increasingly complicated problems, are an attractive alternative. An imminent climate catastrophe calls for a clean-energy transformation that overhauls current technologies, with only a few years left in which to act. Tackling this crisis requires the development of new materials at an unprecedented pace and scale. For example, organic photovoltaics have the potential to replace existing silicon-based materials to a large extent and to open up new fields of application. In recent years, organic light-emitting diodes have emerged as the state-of-the-art technology for digital screens and portable devices, and they are enabling new applications with flexible displays. Reticular frameworks allow the atom-precise synthesis of nanomaterials and promise to revolutionize the field through their potential to realize multifunctional nanoparticles, with applications ranging from gas storage, gas separation, and electrochemical energy storage to nanomedicine. Over the past decade, significant advances in all these fields have been facilitated by the comprehensive application of simulation and machine learning to property prediction, property optimization, and chemical space exploration, enabled by considerable advances in computing power and algorithmic efficiency. In this Account, we review the most recent contributions of our group to this thriving field of machine learning for materials science. We start with a summary of the most important material classes our group has been involved with, focusing on small molecules as organic electronic materials and on crystalline materials.
Specifically, we highlight the data-driven approaches we employed to speed up discovery and derive materials design strategies. We then turn to the data-driven methodologies our group has developed and employed, elaborating on high-throughput virtual screening, inverse molecular design, Bayesian optimization, and supervised learning. We discuss the general ideas behind these methods, their working principles, and their use cases, with examples of successful implementations in data-driven materials discovery and design efforts. Furthermore, we elaborate on potential pitfalls and remaining challenges of these methods. Finally, we provide a brief outlook for the field, as we foresee increasing adoption and implementation of large-scale data-driven approaches in materials discovery and design campaigns.
