Nanoinformatics, and the big challenges for the science of small things.

The combination of computational chemistry and computational materials science with machine learning and artificial intelligence provides a powerful way of relating the structural features of nanomaterials to their functional properties. However, combining these fundamentally different scientific approaches is not as straightforward as it seems. Machine learning methods were developed for large data sets with small numbers of consistent features. Nanomaterials data sets, by contrast, are typically small, with high dimensionality and high variance in the feature space, and they suffer from numerous destructive biases. None of the established data science or machine learning methods in widespread use today were devised with (nano)materials data sets in mind, but there are ways to overcome these challenges and use these methods reliably. In this review we discuss domain-specific constraints on data-driven nanomaterials design, and explore the differences between nanomaterials simulation and nanoinformatics that can be leveraged for greater impact.
