Independent Vector Analysis for Data Fusion Prior to Molecular Property Prediction with Machine Learning

The prediction of molecular properties with machine learning has received great attention in materials design and drug discovery because of its high computational speed and accuracy relative to ab initio quantum chemistry and force-field modeling. A main ingredient required for machine learning is a training dataset of molecular features (for example, fingerprint bits or chemical descriptors) that adequately characterize the corresponding molecules. However, choosing features for any given application is highly non-trivial, and no "universal" method for feature selection exists. In this work, we propose a data fusion framework that uses Independent Vector Analysis (IVA) to exploit the underlying complementary information contained in different molecular featurization methods, bringing us a step closer to automated feature generation. Our approach takes an arbitrary number of individual feature vectors and automatically generates a single, compact (low-dimensional) set of molecular features that can be used to enhance the prediction performance of regression models, while retaining the possibility of interpreting the generated features to discover relationships between molecular structures and properties. We demonstrate this on the QM7b dataset for the prediction of several properties, including atomization energy, polarizability, frontier orbital eigenvalues, ionization potential, electron affinity, and excitation energies. In addition, we show how our method helps improve the prediction of experimental binding affinities for a set of human BACE-1 inhibitors. To encourage more widespread use of IVA, we have developed the PyIVA Python package, an open-source code available for download on GitHub.
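As a rough illustration of the fusion workflow described above, the sketch below combines two placeholder featurizations (a fingerprint block and a descriptor block) into a compact latent representation and feeds it to a ridge regressor. Everything in it is an assumption for illustration: the data are random, the function names are hypothetical, and scikit-learn's FastICA applied to the stacked, PCA-reduced blocks is used only as a stand-in for the joint IVA decomposition that the paper's PyIVA package performs across featurizations; it is not the authors' implementation.

```python
# Minimal sketch of a multi-featurization fusion pipeline (not the paper's code).
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler


def fuse_featurizations(feature_sets, n_components=10, random_state=0):
    """Reduce each featurization to a common dimension, then extract a
    compact set of fused latent features shared across them."""
    reduced = []
    for X in feature_sets:
        X = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
        reduced.append(
            PCA(n_components=n_components, random_state=random_state).fit_transform(X)
        )
    stacked = np.hstack(reduced)  # shape: (n_samples, K * n_components)
    # Stand-in for IVA: a single ICA on the stacked representations.
    # True IVA decomposes the K datasets jointly, exploiting the dependence
    # between corresponding sources across featurizations.
    return FastICA(n_components=n_components, random_state=random_state).fit_transform(stacked)


# Toy usage with random "fingerprint" and "descriptor" blocks and a random target.
rng = np.random.default_rng(0)
n_molecules = 200
fingerprints = rng.integers(0, 2, size=(n_molecules, 1024)).astype(float)
descriptors = rng.normal(size=(n_molecules, 50))
y = rng.normal(size=n_molecules)  # e.g. an atomization-energy-like target

Z = fuse_featurizations([fingerprints, descriptors], n_components=10)
scores = cross_val_score(Ridge(alpha=1.0), Z, y, cv=5, scoring="r2")
print("CV R^2:", scores.mean())
```

In the actual framework, the fused components would replace or augment the raw featurizations as inputs to the property-prediction regressor, and their mixing coefficients can be inspected to relate latent features back to molecular structure.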
