Impact of distance-based metric learning on classification and visualization model performance and structure–activity landscapes

This study examines the large margin nearest neighbor (LMNN) classifier and its multi-metric extension as efficient approaches to metric learning, which aims to learn an appropriate distance/similarity function for the case studies considered. In recent years, many studies in data mining and pattern recognition have demonstrated that a learned metric can significantly improve performance in classification, clustering and retrieval tasks. The paper describes the application of the metric learning approach to in silico assessment of chemical liabilities. Chemical liabilities, such as adverse effects and toxicity, play a significant role in the drug discovery process. In silico assessment of chemical liabilities is an important step that aims to reduce costs and animal testing by complementing or replacing in vitro and in vivo experiments. Here, to our knowledge for the first time, distance-based metric learning procedures have been applied to in silico assessment of chemical liabilities. The impact of metric learning on structure–activity landscapes and on the predictive performance of the developed models has been analyzed, and the learned metric has been used in support vector machines. The metric learning results are illustrated using linear and non-linear data visualization techniques to show how the change of metric affects nearest-neighbor relations and the descriptor space.
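As a rough illustration of how a change of metric can alter nearest-neighbor relations, the following minimal sketch computes distances after applying a linear map L, which corresponds to a Mahalanobis-type metric with M = LᵀL, the general form that LMNN-style methods learn. The toy descriptors, the labels and the matrix `rescaled` are hypothetical and hand-picked for illustration; nothing here is learned, and it is not the procedure used in the paper.

```python
import math

def transformed_distance(x, y, L):
    """Euclidean distance between x and y after applying the linear map L."""
    diff = [a - b for a, b in zip(x, y)]
    projected = [sum(w * d for w, d in zip(row, diff)) for row in L]
    return math.sqrt(sum(p * p for p in projected))

def nearest_neighbor(query_idx, points, L):
    """Index of the point closest to points[query_idx] under the metric induced by L."""
    best_idx, best_dist = None, float("inf")
    for i, p in enumerate(points):
        if i == query_idx:
            continue
        d = transformed_distance(points[query_idx], p, L)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx

# Hypothetical toy descriptors: feature 0 tracks the class, feature 1 is mostly noise.
points = [(0.0, 0.0), (0.2, 3.0), (1.0, 0.1)]
labels = ["active", "active", "inactive"]

identity = [[1.0, 0.0], [0.0, 1.0]]   # plain Euclidean metric
rescaled = [[1.0, 0.0], [0.0, 0.05]]  # hand-picked stand-in for a learned map (assumption)

for name, L in (("Euclidean", identity), ("rescaled", rescaled)):
    nn = nearest_neighbor(0, points, L)
    print(name, "nearest neighbor of point 0:", nn, labels[nn])
```

Under the Euclidean metric the nearest neighbor of the first compound belongs to the other class; after down-weighting the noisy descriptor, its nearest neighbor shares its class. A learned metric is intended to produce this kind of rearrangement from the training data rather than by hand.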
