Feature Extraction Methods in Quantitative Structure–Activity Relationship Modeling: A Comparative Study

Computational approaches for synthesizing new chemical compounds have led to an explosion of chemical data in the field of drug discovery. Quantitative structure–activity relationship (QSAR) modeling is a widely used classification and regression approach for representing the relationship between a chemical structure and its activities. This research focuses on the effect of dimensionality-reduction techniques on a high-dimensional QSAR dataset. Because QSAR data are high-dimensional, dimensionality-reduction techniques have become an integral part of the modeling process. Principal component analysis (PCA) is a feature-extraction technique with several applications in exploratory data analysis, visualization, and dimensionality reduction. However, linear PCA is inadequate for handling the complex structure of QSAR data. Given the wide array of current feature-extraction techniques, we perform a comparative empirical study of five of them: PCA, kernel PCA, deep generalized autoencoder (dGAE), Gaussian random projection (GRP), and sparse random projection (SRP). The experiments are performed on a high-dimensional QSAR dataset comprising 6394 features. The transformed low-dimensional dataset is fed into a deep learning classification model to predict a QSAR biological activity. Three approaches are adopted to validate and measure the proposed techniques: (i) comparing the performance of the classification models, (ii) visualizing the relationship (correlation) between features in the low-dimensional Euclidean space, and (iii) validating the proposed techniques using an external dataset. To the best of our knowledge, this study is the first to investigate and compare these feature-extraction techniques in the context of QSAR modeling. The results provide insights into the behavior of the different techniques on both the negative and positive classes. With linear PCA as a baseline, we show that the other investigated techniques substantially outperform the baseline on multiple accuracy measures and demonstrate useful ways of extracting significant features.
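
To make the two random-projection techniques named above concrete, the following is a minimal sketch and not the implementation used in this study. It assumes a small synthetic dataset (20 samples, 300 features) in place of the 6394-feature QSAR data, and every function name and parameter is hypothetical. GRP draws the projection matrix from a Gaussian distribution; SRP is shown here in the Achlioptas form, where most entries are zero. In both cases pairwise Euclidean distances are expected to be approximately preserved after projection.

```python
import math
import random


def gaussian_projection_matrix(n_features, n_components, rng):
    """GRP: entries drawn i.i.d. from N(0, 1 / n_components)."""
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, scale) for _ in range(n_components)]
            for _ in range(n_features)]


def sparse_projection_matrix(n_features, n_components, rng, s=3.0):
    """SRP (Achlioptas-style): entries are +v, 0, -v with
    probabilities 1/(2s), 1 - 1/s, 1/(2s), where v = sqrt(s / n_components)."""
    value = math.sqrt(s / n_components)
    matrix = []
    for _ in range(n_features):
        row = []
        for _ in range(n_components):
            u = rng.random()
            if u < 1.0 / (2.0 * s):
                row.append(value)
            elif u < 1.0 / s:
                row.append(-value)
            else:
                row.append(0.0)
        matrix.append(row)
    return matrix


def project(X, R):
    """Compute X @ R using plain lists (rows of X, columns of R)."""
    columns = list(zip(*R))
    return [[sum(x * r for x, r in zip(row, col)) for col in columns]
            for row in X]


def mean_distance_ratio(X, Z):
    """Average ratio of projected to original pairwise distances."""
    ratios = []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            ratios.append(math.dist(Z[i], Z[j]) / math.dist(X[i], X[j]))
    return sum(ratios) / len(ratios)


# Synthetic stand-in for a descriptor matrix (assumption: not real QSAR data).
rng = random.Random(0)
n_samples, n_features, n_components = 20, 300, 100
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]

Z_grp = project(X, gaussian_projection_matrix(n_features, n_components, rng))
Z_srp = project(X, sparse_projection_matrix(n_features, n_components, rng))

ratio_grp = mean_distance_ratio(X, Z_grp)
ratio_srp = mean_distance_ratio(X, Z_srp)
print(f"GRP mean distance ratio: {ratio_grp:.3f}")
print(f"SRP mean distance ratio: {ratio_srp:.3f}")
```

A mean ratio close to 1 indicates that the low-dimensional representation keeps the geometry of the original feature space. The sparse variant reaches a similar result while leaving roughly two thirds of the projection matrix at zero for s = 3, which is what makes it cheaper to apply to very wide descriptor matrices.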
