A Review of Feature Reduction Methods for QSAR-Based Toxicity Prediction

Thousands of molecular descriptors (1D to 4D) can be generated and used as features to model quantitative structure–activity or structure–toxicity relationships (QSAR or QSTR) for chemical toxicity prediction. The resulting models often suffer from the “curse of dimensionality”, a problem that arises in machine learning when too many features are used to train a model. Here we discuss methods for eliminating redundant and irrelevant features in order to enhance prediction performance, increase interpretability, and reduce computational complexity. We summarize several feature selection and feature extraction methods along with their strengths and shortcomings. We also highlight some commonly overlooked challenges, such as algorithm instability and selection bias, and offer possible solutions.
