Artificial intelligence paradigm for ligand-based virtual screening on the drug discovery of type 2 diabetes mellitus

Background: New dipeptidyl peptidase-4 (DPP-4) inhibitors with low adverse effects are needed for the treatment of type 2 diabetes mellitus. This study builds quantitative structure-activity relationship (QSAR) models using an artificial intelligence paradigm. Rotation Forest and a Deep Neural Network (DNN) are used to build the QSAR models. We compared principal component analysis (PCA) with sparse PCA (SPCA) as the transformation method in Rotation Forest. K-modes clustering with Levenshtein distance was used to select molecules, and CatBoost was used for feature selection.

Results: The molecule selection step using the K-modes clustering algorithm yielded 1020 DPP-4 inhibitor molecules with logP values ranging from -1.6693 to 4.99044. Extended connectivity fingerprints and functional class fingerprints with diameters of 4 and 6 were used to construct four fingerprint datasets: ECFP_4, ECFP_6, FCFP_4, and FCFP_6. The 1024 features of the four fingerprint datasets were then selected using CatBoost. With CatBoost feature selection, QSAR models built with both machine learning and deep learning methods performed well: Sensitivity, Specificity, Accuracy, and the Matthews correlation coefficient were all above 70% at feature importance levels of 60%, 70%, 80%, and 90%.

Conclusion: The K-modes clustering algorithm can produce a representative subset of DPP-4 inhibitor molecules. Feature selection on the fingerprint datasets using CatBoost is best applied before building QSAR classification and QSAR regression models. QSAR classification using machine learning and QSAR classification using deep learning each achieve an accuracy above 70%. The QSAR RFC-PCA and QSAR RFR-PCA models performed better than the QSAR RFC-SPCA and QSAR RFR-SPCA models because they are more time-efficient.
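
The four evaluation metrics named in the Results (Sensitivity, Specificity, Accuracy, and the Matthews correlation coefficient) can all be computed from a binary confusion matrix. The following is a minimal sketch of those standard formulas, not the authors' code: the function names and the toy labels are hypothetical, and encoding active molecules as 1 and inactive molecules as 0 is an assumption.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (assumed: 1 = active, 0 = inactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Return sensitivity, specificity, accuracy, and MCC as fractions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)

    # Sensitivity: share of true actives that were predicted active.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of true inactives that were predicted inactive.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Accuracy: share of all predictions that were correct.
    accuracy = (tp + tn) / len(y_true)
    # MCC: correlation between true and predicted labels, in [-1, 1].
    # It is set to 0.0 here when the denominator vanishes (a common convention).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the study.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
    for name, value in classification_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.2f}")
```

The sketch returns fractions rather than percentages. Unlike the other three metrics, the Matthews correlation coefficient ranges from -1 to 1, so a threshold such as "above 70%" would correspond to a value above 0.70 on that scale.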
