A novel molecular descriptor selection method in QSAR classification model based on weighted penalized logistic regression

Molecular descriptor selection is a pivotal tool for quantitative structure–activity relationship (QSAR) modeling. This paper proposes a novel molecular descriptor selection method that takes into account the group type to which each descriptor belongs. The method combines penalized logistic regression with the 2‐sample t test, so that filtering and weighting are performed simultaneously. Specifically, the 2‐sample t test is employed as a filter that removes descriptors that do not show a statistically significant difference. A weighted penalized logistic regression is then fitted, with each descriptor assigned a weight that depends on its 2‐sample t test value within its descriptor type block. The proposed method is experimentally tested and compared with state‐of‐the‐art selection methods. The results show that the proposed method is simpler and faster than the compared methods while achieving efficient classification performance.
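The filter‐then‐weight idea described above could be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not the authors' implementation: the abstract does not give the weighting formula, the form of the penalty, or the solver. Here the t statistic is computed in its Welch form, descriptors are kept when |t| exceeds a hypothetical critical value `t_crit`, the weight of a kept descriptor is assumed to be the largest |t| in its block divided by its own |t|, the penalty is assumed to be a weighted L1 norm, and the model is fitted by proximal gradient descent. All function names and the synthetic data are hypothetical.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t statistic (Welch form, an assumption) for two lists."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def filter_and_weight(X, y, blocks, t_crit=2.0):
    """Filter columns by |t| > t_crit, then weight the kept ones per block.

    blocks[j] is the descriptor-type label of column j. The weight
    (block maximum |t| divided by own |t|) is an assumed formula.
    """
    n_cols = len(X[0])
    abs_t = []
    for j in range(n_cols):
        active = [row[j] for row, label in zip(X, y) if label == 1]
        inactive = [row[j] for row, label in zip(X, y) if label == 0]
        abs_t.append(abs(welch_t(active, inactive)))
    keep = [j for j in range(n_cols) if abs_t[j] > t_crit]
    block_max = {}
    for j in keep:
        block_max[blocks[j]] = max(block_max.get(blocks[j], 0.0), abs_t[j])
    weights = [block_max[blocks[j]] / abs_t[j] for j in keep]
    return keep, weights


def soft_threshold(z, thr):
    """Proximal operator of thr * |z|."""
    return math.copysign(max(abs(z) - thr, 0.0), z)


def fit_weighted_l1_logistic(X, y, weights, lam=0.05, step=0.1, iters=500):
    """Minimise mean log-loss + lam * sum_j w_j * |beta_j| by proximal gradient."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        g0, g = 0.0, [0.0] * p
        for row, label in zip(X, y):
            z = b0 + sum(bj * xj for bj, xj in zip(beta, row))
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            resid = 1.0 / (1.0 + math.exp(-z)) - label
            g0 += resid / n
            for j in range(p):
                g[j] += resid * row[j] / n
        b0 -= step * g0  # the intercept is not penalised
        beta = [soft_threshold(bj - step * gj, step * lam * wj)
                for bj, gj, wj in zip(beta, g, weights)]
    return b0, beta


# Synthetic data: columns 0 and 1 separate the classes, columns 2 and 3 are noise.
random.seed(0)
y = [1] * 20 + [0] * 20
shifts = [3.0, 1.5, 0.0, 0.0]
X = [[random.gauss(s * label, 1.0) for s in shifts] for label in y]
blocks = ["topological", "topological", "constitutional", "constitutional"]

keep, weights = filter_and_weight(X, y, blocks)
X_kept = [[row[j] for j in keep] for row in X]
b0, beta = fit_weighted_l1_logistic(X_kept, y, weights)
print(dict(zip(keep, beta)))  # kept column index -> fitted coefficient
```

Under the assumed weighting, the most discriminating descriptor in each block receives weight 1 and is penalized least, while weaker descriptors in the same block are shrunk more strongly. Other monotone functions of the t statistic would serve the same purpose.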
