Hybrid feature selection and peptide binding affinity prediction using an EDA based algorithm

Protein function prediction is an important problem in functional genomics. Typically, protein sequences are represented by feature vectors. A major problem with protein datasets is their large number of features, which increases the complexity of classification models. Drug discovery often relies on quantitative structure-activity relationship (QSAR) models to identify chemical structures that are likely to inhibit specific targets while showing low toxicity (non-specific activity). QSAR models are regression or classification models used in the chemical and biological sciences. Because these datasets are high-dimensional, feature selection becomes necessary. In this study, we therefore employ a hybrid Estimation of Distribution Algorithm (EDA) based filter-wrapper methodology to simultaneously extract informative feature subsets and build robust QSAR models. The performance of the algorithm was tested on the benchmark classification challenge datasets from the CoEPrA competition platform, developed in 2006. Our results demonstrate the efficacy of the hybrid EDA filter-wrapper algorithm in comparison with previously reported results.
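To make the filter-wrapper idea concrete, the following is a minimal sketch of how an EDA can drive feature selection. It is not the implementation used in this study. It assumes a correlation-based filter, a univariate (UMDA-style) probability model over binary feature masks, and a nearest-centroid classifier as the wrapper; all function names and parameters (`filter_rank`, `centroid_accuracy`, `eda_select`, `keep`, `pop`, `elite`, `gens`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def filter_rank(X, y, keep):
    """Filter step (assumed): keep the features most correlated with the label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    score = np.abs(Xc.T @ yc) / denom
    return np.argsort(score)[::-1][:keep]

def centroid_accuracy(X_tr, y_tr, X_va, y_va):
    """Wrapper fitness (assumed): validation accuracy of a nearest-centroid classifier."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = ((X_va - c0) ** 2).sum(axis=1)
    d1 = ((X_va - c1) ** 2).sum(axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y_va).mean())

def eda_select(X_tr, y_tr, X_va, y_va, keep=20, pop=40, elite=10, gens=15, seed=0):
    """Search over feature subsets with a univariate EDA on the filtered candidates."""
    rng = np.random.default_rng(seed)
    cand = filter_rank(X_tr, y_tr, keep)
    p = np.full(len(cand), 0.5)          # inclusion probability per candidate feature
    best_mask, best_fit = None, -1.0
    for _ in range(gens):
        # Sample a population of binary masks from the current distribution.
        masks = rng.random((pop, len(cand))) < p
        masks[masks.sum(axis=1) == 0, 0] = True   # avoid empty subsets
        fits = np.array([
            centroid_accuracy(X_tr[:, cand[m]], y_tr, X_va[:, cand[m]], y_va)
            for m in masks
        ])
        order = np.argsort(fits)[::-1]
        if fits[order[0]] > best_fit:
            best_fit, best_mask = float(fits[order[0]]), masks[order[0]].copy()
        # Re-estimate the distribution from the elite masks; clip to keep diversity.
        p = np.clip(masks[order[:elite]].mean(axis=0), 0.05, 0.95)
    return cand[best_mask], best_fit
```

The filter step shrinks the search space cheaply, and the EDA then replaces crossover and mutation with sampling from an explicitly estimated distribution. In practice the wrapper would be whichever classifier or regressor the QSAR model uses, evaluated with cross-validation rather than a single validation split.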
