Evolutionary based optimal ensemble classifiers for HIV-1 protease cleavage sites prediction

Abstract: Knowledge of HIV-1 protease cleavage sites helps in understanding substrate specificity, which in turn supports the design of inhibitors that counter AIDS by restraining HIV-1 replication. Identifying HIV-1 protease cleavage sites by experimental methods is usually labor-intensive and therefore time-consuming. Several computational intelligence methods have been evaluated for predicting cleavage sites. However, findings on the superiority of one classifier over another, and on the usefulness of encoding techniques in general, are inconsistent, so more research is needed to provide greater confidence in computational results. The success of an HIV cleavage site prediction system depends heavily on two things: the classifier used and the feature encoding technique applied. The role of appropriate feature encoding in cleavage site identification has not received adequate attention. In this investigation, we use two novel ideas for HIV cleavage site prediction. First, we propose an optimal ensemble formation technique that uses a genetic algorithm to search the space of 2^28 candidate ensembles formed by seven encoding techniques and four SVM kernels (7 × 4 = 28 encoding-classifier pairs). Second, we use the area under the receiver operating characteristic curve (AUC) as the fitness measure for evaluating candidate ensembles. The evolutionary algorithm is encoded with binary strings to decide the correlation between the encoding-classifier pairs in an ensemble. The proposed method, with its new ensemble of encoding-classifier pairs, significantly improves HIV cleavage site prediction. Overall, the evolutionary-based ensemble model achieves an appealing degree of predictive accuracy and therefore offers a valid and effective alternative for peptide classification.
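The ensemble search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the encoding and kernel names are placeholders (the abstract does not name them), the member scores are synthetic stand-ins for trained SVM outputs, and the mean-score combination rule, tournament selection, one-point crossover, and bit-flip mutation are all assumptions chosen for brevity. Each chromosome is a 28-bit string whose bits switch encoding-kernel pairs on or off, and its fitness is the AUC of the resulting ensemble.

```python
import random
from itertools import product

# Hypothetical placeholders: the abstract does not name the encodings or kernels.
ENCODINGS = [f"encoding_{i}" for i in range(7)]
KERNELS = [f"kernel_{j}" for j in range(4)]
PAIRS = list(product(ENCODINGS, KERNELS))  # 7 x 4 = 28 candidate members


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count 0.5. Assumes both classes are present in `labels`.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_scores(chromosome, member_scores):
    """Average the scores of members whose bit is 1 (mean rule is an assumption)."""
    chosen = [m for bit, m in zip(chromosome, member_scores) if bit]
    return [sum(col) / len(chosen) for col in zip(*chosen)]


def fitness(chromosome, member_scores, labels):
    """AUC of the selected ensemble; an empty ensemble gets fitness 0."""
    if not any(chromosome):
        return 0.0
    return auc(labels, ensemble_scores(chromosome, member_scores))


def evolve(member_scores, labels, pop_size=20, generations=20, p_mut=0.02, seed=0):
    """Simple elitist GA over bit strings; operators are illustrative choices."""
    rng = random.Random(seed)
    n = len(member_scores)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]

    def score(c):
        return fitness(c, member_scores, labels)

    best = max(pop, key=score)
    for _ in range(generations):
        nxt = [best[:]]  # elitism: carry the best chromosome forward unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, 2), key=score)  # binary tournament
            b = max(rng.sample(pop, 2), key=score)
            cut = rng.randrange(1, n)               # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # bit flips
            nxt.append(child)
        pop = nxt
        best = max(pop, key=score)
    return best, score(best)


def synthetic_member_scores(labels, n_members, seed=1):
    """Synthetic stand-in for SVM decision values: label signal plus Gaussian noise."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        strength = rng.uniform(0.0, 1.5)
        members.append([strength * y + rng.gauss(0.0, 1.0) for y in labels])
    return members


if __name__ == "__main__":
    labels = [0, 1] * 50
    members = synthetic_member_scores(labels, len(PAIRS))
    best, best_auc = evolve(members, labels)
    selected = [PAIRS[i] for i, bit in enumerate(best) if bit]
    print("search space size:", 2 ** len(PAIRS))
    print("selected pairs:", len(selected), "ensemble AUC:", round(best_auc, 3))
```

In a real pipeline the synthetic scores would be replaced by held-out decision values from each trained encoding-kernel SVM, and the fitness would be computed on validation folds rather than on the data used for training.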
