A Consistency-Based Feature Selection Method Allied with Linear SVMs for HIV-1 Protease Cleavage Site Prediction

Background: Predicting type-1 Human Immunodeficiency Virus (HIV-1) protease cleavage sites in protein molecules and determining their specificity is an important task that has attracted considerable attention in the research community. Achievements in this area are expected to lead to effective drug design (especially of HIV-1 protease inhibitors) against this life-threatening virus. However, drawbacks such as the shortage of available training data and the high dimensionality of the feature space make this a difficult classification problem. Various machine learning techniques, and specifically several classification methods, have therefore been proposed to increase the accuracy of the classification model. In addition, for classification problems characterized by few samples and many features, selecting the most relevant features is a major factor in increasing classification accuracy.

Results: For HIV-1 data, we propose a consistency-based feature selection approach in conjunction with recursive feature elimination of support vector machines (SVMs). We used various classifiers to evaluate the results of the feature selection process. We further demonstrated the effectiveness of the proposed method by comparing it with a state-of-the-art feature selection method applied to HIV-1 data, and we evaluated the reported results based on attributes selected from different combinations.

Conclusion: Applying feature selection to the training data before performing the classification task seems to be a reasonable data-mining process when working with data similar to HIV-1. On HIV-1 data, several feature selection and extraction operations have been tested in conjunction with different classifiers, and noteworthy outcomes have been reported. These facts motivate the work presented in this paper.

Software availability: The software is available at http://ozyer.etu.edu.tr/c-fs-svm.rar.
The software can be downloaded at esnag.etu.edu.tr/software/hiv_cleavage_site_prediction.rar; you will find a readme file that explains how to set up the software.
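
To make the consistency criterion named in the Results more concrete, the following is a minimal Python sketch of the inconsistency rate commonly used in consistency-based feature selection. It is not the authors' implementation: the function name, the toy octapeptides, their labels, and the use of raw residues at each position as features are all assumptions made for illustration, and the exact criterion used in the paper is not given here. A feature subset is treated as better when samples that agree on the selected features also agree on the class label.

```python
from collections import Counter, defaultdict


def inconsistency_rate(rows, labels, feature_idx):
    """Fraction of samples whose label differs from the majority label
    among samples sharing the same values on the selected features."""
    feature_idx = list(feature_idx)
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        pattern = tuple(row[i] for i in feature_idx)
        groups[pattern][label] += 1
    # For each pattern, every sample outside the majority class is inconsistent.
    inconsistent = sum(
        sum(counts.values()) - max(counts.values()) for counts in groups.values()
    )
    return inconsistent / len(rows)


# Hypothetical toy data (NOT taken from the HIV-1 data set):
# each octapeptide is one sample, each position is one feature,
# label 1 = cleaved, label 0 = not cleaved.
peptides = ["SQNYPIVQ", "SQNYPIVR", "AQNFPIVQ", "AQNFPLVQ"]
labels = [1, 0, 1, 0]
rows = [list(p) for p in peptides]

rate_all = inconsistency_rate(rows, labels, range(8))  # all eight positions
rate_single = inconsistency_rate(rows, labels, [3])    # fourth position only
rate_pair = inconsistency_rate(rows, labels, [5, 7])   # sixth and eighth positions

print(rate_all, rate_single, rate_pair)
```

On this toy data the pair of positions separates the two classes as consistently as all eight positions do, while the single position does not. A consistency-based search looks for small subsets of that kind; the surviving features could then be ranked further, for example by recursive feature elimination with a linear SVM.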
