Feature Extraction Using Clustering of Protein

In this paper we investigate the use of a clustering algorithm as a feature extraction technique for finding new features to represent protein sequences. In particular, our work focuses on predicting HIV protease resistance to drugs. We use a biologically motivated similarity function based on the contact energy of each amino acid and its position in the sequence. The performance measure takes into account both clustering reliability and classification validity. The k-means algorithm was used for clustering, and an SVM evaluated with 10-fold cross-validation was used for classification. The best results were obtained by reducing the initial set of 99 features to a lower-dimensional set of 36 to 66 features.
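The feature extraction step can be illustrated with a minimal sketch. It is not the implementation used in this work: the data are synthetic, the names `kmeans` and `extract_features` are hypothetical, and plain squared Euclidean distance between position profiles stands in for the contact-energy and position-based similarity function, which is not specified here. Each of the 99 sequence positions is treated as one point, the positions are grouped with k-means, and every group is averaged into a single new feature.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain k-means (Lloyd's algorithm); returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        # (an assumption; the paper uses a biologically motivated similarity).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        converged = new_labels == labels
        labels = new_labels
        # Update step: move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if converged:
            break
    return labels


def extract_features(X, k, seed=0):
    """Cluster the columns of X and average each cluster into one new feature."""
    columns = [list(col) for col in zip(*X)]  # one point per sequence position
    labels = kmeans(columns, k, seed=seed)
    used = sorted(set(labels))  # clusters that ended up non-empty
    reduced = [
        [sum(v for v, lab in zip(row, labels) if lab == c) / labels.count(c)
         for c in used]
        for row in X
    ]
    return reduced, labels


# Synthetic stand-in data: 20 sequences, 99 positions, one made-up
# energy-like value per position (not real contact energies).
rng = random.Random(1)
X = [[rng.uniform(-5.0, 0.0) for _ in range(99)] for _ in range(20)]

reduced, labels = extract_features(X, k=36)
print(len(X[0]), "->", len(reduced[0]), "features")
```

In a full pipeline, the reduced matrix would then be passed to a classifier such as an SVM and scored with 10-fold cross-validation, and the number of clusters would be varied to find the range that performs best.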
