Feature Selection for Improved Classification of Protein Structures

Many research groups analyze the structures of protein molecules, since the resulting knowledge can support drug design. Understanding protein structures requires assigning them to the appropriate classes, which makes protein classification one of the main topics in bioinformatics. In this paper, we propose an approach for classifying protein structures. First, the characteristics of the proteins are extracted into feature vectors. Then, feature selection is applied to reduce the dimensionality of the dataset and to retain only the most significant features; several feature selection techniques are used for this step. Finally, we build models using different classification methods. The proposed approach is evaluated in detail, and the benefits of applying feature selection are analyzed.
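
The three stages described above (feature vectors, feature selection, classification) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic data from `make_toy_data` stands in for real protein descriptors, a Fisher-style score stands in for whichever selection techniques the authors used, and a 1-nearest-neighbour classifier with leave-one-out evaluation stands in for their classification methods.

```python
import math
import random


def fisher_score(column, labels):
    """Ratio of between-class to within-class scatter for one feature."""
    overall_mean = sum(column) / len(column)
    between, within = 0.0, 0.0
    for c in set(labels):
        values = [v for v, lab in zip(column, labels) if lab == c]
        mean_c = sum(values) / len(values)
        between += len(values) * (mean_c - overall_mean) ** 2
        within += sum((v - mean_c) ** 2 for v in values)
    return between / within if within > 0 else 0.0


def select_top_k(X, y, k):
    """Return the (sorted) indices of the k highest-scoring features."""
    n_features = len(X[0])
    scores = [fisher_score([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def project(X, keep):
    """Keep only the selected feature columns."""
    return [[row[j] for j in keep] for row in X]


def loo_accuracy_1nn(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, query in enumerate(X):
        best_label, best_dist = None, math.inf
        for j, other in enumerate(X):
            if i == j:
                continue
            d = math.dist(query, other)  # Euclidean distance
            if d < best_dist:
                best_dist, best_label = d, y[j]
        correct += best_label == y[i]
    return correct / len(X)


def make_toy_data(n_per_class=30, n_noise=8, seed=0):
    """Hypothetical data: 2 informative features followed by noisy ones."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in (("A", 0.0), ("B", 3.0)):
        for _ in range(n_per_class):
            informative = [rng.gauss(centre, 0.5), rng.gauss(-centre, 0.5)]
            noise = [rng.gauss(0.0, 5.0) for _ in range(n_noise)]
            X.append(informative + noise)
            y.append(label)
    return X, y


X, y = make_toy_data()
keep = select_top_k(X, y, k=2)
acc_all = loo_accuracy_1nn(X, y)
acc_selected = loo_accuracy_1nn(project(X, keep), y)
print("selected feature indices:", keep)
print("accuracy with all features:", round(acc_all, 3))
print("accuracy with selected features:", round(acc_selected, 3))
```

For brevity, the sketch selects features once on the full dataset. A careful evaluation would repeat the selection inside each training fold, so that the held-out example never influences which features are kept.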