Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics

Bioinformatics has been an emerging area of research for the last three decades. Its ultimate aims are to store and manage biological data and to develop and analyze computational tools that enhance the understanding of those data. The volume of data accumulated by sequencing projects is increasing exponentially, which presents difficulties for experimental methods. To narrow the gap between newly sequenced proteins and proteins with known functions, many computational techniques based on classification and clustering algorithms have been proposed. Classifying protein sequences into existing superfamilies helps predict the structure and function of the large number of newly discovered proteins. Existing classification results are unsatisfactory because the feature vectors produced by the various feature encoding methods are very large. In this work, a statistical metric-based feature selection technique is proposed to reduce the size of the extracted feature vector. The proposed protein classification method shows significant improvement on performance metrics including accuracy, sensitivity (recall), specificity, and F-measure.
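The idea of statistical metric-based feature selection can be illustrated with a small sketch. The abstract names neither the encoding nor the metric, so everything below is an assumption made for illustration: sequences are encoded as overlapping 2-gram frequencies (400 features), each feature is ranked by a Fisher-style score for a two-class problem, and only the top `k` features are kept. The function names `encode`, `fisher_score`, and `select_features` are hypothetical, and both classes are assumed to be non-empty.

```python
from collections import Counter
from itertools import product
from statistics import mean, pvariance

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed encoding: all 20 x 20 = 400 amino acid 2-grams.
BIGRAMS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def encode(seq):
    """Return the relative frequency of each overlapping 2-gram in seq."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return [counts[b] / total for b in BIGRAMS]


def fisher_score(pos, neg, eps=1e-12):
    """Squared mean difference over summed class variances.

    Assumed metric (not taken from the paper): a higher score means the
    feature separates the two classes better. eps avoids division by zero.
    """
    return (mean(pos) - mean(neg)) ** 2 / (pvariance(pos) + pvariance(neg) + eps)


def select_features(X, y, k):
    """Return the indices of the k highest-scoring features (labels 0 or 1)."""
    scored = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scored.append((fisher_score(pos, neg), j))
    # Sort by descending score; break ties by feature index for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scored[:k]]


# Toy sequences, invented for illustration only.
sequences = ["AAAAAA", "AAAAAC", "GGGGGG", "GGGGGC"]
labels = [1, 1, 0, 0]
X = [encode(s) for s in sequences]
selected = select_features(X, labels, k=2)
reduced = [[row[j] for j in selected] for row in X]
print([BIGRAMS[j] for j in selected])
```

The reduced vectors in `reduced` would then be passed to any classifier. This kind of filter ranks each feature independently of the classifier, which keeps it cheap but ignores interactions between features.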
