Protein Classification using Machine Learning and Statistical Techniques: A Comparative Analysis

Predicting the enzyme class of an unknown protein is one of the challenging tasks in bioinformatics. The number of known proteins grows daily, which makes enzyme class prediction an open opportunity for bioinformatics researchers. The primary objective of this article is to apply machine learning classification techniques to feature selection and prediction, and to identify an appropriate classification technique for function prediction. Seven classification techniques, CRT, QUEST, CHAID, C5.0, ANN (Artificial Neural Network), SVM, and Bayesian, were applied to 4368 protein records extracted from the UniProtKB databank and categorized into six classes. The protein data are high-dimensional sequence data containing a maximum of 48 features. SPSS was used as the experimental tool for handling the high-dimensional sequential protein data with the different classification techniques. The techniques give different results for every model and show that the data are imbalanced for classes C4, C5, and C6. This imbalance degrades model performance: in these three classes the precision and recall values are very low or negligible. The experimental results indicate that the C5.0 classification technique is the best suited for protein feature classification and prediction, giving 95.56% accuracy along with high precision and recall values. Finally, we conclude that the selected features can be used for function prediction.
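The interplay between overall accuracy and per-class precision and recall on imbalanced classes can be illustrated with a minimal sketch. The labels below are made-up toy data, not the study's protein records, and the helper name `per_class_metrics` is hypothetical; the study itself used SPSS rather than Python.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and a {label: (precision, recall)} mapping."""
    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # actual examples per class
    pred_count = Counter(y_pred)   # predicted examples per class
    tp = Counter()                 # correct predictions per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
    accuracy = sum(tp.values()) / len(y_true)
    metrics = {}
    for label in labels:
        # Define the metric as 0.0 when its denominator is zero.
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        metrics[label] = (precision, recall)
    return accuracy, metrics


# Toy data (assumption): two well-represented classes and one minority class
# that the classifier never predicts.
y_true = ["C1"] * 8 + ["C2"] * 8 + ["C6"] * 4
y_pred = ["C1"] * 8 + ["C2"] * 8 + ["C1"] * 2 + ["C2"] * 2

accuracy, metrics = per_class_metrics(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}")
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy case the overall accuracy stays fairly high while the minority class has zero precision and recall, which is why per-class metrics are worth reporting alongside accuracy when classes are imbalanced.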
