Intelligent computational model for classification of sub-Golgi protein using oversampling and Fisher feature selection methods

The Golgi apparatus is one of the core organelles of the cell, present in both plants and animals, and is involved in the processing of newly synthesized proteins. It receives and processes macromolecules and traffics newly processed proteins to their intended destinations. Dysfunction of Golgi proteins is expected to cause many neurodegenerative and inherited diseases, which may be cured more readily if they are detected effectively and in time. Golgi proteins are categorized into two types: cis-Golgi and trans-Golgi. Identifying Golgi proteins by direct methods is very hard because few recognized structures are available. Researchers have therefore shifted their attention from structures to sequences. Owing to technological advances, however, a huge number of sequences have been deposited in the databases, and recognizing such a large amount of unprocessed data with conventional methods is very difficult. For this reason, the concept of intelligence was incorporated into computational models. Intelligence-based computational models have obtained reasonable results, but there is still room for improvement. In this regard, an intelligent automatic recognition model is developed to enhance the true classification rate of sub-Golgi proteins. In this approach, discrete and evolutionary feature extraction methods are applied to the benchmark Golgi protein datasets to extract salient, profound, and variant numerical descriptors. An oversampling technique, the Synthetic Minority Over-sampling Technique (SMOTE), is then employed to balance the data. Hybrid spaces are also generated by combining these feature spaces. Further, the Fisher feature selection method is used to remove noisy and redundant features from the feature vector. Finally, the k-nearest neighbor algorithm is used as the learning hypothesis. Three distinct cross-validation tests are used to examine the stability and efficiency of the proposed model. The predicted outcomes of the proposed model are better than those of the existing models reported in the literature so far. It is anticipated that the proposed model will provide a foundation for the pharmaceutical industry in drug design and help the research community develop new ideas in computational biology and bioinformatics.
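The balancing, selection, and classification steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the toy two-feature data, the helper names (`smote`, `fisher_scores`, `knn_predict`), the neighbor counts, and the particular two-class form of the Fisher score, (difference of class means)^2 / (sum of class variances), are all assumptions made for illustration. The feature extraction and cross-validation steps are omitted.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def smote(minority, n_new, k=2, seed=0):
    """SMOTE-style oversampling: interpolate between a minority sample
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: euclidean(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def fisher_scores(X, y):
    """Assumed two-class Fisher score per feature:
    (mean_0 - mean_1)^2 / (var_0 + var_1)."""
    scores = []
    for j in range(len(X[0])):
        stats = []
        for c in (0, 1):
            vals = [row[j] for row, label in zip(X, y) if label == c]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            stats.append((mean, var))
        (m0, v0), (m1, v1) = stats
        scores.append((m0 - m1) ** 2 / (v0 + v1 + 1e-12))
    return scores


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training samples."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
majority = [[0.10, 5.0], [0.20, 1.0], [0.00, 3.0], [0.30, 4.0], [0.15, 2.0], [0.25, 0.5]]
minority = [[0.90, 4.5], [1.00, 1.5], [0.95, 3.0]]

# 1) Balance the classes by adding synthetic minority samples.
minority_balanced = minority + smote(minority, len(majority) - len(minority))
X = majority + minority_balanced
y = [0] * len(majority) + [1] * len(minority_balanced)

# 2) Rank features by Fisher score and keep the single best one.
scores = fisher_scores(X, y)
keep = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:1]
X_sel = [[row[j] for j in keep] for row in X]

# 3) Classify new samples (already reduced to the kept feature) with k-NN.
pred_minority = knn_predict(X_sel, y, [0.92])
pred_majority = knn_predict(X_sel, y, [0.05])
```

In this sketch the oversampling is applied once to the whole toy set for brevity. When such a pipeline is evaluated with cross-validation, synthetic samples are usually generated inside each training fold only, so that no interpolated copy of a test sample leaks into training.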
