Novel Tree-Based Proximity Search with SMOTE and Compositional Indexing Techniques for Protein Domain Identification

Proteins are generally long chains that consist of multiple domains. Domains are distinct structural units of a protein that can evolve and function independently. The accurate and reliable identification of protein domains is a fundamental step in protein tertiary structure prediction. It not only gives insight into how proteins work, and thereby supports medicine and drug development, but also reduces the experimental cost of protein structure determination by allowing researchers to work on a set of smaller and more informative units. In this work, we introduce a novel domain identification approach based on protein primary structure information only. We propose a novel tree-based proximity search model trained on the amino acid compositional index and physicochemical properties and enhanced by SMOTE (Synthetic Minority Over-sampling Technique).
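As a rough illustration of how the two ingredients named above can fit together, the following sketch oversamples a minority class by SMOTE-style interpolation and then classifies a query with a nearest-neighbour lookup in a k-d tree. It is a minimal sketch under stated assumptions, not the model proposed in this work: the k-d tree is assumed here as one common tree-based proximity search structure, the two-dimensional feature vectors are made-up placeholders for the compositional index and physicochemical features, and the helper names (`build_kdtree`, `nearest`, `smote`) are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_kdtree(points, depth=0):
    """Build a k-d tree from a list of (vector, label) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2                    # median point becomes the node
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Return (squared_distance, (vector, label)) of the closest stored point."""
    if node is None:
        return best
    point, axis, left, right = node
    d = sq_dist(point[0], query)
    if best is None or d < best[0]:
        best = (d, point)
    diff = query[axis] - point[0][axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff ** 2 < best[0]:
        best = nearest(far, query, best)
    return best

def smote(minority, n_new, k=2, rng=None):
    """Create n_new synthetic vectors by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not base),
                            key=lambda m: sq_dist(m, base))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()                    # position along the segment
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical toy data: the majority class is "domain", the minority "linker".
domain = [((0.1, 0.2), "domain"), ((0.2, 0.1), "domain"),
          ((0.0, 0.3), "domain"), ((0.3, 0.3), "domain")]
linker_vectors = [(0.9, 0.8), (0.8, 0.9)]

synthetic = smote(linker_vectors, n_new=2)    # balance the classes at 4 vs 4
linker = [(v, "linker") for v in linker_vectors + synthetic]

tree = build_kdtree(domain + linker)
_, (_, predicted) = nearest(tree, (0.85, 0.85))
print(predicted)
```

Oversampling before building the tree means that a query near the minority region has more stored neighbours of that class to match against, which is the usual motivation for pairing SMOTE with a proximity-based classifier.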
