Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone

Background: Computational prediction of protein function is one of the more complex problems in bioinformatics because of the diversity of functions and mechanisms that proteins exert in nature. The problem is especially acute for proteins that share very low primary or tertiary structure similarity with existing annotated proteomes. New alignment-free (AF) tools are therefore needed to overcome the inherent limitations of classic alignment-based approaches. We have recently introduced AF protein-numerical-encoding programs (TI2BioP and ProtDCal), whose sequence-based features have been successfully applied to detect remote protein homologs, post-translational modifications and antibacterial peptides. Here we aim to demonstrate the applicability of four AF protein descriptor families, implemented in our programs, to the identification of enzyme-like proteins. At the same time, our novel family of 3D-structure-based descriptors is introduced for the first time. The Dobson & Doig (D&D) benchmark dataset is used to evaluate our AF protein descriptors because its proven structural diversity makes it possible to emulate an experiment within the twilight zone of alignment-based methods (pair-wise identity <30%). The performance of our sequence-based predictor was further assessed on a subset of formerly uncharacterized proteins that currently represents a benchmark annotation dataset.

Results: Four protein descriptor families, namely sequence-composition-based (0D), linear-topology-based (1D), pseudo-fold-topology-based (2D) and 3D-structure-based (3D) features, were assessed using the D&D benchmark dataset. We show that only the ProtDCal descriptor families (0D, 1D and 3D) encode significant information for discriminating enzymes from non-enzymes. The resulting 3D-structure-based classifier ranked first among several other SVM-based methods assessed on this dataset. Furthermore, the model leveraging 1D descriptors showed a higher success rate than EzyPred on a benchmark annotation dataset from the Shewanella oneidensis proteome.

Conclusions: The applicability of ProtDCal as a general-purpose AF protein modelling method is illustrated through the discrimination between two comprehensive protein functional classes. The performances observed on the highly diverse D&D dataset and on the set of formerly uncharacterized (hard-to-annotate) proteins of Shewanella oneidensis place our methodology in the top range of alignment-free methods for modelling and predicting protein function.
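
To make the idea of a sequence-composition-based (0D) alignment-free feature concrete, the following is a minimal sketch that encodes a protein as its amino-acid composition vector. It is an illustration only and not the ProtDCal descriptor set: the function name `composition_features`, the constant `AMINO_ACIDS` and the example sequence are assumptions introduced here.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (assumed fixed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence):
    """Return the fraction of each standard residue in the sequence.

    Hypothetical helper: a plain amino-acid composition vector, used here
    as the simplest example of an alignment-free, order-independent feature.
    """
    seq = sequence.upper()
    # Count only standard residues; ambiguous symbols such as 'X' are ignored.
    counts = Counter(res for res in seq if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Made-up toy sequence, not taken from any dataset in the paper.
features = composition_features("MKTAYIAKQR")
```

Because the vector has a fixed length of 20 regardless of sequence length, proteins with no detectable alignment can still be compared or passed to a classifier such as an SVM.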
