ProFET: Feature engineering captures high-level protein functions

MOTIVATION: The number of sequenced genomes and proteins is growing at an unprecedented pace. Unfortunately, manual curation and functional knowledge lag behind. Homology-based inference often fails to label proteins that have diverse functions or belong to broad classes. Thus, identifying high-level protein functionality remains challenging. We hypothesize that a universal feature engineering approach, combined with machine learning, can classify high-level functions and unified properties without requiring external databases or alignment.

RESULTS: In this study, we present a novel bioinformatics toolkit called ProFET (Protein Feature Engineering Toolkit). ProFET extracts hundreds of features covering elementary biophysical and sequence-derived attributes. Most features capture statistically informative patterns. In addition, different representations of sequences and of the amino acid alphabet provide a compact, compressed set of features. The results from ProFET were incorporated into data analysis pipelines, implemented in Python and adapted for multi-genome scale analysis. ProFET was applied to 17 established and novel protein benchmark datasets involving a variety of binary and multi-class classification tasks. The results show state-of-the-art performance. The extracted features show excellent biological interpretability. The success of ProFET applies to a wide range of high-level functions such as subcellular localization, structural classes and proteins with unique functional properties (e.g. neuropeptide precursors, thermophilic proteins and nucleic acid binding proteins). ProFET allows easy, universal discovery of new target proteins, as well as understanding of the features underlying different high-level protein functions.

AVAILABILITY AND IMPLEMENTATION: ProFET source code and the datasets used are freely available at https://github.com/ddofer/ProFET.

CONTACT: michall@cc.huji.ac.il

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
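
To make the idea of alignment-free, sequence-derived features concrete, the following is a minimal Python sketch of two feature families of the kind described above: amino acid composition and dipeptide frequencies over a reduced alphabet. It is not ProFET's implementation. The six-group reduced alphabet, the function names and the feature names are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical 6-letter reduced alphabet (an illustrative grouping,
# NOT the scheme used by ProFET).
REDUCED_GROUPS = {
    "AVLIMC": "h",  # aliphatic / hydrophobic
    "FWYH": "a",    # aromatic
    "STNQ": "p",    # polar, uncharged
    "KR": "b",      # basic
    "DE": "c",      # acidic
    "GP": "s",      # special backbone conformations
}
REDUCED_MAP = {aa: code for group, code in REDUCED_GROUPS.items() for aa in group}
REDUCED_ALPHABET = sorted(set(REDUCED_GROUPS.values()))


def composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(seq)
    return {f"aa_{aa}": counts[aa] / len(seq) for aa in AMINO_ACIDS}


def reduce_sequence(seq):
    """Translate a sequence into the reduced alphabet; unknown residues become 'x'."""
    return "".join(REDUCED_MAP.get(aa, "x") for aa in seq)


def kmer_frequencies(seq, alphabet, k=2):
    """Relative frequency of every possible k-mer over the given alphabet."""
    total = max(len(seq) - k + 1, 0)
    counts = Counter(seq[i:i + k] for i in range(total))
    feats = {}
    for letters in product(alphabet, repeat=k):
        kmer = "".join(letters)
        feats[f"kmer_{kmer}"] = counts[kmer] / total if total else 0.0
    return feats


def extract_features(seq):
    """Build a fixed-length feature dictionary from a raw protein sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    feats = {"length": len(seq)}
    feats.update(composition(seq))
    feats.update(kmer_frequencies(reduce_sequence(seq), REDUCED_ALPHABET, k=2))
    return feats


if __name__ == "__main__":
    features = extract_features("MKKLDE")
    print(len(features), "features")
```

Because every sequence maps to the same fixed set of keys, the resulting dictionaries can be stacked into a matrix and passed to any standard classifier. Reducing the alphabet keeps the dipeptide feature count at 36 rather than 400, which is one way a compact representation can be obtained.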
